/*-------------------------------------------------------------------------
 *
 * autovacuum.c
 *
 * PostgreSQL Integrated Autovacuum Daemon
 *
 * The autovacuum system is structured as two different kinds of processes:
 * the autovacuum launcher and the autovacuum worker.  The launcher is an
 * always-running process, started by the postmaster when the autovacuum GUC
 * parameter is set.  The launcher schedules autovacuum workers to be started
 * when appropriate.  The workers are the processes which execute the actual
 * vacuuming; they connect to a database as determined in the launcher, and
 * once connected they examine the catalogs to select the tables to vacuum.
 *
 * The autovacuum launcher cannot start the worker processes by itself,
 * because doing so would cause robustness issues (namely, failure to shut
 * them down on exceptional conditions; also, since the launcher is
 * connected to shared memory and is thus subject to corruption there, it is
 * not as robust as the postmaster).  So it leaves that task to the
 * postmaster.
 *
 * There is an autovacuum shared memory area, where the launcher stores
 * information about the database it wants vacuumed.  When it wants a new
 * worker to start, it sets a flag in shared memory and sends a signal to the
 * postmaster.  The postmaster then knows nothing more than that it must
 * start a worker; so it forks a new child, which turns into a worker.  This
 * new process connects to shared memory, and there it can inspect the
 * information that the launcher has set up.
 *
 * If the fork() call fails in the postmaster, it sets a flag in the shared
 * memory area, and sends a signal to the launcher.  The launcher, upon
 * noticing the flag, can try starting the worker again by resending the
 * signal.  Note that the failure can only be transient (fork failure due to
 * high load, memory pressure, too many processes, etc); more permanent
 * problems, like failure to connect to a database, are detected later in the
 * worker and dealt with just by having the worker exit normally.  The
 * launcher will launch a new worker again later, per schedule.
 *
 * When the worker is done vacuuming it sends SIGUSR2 to the launcher.  The
 * launcher then wakes up and is able to launch another worker, if the
 * schedule is so tight that a new worker is needed immediately.  At this
 * time the launcher can also rebalance the settings of the remaining
 * workers' cost-based vacuum delay feature.
 *
 * Note that there can be more than one worker in a database concurrently.
 * They will store the table they are currently vacuuming in shared memory,
 * so that other workers avoid being blocked waiting for the vacuum lock for
 * that table.  They will also fetch the last time the table was vacuumed
 * from pgstats just before vacuuming each table, to avoid vacuuming a table
 * that was just vacuumed by another worker and is therefore no longer noted
 * in shared memory.  However, there is a small window (due to not yet
 * holding the relation lock) during which a worker may choose a table that
 * was already vacuumed; this is a bug in the current design.
 *
 * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/postmaster/autovacuum.c
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/multixact.h"
#include "access/reloptions.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "access/xact.h"
#include "catalog/dependency.h"
#include "catalog/namespace.h"
#include "catalog/pg_database.h"
#include "commands/dbcommands.h"
#include "commands/vacuum.h"
#include "common/int.h"
#include "lib/ilist.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "postmaster/fork_process.h"
#include "postmaster/interrupt.h"
#include "postmaster/postmaster.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
#include "tcop/tcopprot.h"
#include "utils/fmgroids.h"
#include "utils/fmgrprotos.h"
#include "utils/guc_hooks.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/timeout.h"
#include "utils/timestamp.h"

/*
 * GUC parameters
 */
bool		autovacuum_start_daemon = false;
int			autovacuum_max_workers;
int			autovacuum_work_mem = -1;
int			autovacuum_naptime;
int			autovacuum_vac_thresh;
double		autovacuum_vac_scale;
int			autovacuum_vac_ins_thresh;
double		autovacuum_vac_ins_scale;
int			autovacuum_anl_thresh;
double		autovacuum_anl_scale;
int			autovacuum_freeze_max_age;
int			autovacuum_multixact_freeze_max_age;

double		autovacuum_vac_cost_delay;
int			autovacuum_vac_cost_limit;

int			Log_autovacuum_min_duration = 600000;

/* the minimum allowed time between two awakenings of the launcher */
#define MIN_AUTOVAC_SLEEPTIME 100.0 /* milliseconds */
#define MAX_AUTOVAC_SLEEPTIME 300	/* seconds */

/* Flags to tell if we are in an autovacuum process */
static bool am_autovacuum_launcher = false;
static bool am_autovacuum_worker = false;

/*
 * Variables to save the cost-related storage parameters for the current
 * relation being vacuumed by this autovacuum worker.  Using these, we can
 * ensure we don't overwrite the values of vacuum_cost_delay and
 * vacuum_cost_limit after reloading the configuration file.  They are
 * initialized to "invalid" values to indicate that no cost-related storage
 * parameters were specified and will be set in do_autovacuum() after
 * checking the storage parameters in table_recheck_autovac().
 */
static double av_storage_param_cost_delay = -1;
static int	av_storage_param_cost_limit = -1;

/* Flags set by signal handlers */
static volatile sig_atomic_t got_SIGUSR2 = false;
|
2007-07-01 20:30:54 +02:00
|
|
|
|
/* Comparison points for determining whether freeze_max_age is exceeded */
static TransactionId recentXid;
static MultiXactId recentMulti;

/* Default freeze ages to use for autovacuum (varies by database) */
static int	default_freeze_min_age;
static int	default_freeze_table_age;
static int	default_multixact_freeze_min_age;
static int	default_multixact_freeze_table_age;

/* Memory context for long-lived data */
static MemoryContext AutovacMemCxt;

/* struct to keep track of databases in launcher */
typedef struct avl_dbase
{
	Oid			adl_datid;		/* hash key -- must be first */
	TimestampTz adl_next_worker;
	int			adl_score;
	dlist_node	adl_node;
} avl_dbase;

/* struct to keep track of databases in worker */
typedef struct avw_dbase
{
	Oid			adw_datid;
	char	   *adw_name;
	TransactionId adw_frozenxid;
	MultiXactId adw_minmulti;
	PgStat_StatDBEntry *adw_entry;
} avw_dbase;

/* struct to keep track of tables to vacuum and/or analyze, in 1st pass */
typedef struct av_relation
{
	Oid			ar_toastrelid;	/* hash key - must be first */
	Oid			ar_relid;
	bool		ar_hasrelopts;
	AutoVacOpts ar_reloptions;	/* copy of AutoVacOpts from the main table's
								 * reloptions, or NULL if none */
} av_relation;

/* struct to keep track of tables to vacuum and/or analyze, after rechecking */
typedef struct autovac_table
{
	Oid			at_relid;
	VacuumParams at_params;
	double		at_storage_param_vac_cost_delay;
	int			at_storage_param_vac_cost_limit;
	bool		at_dobalance;
	bool		at_sharedrel;
	char	   *at_relname;
	char	   *at_nspname;
	char	   *at_datname;
} autovac_table;

/*-------------
 * This struct holds information about a single worker's whereabouts.  We keep
 * an array of these in shared memory, sized according to
 * autovacuum_max_workers.
 *
 * wi_links		entry into free list or running list
 * wi_dboid		OID of the database this worker is supposed to work on
 * wi_tableoid	OID of the table currently being vacuumed, if any
 * wi_sharedrel flag indicating whether table is marked relisshared
 * wi_proc		pointer to PGPROC of the running worker, NULL if not started
 * wi_launchtime Time at which this worker was launched
 * wi_dobalance Whether this worker should be included in balance calculations
 *
 * All fields are protected by AutovacuumLock, except for wi_tableoid and
 * wi_sharedrel which are protected by AutovacuumScheduleLock (note these
 * two fields are read-only for everyone except that worker itself).
 *-------------
 */
typedef struct WorkerInfoData
{
	dlist_node	wi_links;
	Oid			wi_dboid;
	Oid			wi_tableoid;
	PGPROC	   *wi_proc;
	TimestampTz wi_launchtime;
	pg_atomic_flag wi_dobalance;
	bool		wi_sharedrel;
} WorkerInfoData;

typedef struct WorkerInfoData *WorkerInfo;

/*
 * Possible signals received by the launcher from remote processes.  These are
 * stored atomically in shared memory so that other processes can set them
 * without locking.
 */
typedef enum
{
	AutoVacForkFailed,			/* failed trying to start a worker */
	AutoVacRebalance,			/* rebalance the cost limits */
	AutoVacNumSignals,			/* must be last */
} AutoVacuumSignal;

/*
 * Autovacuum workitem array, stored in AutoVacuumShmem->av_workItems.  This
 * list is mostly protected by AutovacuumLock, except that if an item is
 * marked 'active' other processes must not modify the work-identifying
 * members.
 */
typedef struct AutoVacuumWorkItem
{
	AutoVacuumWorkItemType avw_type;
	bool		avw_used;		/* below data is valid */
	bool		avw_active;		/* being processed */
	Oid			avw_database;
	Oid			avw_relation;
	BlockNumber avw_blockNumber;
} AutoVacuumWorkItem;

#define NUM_WORKITEMS	256

/*-------------
 * The main autovacuum shmem struct.  On shared memory we store this main
 * struct and the array of WorkerInfo structs.  This struct keeps:
 *
 * av_signal		set by other processes to indicate various conditions
 * av_launcherpid	the PID of the autovacuum launcher
 * av_freeWorkers	the WorkerInfo freelist
 * av_runningWorkers the WorkerInfo non-free queue
 * av_startingWorker pointer to WorkerInfo currently being started (cleared by
 *					the worker itself as soon as it's up and running)
 * av_workItems		work item array
 * av_nworkersForBalance the number of autovacuum workers to use when
 *					calculating the per worker cost limit
 *
 * This struct is protected by AutovacuumLock, except for av_signal and parts
 * of the worker list (see above).
 *-------------
 */
typedef struct
{
	sig_atomic_t av_signal[AutoVacNumSignals];
	pid_t		av_launcherpid;
	dlist_head	av_freeWorkers;
	dlist_head	av_runningWorkers;
	WorkerInfo	av_startingWorker;
	AutoVacuumWorkItem av_workItems[NUM_WORKITEMS];
	pg_atomic_uint32 av_nworkersForBalance;
} AutoVacuumShmemStruct;

static AutoVacuumShmemStruct *AutoVacuumShmem;

/*
 * the database list (of avl_dbase elements) in the launcher, and the context
 * that contains it
 */
static dlist_head DatabaseList = DLIST_STATIC_INIT(DatabaseList);
static MemoryContext DatabaseListCxt = NULL;

/* Pointer to my own WorkerInfo, valid on each worker */
static WorkerInfo MyWorkerInfo = NULL;

/* PID of launcher, valid only in worker while shutting down */
int			AutovacuumLauncherPid = 0;

#ifdef EXEC_BACKEND
static pid_t avlauncher_forkexec(void);
static pid_t avworker_forkexec(void);
#endif
NON_EXEC_STATIC void AutoVacWorkerMain(int argc, char *argv[]) pg_attribute_noreturn();
NON_EXEC_STATIC void AutoVacLauncherMain(int argc, char *argv[]) pg_attribute_noreturn();

static Oid	do_start_worker(void);
static void HandleAutoVacLauncherInterrupts(void);
static void AutoVacLauncherShutdown(void) pg_attribute_noreturn();
static void launcher_determine_sleep(bool canlaunch, bool recursing,
									 struct timeval *nap);
static void launch_worker(TimestampTz now);
static List *get_database_list(void);
static void rebuild_database_list(Oid newdb);
static int	db_comparator(const void *a, const void *b);
static void autovac_recalculate_workers_for_balance(void);

static void do_autovacuum(void);
static void FreeWorkerInfo(int code, Datum arg);

static autovac_table *table_recheck_autovac(Oid relid, HTAB *table_toast_map,
											TupleDesc pg_class_desc,
											int effective_multixact_freeze_max_age);
|
Speed up rechecking if relation needs to be vacuumed or analyze in autovacuum.
After autovacuum collects the relations to vacuum or analyze, it rechecks
whether each relation still needs to be vacuumed or analyzed before actually
doing that. Previously this recheck could be a significant overhead
especially when there were a very large number of relations. This was
because each recheck forced the statistics to be refreshed, and the refresh
of the statistics for a very large number of relations could cause heavy
overhead. There was the report that this issue caused autovacuum workers
to have gotten “stuck” in a tight loop of table_recheck_autovac() that
rechecks whether a relation needs to be vacuumed or analyzed.
This commit speeds up the recheck by making autovacuum worker reuse
the previously-read statistics for the recheck if possible. Then if that
"stale" statistics says that a relation still needs to be vacuumed or analyzed,
autovacuum refreshes the statistics and does the recheck again.
The benchmark shows that the more relations exist and autovacuum workers
are running concurrently, the more this change reduces the autovacuum
execution time. For example, when there are 20,000 tables and 10 autovacuum
workers are running, the benchmark showed that the change improved
the performance of autovacuum more than three times. On the other hand,
even when there are only 1000 tables and only a single autovacuum worker
is running, the benchmark didn't show any big performance regression by
the change.
Firstly POC patch was proposed by Jim Nasby. As the result of discussion,
we used Tatsuhito Kasahara's version of the patch using the approach
suggested by Tom Lane.
Reported-by: Jim Nasby
Author: Tatsuhito Kasahara
Reviewed-by: Masahiko Sawada, Fujii Masao
Discussion: https://postgr.es/m/3FC6C2F2-8A47-44C0-B997-28830B5716D0@amazon.com
2020-12-08 15:59:39 +01:00
|
|
|
static void recheck_relation_needs_vacanalyze(Oid relid, AutoVacOpts *avopts,
|
|
|
|
Form_pg_class classForm,
|
|
|
|
int effective_multixact_freeze_max_age,
|
|
|
|
bool *dovacuum, bool *doanalyze, bool *wraparound);
|
2009-02-09 21:57:59 +01:00
|
|
|
static void relation_needs_vacanalyze(Oid relid, AutoVacOpts *relopts,
|
2007-03-29 00:17:12 +02:00
|
|
|
Form_pg_class classForm,
|
2009-02-09 21:57:59 +01:00
|
|
|
PgStat_StatTabEntry *tabentry,
|
2015-05-08 18:09:14 +02:00
|
|
|
int effective_multixact_freeze_max_age,
|
2009-02-09 21:57:59 +01:00
|
|
|
bool *dovacuum, bool *doanalyze, bool *wraparound);
|
2007-03-29 00:17:12 +02:00
|
|
|
|
2008-07-17 23:02:31 +02:00
|
|
|
static void autovacuum_do_vac_analyze(autovac_table *tab,
|
2007-05-30 22:12:03 +02:00
|
|
|
BufferAccessStrategy bstrategy);
|
2009-02-09 21:57:59 +01:00
|
|
|
static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
|
|
|
|
TupleDesc pg_class_desc);
|
BRIN auto-summarization
Previously, only VACUUM would cause a page range to get initially
summarized by BRIN indexes, which for some use cases takes too much time
since the inserts occur. To avoid the delay, have brininsert request a
summarization run for the previous range as soon as the first tuple is
inserted into the first page of the next range. Autovacuum is in charge
of processing these requests, after doing all the regular vacuuming/
analyzing work on tables.
This doesn't impose any new tasks on autovacuum, because autovacuum was
already in charge of doing summarizations. The only actual effect is to
change the timing, i.e. that it occurs earlier. For this reason, we
don't go any great lengths to record these requests very robustly; if
they are lost because of a server crash or restart, they will happen at
a later time anyway.
Most of the new code here is in autovacuum, which can now be told about
"work items" to process. This can be used for other things such as GIN
pending list cleaning, perhaps visibility map bit setting, both of which
are currently invoked during vacuum, but do not really depend on vacuum
taking place.
The requests are at the page range level, a granularity for which we did
not have SQL-level access; we only had index-level summarization
requests via brin_summarize_new_values(). It seems reasonable to add
SQL-level access to range-level summarization too, so add a function
brin_summarize_range() to do that.
Authors: Álvaro Herrera, based on sketch from Simon Riggs.
Reviewed-by: Thomas Munro.
Discussion: https://postgr.es/m/20170301045823.vneqdqkmsd4as4ds@alvherre.pgsql
2017-04-01 19:00:53 +02:00
|
|
|
static void perform_work_item(AutoVacuumWorkItem *workitem);
|
2008-07-17 23:02:31 +02:00
|
|
|
static void autovac_report_activity(autovac_table *tab);
|
BRIN auto-summarization
Previously, only VACUUM would cause a page range to get initially
summarized by BRIN indexes, which for some use cases takes too much time
since the inserts occur. To avoid the delay, have brininsert request a
summarization run for the previous range as soon as the first tuple is
inserted into the first page of the next range. Autovacuum is in charge
of processing these requests, after doing all the regular vacuuming/
analyzing work on tables.
This doesn't impose any new tasks on autovacuum, because autovacuum was
already in charge of doing summarizations. The only actual effect is to
change the timing, i.e. that it occurs earlier. For this reason, we
don't go any great lengths to record these requests very robustly; if
they are lost because of a server crash or restart, they will happen at
a later time anyway.
Most of the new code here is in autovacuum, which can now be told about
"work items" to process. This can be used for other things such as GIN
pending list cleaning, perhaps visibility map bit setting, both of which
are currently invoked during vacuum, but do not really depend on vacuum
taking place.
The requests are at the page range level, a granularity for which we did
not have SQL-level access; we only had index-level summarization
requests via brin_summarize_new_values(). It seems reasonable to add
SQL-level access to range-level summarization too, so add a function
brin_summarize_range() to do that.
Authors: Álvaro Herrera, based on sketch from Simon Riggs.
Reviewed-by: Thomas Munro.
Discussion: https://postgr.es/m/20170301045823.vneqdqkmsd4as4ds@alvherre.pgsql
2017-04-01 19:00:53 +02:00
|
|
|
static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
|
|
|
|
const char *nspname, const char *relname);
|
2009-08-31 21:41:00 +02:00
|
|
|
static void avl_sigusr2_handler(SIGNAL_ARGS);
|
2005-07-14 07:13:45 +02:00
|
|
|
|
|
|
|
|
2007-02-16 00:23:23 +01:00
|
|
|
|
|
|

/********************************************************************
 *					  AUTOVACUUM LAUNCHER CODE
 ********************************************************************/

#ifdef EXEC_BACKEND
/*
 * forkexec routine for the autovacuum launcher process.
 *
 * Format up the arglist, then fork and exec.
 */
static pid_t
avlauncher_forkexec(void)
{
	char	   *av[10];
	int			ac = 0;

	av[ac++] = "postgres";
	av[ac++] = "--forkavlauncher";
	av[ac++] = NULL;			/* filled in by postmaster_forkexec */
	av[ac] = NULL;

	Assert(ac < lengthof(av));

	return postmaster_forkexec(ac, av);
}
#endif
/*
 * Main entry point for autovacuum launcher process, to be called from the
 * postmaster.
 */
int
StartAutoVacLauncher(void)
{
	pid_t		AutoVacPID;

#ifdef EXEC_BACKEND
	switch ((AutoVacPID = avlauncher_forkexec()))
#else
	switch ((AutoVacPID = fork_process()))
#endif
	{
		case -1:
			ereport(LOG,
					(errmsg("could not fork autovacuum launcher process: %m")));
			return 0;

#ifndef EXEC_BACKEND
		case 0:
			/* in postmaster child ... */
			InitPostmasterChild();

			/* Close the postmaster's sockets */
			ClosePostmasterPorts(false);

			AutoVacLauncherMain(0, NULL);
			break;
#endif
		default:
			return (int) AutoVacPID;
	}

	/* shouldn't get here */
	return 0;
}

/*
 * Main loop for the autovacuum launcher process.
 */
NON_EXEC_STATIC void
AutoVacLauncherMain(int argc, char *argv[])
{
	sigjmp_buf	local_sigjmp_buf;

	am_autovacuum_launcher = true;

	MyBackendType = B_AUTOVAC_LAUNCHER;
	init_ps_display(NULL);

	ereport(DEBUG1,
			(errmsg_internal("autovacuum launcher started")));

	if (PostAuthDelay)
		pg_usleep(PostAuthDelay * 1000000L);

	SetProcessingMode(InitProcessing);

	/*
	 * Set up signal handlers.  We operate on databases much like a regular
	 * backend, so we use the same signal handling.  See equivalent code in
	 * tcop/postgres.c.
	 */
	pqsignal(SIGHUP, SignalHandlerForConfigReload);
	pqsignal(SIGINT, StatementCancelHandler);
	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
	/* SIGQUIT handler was already set up by InitPostmasterChild */

	InitializeTimeouts();		/* establishes SIGALRM handler */

	pqsignal(SIGPIPE, SIG_IGN);
	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
	pqsignal(SIGUSR2, avl_sigusr2_handler);
	pqsignal(SIGFPE, FloatExceptionHandler);
	pqsignal(SIGCHLD, SIG_DFL);

	/*
	 * Create a per-backend PGPROC struct in shared memory.  We must do this
	 * before we can use LWLocks or access any shared memory.
	 */
	InitProcess();

	/* Early initialization */
	BaseInit();

	InitPostgres(NULL, InvalidOid, NULL, InvalidOid, 0, NULL);

	SetProcessingMode(NormalProcessing);

	/*
	 * Create a memory context that we will do all our work in.  We do this so
	 * that we can reset the context during error recovery and thereby avoid
	 * possible memory leaks.
	 */
	AutovacMemCxt = AllocSetContextCreate(TopMemoryContext,
										  "Autovacuum Launcher",
										  ALLOCSET_DEFAULT_SIZES);
	MemoryContextSwitchTo(AutovacMemCxt);

	/*
	 * If an exception is encountered, processing resumes here.
	 *
	 * This code is a stripped down version of PostgresMain error recovery.
	 *
	 * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
	 * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
	 * signals other than SIGQUIT will be blocked until we complete error
	 * recovery.  It might seem that this policy makes the HOLD_INTERRUPTS()
	 * call redundant, but it is not since InterruptPending might be set
	 * already.
	 */
	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
	{
		/* since not using PG_TRY, must reset error stack by hand */
		error_context_stack = NULL;

		/* Prevents interrupts while cleaning up */
		HOLD_INTERRUPTS();

		/* Forget any pending QueryCancel or timeout request */
		disable_all_timeouts(false);
		QueryCancelPending = false; /* second to avoid race condition */

		/* Report the error to the server log */
		EmitErrorReport();

		/* Abort the current transaction in order to recover */
		AbortCurrentTransaction();

		/*
		 * Release any other resources, for the case where we were not in a
		 * transaction.
		 */
		LWLockReleaseAll();
		pgstat_report_wait_end();
		UnlockBuffers();
		/* this is probably dead code, but let's be safe: */
		if (AuxProcessResourceOwner)
			ReleaseAuxProcessResources(false);
		AtEOXact_Buffers(false);
		AtEOXact_SMgr();
		AtEOXact_Files(false);
		AtEOXact_HashTables(false);

		/*
		 * Now return to normal top-level context and clear ErrorContext for
		 * next time.
		 */
		MemoryContextSwitchTo(AutovacMemCxt);
		FlushErrorState();

		/* Flush any leaked data in the top-level context */
		MemoryContextReset(AutovacMemCxt);

		/* don't leave dangling pointers to freed memory */
		DatabaseListCxt = NULL;
		dlist_init(&DatabaseList);

		/* Now we can allow interrupts again */
		RESUME_INTERRUPTS();

		/* if in shutdown mode, no need for anything further; just go away */
		if (ShutdownRequestPending)
			AutoVacLauncherShutdown();

		/*
		 * Sleep at least 1 second after any error.  We don't want to be
		 * filling the error logs as fast as we can.
		 */
		pg_usleep(1000000L);
	}

	/* We can now handle ereport(ERROR) */
	PG_exception_stack = &local_sigjmp_buf;

	/* must unblock signals before calling rebuild_database_list */
	sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);

	/*
	 * Set always-secure search path.  Launcher doesn't connect to a database,
	 * so this has no effect.
	 */
	SetConfigOption("search_path", "", PGC_SUSET, PGC_S_OVERRIDE);

	/*
	 * Force zero_damaged_pages OFF in the autovac process, even if it is set
	 * in postgresql.conf.  We don't really want such a dangerous option being
	 * applied non-interactively.
	 */
	SetConfigOption("zero_damaged_pages", "false", PGC_SUSET, PGC_S_OVERRIDE);

	/*
	 * Force settable timeouts off to avoid letting these settings prevent
	 * regular maintenance from being executed.
	 */
	SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
	SetConfigOption("transaction_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
	SetConfigOption("lock_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
	SetConfigOption("idle_in_transaction_session_timeout", "0",
					PGC_SUSET, PGC_S_OVERRIDE);

	/*
	 * Force default_transaction_isolation to READ COMMITTED.  We don't want
	 * to pay the overhead of serializable mode, nor add any risk of causing
	 * deadlocks or delaying other transactions.
	 */
	SetConfigOption("default_transaction_isolation", "read committed",
					PGC_SUSET, PGC_S_OVERRIDE);

	/*
	 * Even when system is configured to use a different fetch consistency,
	 * for autovac we always want fresh stats.
	 */
	SetConfigOption("stats_fetch_consistency", "none", PGC_SUSET, PGC_S_OVERRIDE);
|
|
|
|
|
2015-04-08 18:19:49 +02:00
|
|
|
/*
|
|
|
|
* In emergency mode, just start a worker (unless shutdown was requested)
|
|
|
|
* and go away.
|
|
|
|
*/
|
2007-09-24 05:12:23 +02:00
|
|
|
if (!AutoVacuumingActive())
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2019-12-17 19:14:28 +01:00
|
|
|
if (!ShutdownRequestPending)
|
2015-04-08 18:19:49 +02:00
|
|
|
do_start_worker();
|
2007-04-16 20:30:04 +02:00
|
|
|
proc_exit(0); /* done */
|
|
|
|
}
|
|
|
|
|
|
|
|
AutoVacuumShmem->av_launcherpid = MyProcPid;
|
|
|
|
|
2007-02-16 00:23:23 +01:00
|
|
|
/*
|
2007-04-16 20:30:04 +02:00
|
|
|
* Create the initial database list. The invariant we want this list to
|
|
|
|
* keep is that it's ordered by decreasing next_time. As soon as an entry
|
|
|
|
* is updated to a higher time, it will be moved to the front (which is
|
|
|
|
* correct because the only operation is to add autovacuum_naptime to the
|
|
|
|
* entry, and time always increases).
|
2007-02-16 00:23:23 +01:00
|
|
|
*/
|
2007-04-16 20:30:04 +02:00
|
|
|
rebuild_database_list(InvalidOid);
|
2007-02-16 00:23:23 +01:00
|
|
|
|
2015-04-08 18:19:49 +02:00
|
|
|
/* loop until shutdown request */
|
2019-12-17 19:14:28 +01:00
|
|
|
while (!ShutdownRequestPending)
|
2007-02-16 00:23:23 +01:00
|
|
|
{
|
2007-06-13 23:24:56 +02:00
|
|
|
struct timeval nap;
|
2007-04-16 20:30:04 +02:00
|
|
|
TimestampTz current_time = 0;
|
2007-06-25 18:09:03 +02:00
|
|
|
bool can_launch;
|
2007-02-16 00:23:23 +01:00
|
|
|
|
|
|
|
/*
|
2011-08-10 18:20:30 +02:00
|
|
|
* This loop is a bit different from the normal use of WaitLatch,
|
|
|
|
* because we'd like to sleep before the first launch of a child
|
|
|
|
* process. So it's WaitLatch, then ResetLatch, then check for
|
|
|
|
* wakening conditions.
|
2007-02-16 00:23:23 +01:00
|
|
|
*/
|
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
launcher_determine_sleep(!dlist_is_empty(&AutoVacuumShmem->av_freeWorkers),
|
2008-11-02 22:24:52 +01:00
|
|
|
false, &nap);
|
2007-06-13 23:24:56 +02:00
|
|
|
|
|
|
|
/*
|
2011-08-10 18:20:30 +02:00
|
|
|
* Wait until naptime expires or we get some type of signal (all the
|
|
|
|
* signal handlers will wake us by calling SetLatch).
|
2007-06-13 23:24:56 +02:00
|
|
|
*/
|
Add WL_EXIT_ON_PM_DEATH pseudo-event.
Users of the WaitEventSet and WaitLatch() APIs can now choose between
asking for WL_POSTMASTER_DEATH and then handling it explicitly, or asking
for WL_EXIT_ON_PM_DEATH to trigger immediate exit on postmaster death.
This reduces code duplication, since almost all callers want the latter.
Repair all code that was previously ignoring postmaster death completely,
or requesting the event but ignoring it, or requesting the event but then
doing an unconditional PostmasterIsAlive() call every time through its
event loop (which is an expensive syscall on platforms for which we don't
have USE_POSTMASTER_DEATH_SIGNAL support).
Assert that callers of WaitLatchXXX() under the postmaster remember to
ask for either WL_POSTMASTER_DEATH or WL_EXIT_ON_PM_DEATH, to prevent
future bugs.
The only process that doesn't handle postmaster death is syslogger. It
waits until all backends holding the write end of the syslog pipe
(including the postmaster) have closed it by exiting, to be sure to
capture any parting messages. By using the WaitEventSet API directly
it avoids the new assertion, and as a by-product it may be slightly
more efficient on platforms that have epoll().
Author: Thomas Munro
Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas, Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D1TCviRykkUb69ppWLr_V697rzd1j3eZsRMmbXvETfqbQ%40mail.gmail.com,
https://postgr.es/m/CAEepm=2LqHzizbe7muD7-2yHUbTOoF7Q+qkSD5Q41kuhttRTwA@mail.gmail.com
2018-11-23 08:16:41 +01:00
|
|
|
(void) WaitLatch(MyLatch,
|
|
|
|
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
|
|
|
|
(nap.tv_sec * 1000L) + (nap.tv_usec / 1000L),
|
|
|
|
WAIT_EVENT_AUTOVACUUM_MAIN);
|
2007-06-13 23:24:56 +02:00
|
|
|
|
2015-01-14 18:45:22 +01:00
|
|
|
ResetLatch(MyLatch);
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2019-12-17 18:55:13 +01:00
|
|
|
HandleAutoVacLauncherInterrupts();
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2007-06-25 18:09:03 +02:00
|
|
|
/*
|
2020-06-07 15:06:51 +02:00
|
|
|
* a worker finished, or postmaster signaled failure to start a worker
|
2007-06-25 18:09:03 +02:00
|
|
|
*/
|
2009-08-31 21:41:00 +02:00
|
|
|
if (got_SIGUSR2)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2009-08-31 21:41:00 +02:00
|
|
|
got_SIGUSR2 = false;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
|
|
|
/* rebalance cost limits, if needed */
|
2007-06-25 18:09:03 +02:00
|
|
|
if (AutoVacuumShmem->av_signal[AutoVacRebalance])
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
|
|
|
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
|
2007-06-25 18:09:03 +02:00
|
|
|
AutoVacuumShmem->av_signal[AutoVacRebalance] = false;
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
|
|
|
autovac_recalculate_workers_for_balance();
|
2007-04-16 20:30:04 +02:00
|
|
|
LWLockRelease(AutovacuumLock);
|
|
|
|
}
|
2007-06-25 18:09:03 +02:00
|
|
|
|
|
|
|
if (AutoVacuumShmem->av_signal[AutoVacForkFailed])
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If the postmaster failed to start a new worker, we sleep
|
|
|
|
* for a little while and resend the signal. The new worker's
|
|
|
|
* state is still in memory, so this is sufficient. After
|
|
|
|
* that, we restart the main loop.
|
|
|
|
*
|
|
|
|
* XXX should we put a limit on the number of times we retry?
|
|
|
|
* I don't think it makes much sense, because a future start
|
|
|
|
* of a worker will continue to fail in the same way.
|
|
|
|
*/
|
|
|
|
AutoVacuumShmem->av_signal[AutoVacForkFailed] = false;
|
2009-08-24 19:23:02 +02:00
|
|
|
pg_usleep(1000000L); /* 1s */
|
2007-06-25 18:09:03 +02:00
|
|
|
SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER);
|
|
|
|
continue;
|
|
|
|
}
|
2007-02-16 00:23:23 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2007-04-16 20:30:04 +02:00
|
|
|
* There are some conditions that we need to check before trying to
|
2015-06-20 16:45:59 +02:00
|
|
|
* start a worker. First, we need to make sure that there is a worker
|
|
|
|
* slot available. Second, we need to make sure that no other worker
|
2007-06-25 18:09:03 +02:00
|
|
|
* failed while starting up.
|
2007-02-16 00:23:23 +01:00
|
|
|
*/
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2007-06-25 18:09:03 +02:00
|
|
|
current_time = GetCurrentTimestamp();
|
2007-02-16 00:23:23 +01:00
|
|
|
LWLockAcquire(AutovacuumLock, LW_SHARED);
|
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
can_launch = !dlist_is_empty(&AutoVacuumShmem->av_freeWorkers);
|
2007-02-16 00:23:23 +01:00
|
|
|
|
2008-11-02 22:24:52 +01:00
|
|
|
if (AutoVacuumShmem->av_startingWorker != NULL)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2007-06-25 18:09:03 +02:00
|
|
|
int waittime;
|
2008-11-02 22:24:52 +01:00
|
|
|
WorkerInfo worker = AutoVacuumShmem->av_startingWorker;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We can't launch another worker when another one is still
|
2007-06-25 18:09:03 +02:00
|
|
|
* starting up (or failed while doing so), so just sleep for a bit
|
|
|
|
* more; that worker will wake us up again as soon as it's ready.
|
|
|
|
* We will only wait autovacuum_naptime seconds (up to a maximum
|
|
|
|
* of 60 seconds) for this to happen, however. Note that failure
|
|
|
|
* to connect to a particular database is not a problem here,
|
|
|
|
* because the worker removes itself from the startingWorker
|
|
|
|
* pointer before trying to connect. Problems detected by the
|
|
|
|
* postmaster (like fork() failure) are also reported and handled
|
|
|
|
* differently. The only problems that may cause this code to
|
|
|
|
* fire are errors in the earlier sections of AutoVacWorkerMain,
|
|
|
|
* before the worker removes the WorkerInfo from the
|
|
|
|
* startingWorker pointer.
|
2007-04-16 20:30:04 +02:00
|
|
|
*/
|
2007-06-25 18:09:03 +02:00
|
|
|
waittime = Min(autovacuum_naptime, 60) * 1000;
|
2007-05-02 20:27:57 +02:00
|
|
|
if (TimestampDifferenceExceeds(worker->wi_launchtime, current_time,
|
2007-06-25 18:09:03 +02:00
|
|
|
waittime))
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
|
|
|
LWLockRelease(AutovacuumLock);
|
|
|
|
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
/*
|
|
|
|
* No other process can put a worker in starting mode, so if
|
|
|
|
* startingWorker is still set after exchanging our lock,
|
|
|
|
* we assume it's the same one we saw above (so we don't
|
|
|
|
* recheck the launch time).
|
|
|
|
*/
|
2008-11-02 22:24:52 +01:00
|
|
|
if (AutoVacuumShmem->av_startingWorker != NULL)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2008-11-02 22:24:52 +01:00
|
|
|
worker = AutoVacuumShmem->av_startingWorker;
|
2007-04-16 20:30:04 +02:00
|
|
|
worker->wi_dboid = InvalidOid;
|
|
|
|
worker->wi_tableoid = InvalidOid;
|
2016-05-10 21:23:54 +02:00
|
|
|
worker->wi_sharedrel = false;
|
2007-10-24 21:08:25 +02:00
|
|
|
worker->wi_proc = NULL;
|
2007-04-16 20:30:04 +02:00
|
|
|
worker->wi_launchtime = 0;
|
2012-10-19 01:04:20 +02:00
|
|
|
dlist_push_head(&AutoVacuumShmem->av_freeWorkers,
|
|
|
|
&worker->wi_links);
|
2008-11-02 22:24:52 +01:00
|
|
|
AutoVacuumShmem->av_startingWorker = NULL;
|
2021-11-22 16:55:36 +01:00
|
|
|
ereport(WARNING,
|
|
|
|
errmsg("autovacuum worker took too long to start; canceled"));
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
|
|
|
}
|
2007-02-16 00:23:23 +01:00
|
|
|
else
|
2007-04-16 20:30:04 +02:00
|
|
|
can_launch = false;
|
2007-02-16 00:23:23 +01:00
|
|
|
}
|
2007-04-16 20:30:04 +02:00
|
|
|
LWLockRelease(AutovacuumLock); /* either shared or exclusive */
|
2007-02-16 00:23:23 +01:00
|
|
|
|
2007-06-25 18:09:03 +02:00
|
|
|
/* if we can't do anything, just go back to sleep */
|
|
|
|
if (!can_launch)
|
|
|
|
continue;
|
2007-02-16 00:23:23 +01:00
|
|
|
|
2007-06-25 18:09:03 +02:00
|
|
|
/* We're OK to start a new worker */
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
if (dlist_is_empty(&DatabaseList))
|
2007-06-25 18:09:03 +02:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Special case when the list is empty: start a worker right away.
|
|
|
|
* This covers the initial case, when no database is in pgstats
|
|
|
|
* (thus the list is empty). Note that the constraints in
|
|
|
|
* launcher_determine_sleep keep us from starting workers too
|
|
|
|
* quickly (at most once every autovacuum_naptime when the list is
|
|
|
|
* empty).
|
|
|
|
*/
|
|
|
|
launch_worker(current_time);
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
2012-10-16 22:36:30 +02:00
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* because rebuild_database_list constructs a list with most
|
|
|
|
* distant adl_next_worker first, we obtain our database from the
|
|
|
|
* tail of the list.
|
|
|
|
*/
|
2012-10-19 01:04:20 +02:00
|
|
|
avl_dbase *avdb;
|
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
avdb = dlist_tail_element(avl_dbase, adl_node, &DatabaseList);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* launch a worker if next_worker is right now or it is in the
|
|
|
|
* past
|
|
|
|
*/
|
|
|
|
if (TimestampDifferenceExceeds(avdb->adl_next_worker,
|
|
|
|
current_time, 0))
|
|
|
|
launch_worker(current_time);
|
|
|
|
}
|
2007-02-16 00:23:23 +01:00
|
|
|
}
|
|
|
|
|
2019-12-17 18:55:13 +01:00
|
|
|
AutoVacLauncherShutdown();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Process any new interrupts.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
HandleAutoVacLauncherInterrupts(void)
|
|
|
|
{
|
|
|
|
/* the normal shutdown case */
|
2019-12-17 19:14:28 +01:00
|
|
|
if (ShutdownRequestPending)
|
2019-12-17 18:55:13 +01:00
|
|
|
AutoVacLauncherShutdown();
|
|
|
|
|
2019-12-17 19:03:57 +01:00
|
|
|
if (ConfigReloadPending)
|
2019-12-17 18:55:13 +01:00
|
|
|
{
|
2019-12-17 19:03:57 +01:00
|
|
|
ConfigReloadPending = false;
|
2019-12-17 18:55:13 +01:00
|
|
|
ProcessConfigFile(PGC_SIGHUP);
|
|
|
|
|
|
|
|
/* shutdown requested in config file? */
|
|
|
|
if (!AutoVacuumingActive())
|
|
|
|
AutoVacLauncherShutdown();
|
|
|
|
|
|
|
|
/* rebuild the list in case the naptime changed */
|
|
|
|
rebuild_database_list(InvalidOid);
|
|
|
|
}
|
|
|
|
|
2019-12-19 20:56:20 +01:00
|
|
|
/* Process barrier events */
|
|
|
|
if (ProcSignalBarrierPending)
|
|
|
|
ProcessProcSignalBarrier();
|
|
|
|
|
2021-10-12 02:50:17 +02:00
|
|
|
/* Perform logging of memory contexts of this process */
|
|
|
|
if (LogMemoryContextPending)
|
|
|
|
ProcessLogMemoryContextInterrupt();
|
|
|
|
|
2019-12-17 18:55:13 +01:00
|
|
|
/* Process sinval catchup interrupts that happened while sleeping */
|
|
|
|
ProcessCatchupInterrupt();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Perform a normal exit from the autovac launcher.
|
|
|
|
*/
|
|
|
|
static void
|
2020-05-21 17:31:16 +02:00
|
|
|
AutoVacLauncherShutdown(void)
|
2019-12-17 18:55:13 +01:00
|
|
|
{
|
2017-03-10 21:18:38 +01:00
|
|
|
ereport(DEBUG1,
|
2021-02-17 11:24:46 +01:00
|
|
|
(errmsg_internal("autovacuum launcher shutting down")));
|
2007-04-16 20:30:04 +02:00
|
|
|
AutoVacuumShmem->av_launcherpid = 0;
|
2007-02-16 00:23:23 +01:00
|
|
|
|
|
|
|
proc_exit(0); /* done */
|
|
|
|
}
|
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
/*
|
2007-06-13 23:24:56 +02:00
|
|
|
* Determine the time to sleep, based on the database list.
|
2007-04-16 20:30:04 +02:00
|
|
|
*
|
|
|
|
* The "canlaunch" parameter indicates whether we can start a worker right now;
|
2007-06-13 23:24:56 +02:00
|
|
|
* it may be false because, for example, all the workers are busy. If so, we
|
|
|
|
* cause a long sleep, which will be interrupted when a worker exits.
|
2007-04-16 20:30:04 +02:00
|
|
|
*/
|
2007-06-13 23:24:56 +02:00
|
|
|
static void
|
|
|
|
launcher_determine_sleep(bool canlaunch, bool recursing, struct timeval *nap)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We sleep until the next scheduled vacuum. We trust that when the
|
|
|
|
* database list was built, care was taken so that no entries have times
|
|
|
|
* in the past; if the first entry has too close a next_worker value, or a
|
|
|
|
* time in the past, we will sleep a small nominal time.
|
|
|
|
*/
|
|
|
|
if (!canlaunch)
|
|
|
|
{
|
2007-06-13 23:24:56 +02:00
|
|
|
nap->tv_sec = autovacuum_naptime;
|
|
|
|
nap->tv_usec = 0;
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
2012-10-16 22:36:30 +02:00
|
|
|
else if (!dlist_is_empty(&DatabaseList))
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
|
|
|
TimestampTz current_time = GetCurrentTimestamp();
|
|
|
|
TimestampTz next_wakeup;
|
2012-10-19 01:04:20 +02:00
|
|
|
avl_dbase *avdb;
|
2007-06-13 23:24:56 +02:00
|
|
|
long secs;
|
|
|
|
int usecs;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
avdb = dlist_tail_element(avl_dbase, adl_node, &DatabaseList);
|
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
next_wakeup = avdb->adl_next_worker;
|
|
|
|
TimestampDifference(current_time, next_wakeup, &secs, &usecs);
|
2007-06-13 23:24:56 +02:00
|
|
|
|
|
|
|
nap->tv_sec = secs;
|
|
|
|
nap->tv_usec = usecs;
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* list is empty, sleep for whole autovacuum_naptime seconds */
|
2007-06-13 23:24:56 +02:00
|
|
|
nap->tv_sec = autovacuum_naptime;
|
|
|
|
nap->tv_usec = 0;
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the result is exactly zero, it means a database had an entry with
|
|
|
|
* time in the past. Rebuild the list so that the databases are evenly
|
|
|
|
* distributed again, and recalculate the time to sleep. This can happen
|
|
|
|
* if there are more tables needing vacuum than workers, and they all take
|
|
|
|
* longer to vacuum than autovacuum_naptime.
|
|
|
|
*
|
|
|
|
* We only recurse once. rebuild_database_list should always return times
|
|
|
|
* in the future, but it seems best not to trust too much on that.
|
|
|
|
*/
|
2007-07-01 20:30:54 +02:00
|
|
|
if (nap->tv_sec == 0 && nap->tv_usec == 0 && !recursing)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
|
|
|
rebuild_database_list(InvalidOid);
|
2007-06-13 23:24:56 +02:00
|
|
|
launcher_determine_sleep(canlaunch, true, nap);
|
|
|
|
return;
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
|
|
|
|
2009-06-09 18:41:02 +02:00
|
|
|
/* The smallest time we'll allow the launcher to sleep. */
|
|
|
|
if (nap->tv_sec <= 0 && nap->tv_usec <= MIN_AUTOVAC_SLEEPTIME * 1000)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2007-07-01 20:30:54 +02:00
|
|
|
nap->tv_sec = 0;
|
2009-06-09 18:41:02 +02:00
|
|
|
nap->tv_usec = MIN_AUTOVAC_SLEEPTIME * 1000;
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
2015-06-19 17:44:36 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the sleep time is too large, clamp it to an arbitrary maximum (plus
|
|
|
|
* any fractional seconds, for simplicity). This avoids an essentially
|
|
|
|
* infinite sleep in strange cases like the system clock going backwards a
|
|
|
|
* few years.
|
|
|
|
*/
|
|
|
|
if (nap->tv_sec > MAX_AUTOVAC_SLEEPTIME)
|
|
|
|
nap->tv_sec = MAX_AUTOVAC_SLEEPTIME;
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build an updated DatabaseList. It must only contain databases that appear
|
|
|
|
* in pgstats, and must be sorted by next_worker from highest to lowest,
|
|
|
|
* distributed regularly across the next autovacuum_naptime interval.
|
|
|
|
*
|
|
|
|
* Receives the Oid of the database that made this list be generated (we call
|
|
|
|
* this the "new" database, because when the database was already present on
|
|
|
|
* the list, we expect that this function is not called at all). The
|
|
|
|
* preexisting list, if any, will be used to preserve the order of the
|
|
|
|
* databases in the autovacuum_naptime period. The new database is put at the
|
|
|
|
* end of the interval. The actual values are not saved, which should not be
|
|
|
|
* much of a problem.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
rebuild_database_list(Oid newdb)
|
|
|
|
{
|
|
|
|
List *dblist;
|
|
|
|
ListCell *cell;
|
|
|
|
MemoryContext newcxt;
|
|
|
|
MemoryContext oldcxt;
|
|
|
|
MemoryContext tmpcxt;
|
|
|
|
HASHCTL hctl;
|
|
|
|
int score;
|
|
|
|
int nelems;
|
|
|
|
HTAB *dbhash;
|
2012-10-16 22:36:30 +02:00
|
|
|
dlist_iter iter;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
|
|
|
newcxt = AllocSetContextCreate(AutovacMemCxt,
|
2021-11-22 16:55:36 +01:00
|
|
|
"Autovacuum database list",
|
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
|
|
|
ALLOCSET_DEFAULT_SIZES);
|
2007-04-16 20:30:04 +02:00
|
|
|
tmpcxt = AllocSetContextCreate(newcxt,
|
2021-11-22 16:55:36 +01:00
|
|
|
"Autovacuum database list (tmp)",
|
2016-08-27 23:50:38 +02:00
|
|
|
ALLOCSET_DEFAULT_SIZES);
|
2007-04-16 20:30:04 +02:00
|
|
|
oldcxt = MemoryContextSwitchTo(tmpcxt);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Implementing this is not as simple as it sounds, because we need to put
|
|
|
|
* the new database at the end of the list; next the databases that were
|
|
|
|
* already on the list, and finally (at the tail of the list) all the
|
|
|
|
* other databases that are not on the existing list.
|
|
|
|
*
|
|
|
|
* To do this, we build an empty hash table of scored databases. We will
|
|
|
|
* start with the lowest score (zero) for the new database, then
|
|
|
|
* increasing scores for the databases in the existing list, in order, and
|
|
|
|
* lastly increasing scores for all databases gotten via
|
|
|
|
* get_database_list() that are not already on the hash.
|
|
|
|
*
|
|
|
|
* Then we will put all the hash elements into an array, sort the array by
|
|
|
|
* score, and finally put the array elements into the new doubly linked
|
|
|
|
* list.
|
|
|
|
*/
|
|
|
|
hctl.keysize = sizeof(Oid);
|
|
|
|
hctl.entrysize = sizeof(avl_dbase);
|
|
|
|
hctl.hcxt = tmpcxt;
|
2021-11-22 16:55:36 +01:00
|
|
|
dbhash = hash_create("autovacuum db hash", 20, &hctl, /* magic number here
|
|
|
|
* FIXME */
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
2007-04-16 20:30:04 +02:00
|
|
|
|
|
|
|
/* start by inserting the new database */
|
|
|
|
score = 0;
|
|
|
|
if (OidIsValid(newdb))
|
|
|
|
{
|
|
|
|
avl_dbase *db;
|
|
|
|
PgStat_StatDBEntry *entry;
|
|
|
|
|
|
|
|
/* only consider this database if it has a pgstat entry */
|
|
|
|
entry = pgstat_fetch_stat_dbentry(newdb);
|
|
|
|
if (entry != NULL)
|
|
|
|
{
|
|
|
|
/* we assume it isn't found because the hash was just created */
|
|
|
|
db = hash_search(dbhash, &newdb, HASH_ENTER, NULL);
|
|
|
|
|
|
|
|
/* hash_search already filled in the key */
|
|
|
|
db->adl_score = score++;
|
|
|
|
/* next_worker is filled in later */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now insert the databases from the existing list */
|
2012-10-16 22:36:30 +02:00
|
|
|
dlist_foreach(iter, &DatabaseList)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2012-10-16 22:36:30 +02:00
|
|
|
avl_dbase *avdb = dlist_container(avl_dbase, adl_node, iter.cur);
|
|
|
|
avl_dbase *db;
|
|
|
|
bool found;
|
|
|
|
PgStat_StatDBEntry *entry;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
/*
|
|
|
|
* skip databases with no stat entries -- in particular, this gets rid
|
|
|
|
* of dropped databases
|
|
|
|
*/
|
|
|
|
entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
|
|
|
|
if (entry == NULL)
|
|
|
|
continue;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
db = hash_search(dbhash, &(avdb->adl_datid), HASH_ENTER, &found);
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2012-10-16 22:36:30 +02:00
|
|
|
if (!found)
|
|
|
|
{
|
|
|
|
/* hash_search already filled in the key */
|
|
|
|
db->adl_score = score++;
|
|
|
|
/* next_worker is filled in later */
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* finally, insert all qualifying databases not previously inserted */
|
|
|
|
dblist = get_database_list();
|
|
|
|
foreach(cell, dblist)
|
|
|
|
{
|
|
|
|
avw_dbase *avdb = lfirst(cell);
|
|
|
|
avl_dbase *db;
|
|
|
|
bool found;
|
|
|
|
PgStat_StatDBEntry *entry;
|
|
|
|
|
|
|
|
/* only consider databases with a pgstat entry */
|
|
|
|
entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
|
|
|
|
if (entry == NULL)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
db = hash_search(dbhash, &(avdb->adw_datid), HASH_ENTER, &found);
|
|
|
|
/* only update the score if the database was not already on the hash */
|
|
|
|
if (!found)
|
|
|
|
{
|
|
|
|
/* hash_search already filled in the key */
|
|
|
|
db->adl_score = score++;
|
|
|
|
/* next_worker is filled in later */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
nelems = score;
|
|
|
|
|
|
|
|
/* from here on, the allocated memory belongs to the new list */
|
|
|
|
MemoryContextSwitchTo(newcxt);
|
2012-10-16 22:36:30 +02:00
|
|
|
	dlist_init(&DatabaseList);

	if (nelems > 0)
	{
		TimestampTz current_time;
		int			millis_increment;
		avl_dbase  *dbary;
		avl_dbase  *db;
		HASH_SEQ_STATUS seq;
		int			i;

		/* put all the hash elements into an array */
		dbary = palloc(nelems * sizeof(avl_dbase));

		i = 0;
		hash_seq_init(&seq, dbhash);
		while ((db = hash_seq_search(&seq)) != NULL)
			memcpy(&(dbary[i++]), db, sizeof(avl_dbase));

		/* sort the array */
		qsort(dbary, nelems, sizeof(avl_dbase), db_comparator);

		/*
		 * Determine the time interval between databases in the schedule.  If
		 * we see that the configured naptime would take us to sleep times
		 * lower than our min sleep time (which launcher_determine_sleep is
		 * coded not to allow), silently use a larger naptime (but don't touch
		 * the GUC variable).
		 */
		millis_increment = 1000.0 * autovacuum_naptime / nelems;
		if (millis_increment <= MIN_AUTOVAC_SLEEPTIME)
			millis_increment = MIN_AUTOVAC_SLEEPTIME * 1.1;

		current_time = GetCurrentTimestamp();

		/*
		 * move the elements from the array into the dlist, setting the
		 * next_worker while walking the array
		 */
		for (i = 0; i < nelems; i++)
		{
			db = &(dbary[i]);

			current_time = TimestampTzPlusMilliseconds(current_time,
													   millis_increment);
			db->adl_next_worker = current_time;

			/* later elements should go closer to the head of the list */
			dlist_push_head(&DatabaseList, &db->adl_node);
		}
	}

	/* all done, clean up memory */
	if (DatabaseListCxt != NULL)
		MemoryContextDelete(DatabaseListCxt);
	MemoryContextDelete(tmpcxt);
	DatabaseListCxt = newcxt;
	MemoryContextSwitchTo(oldcxt);
}
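/*
 * The spacing rule above can be sketched in isolation.  This is a hedged
 * standalone model, not the launcher's actual code: SKETCH_MIN_SLEEP_MS and
 * schedule_increment_ms are illustrative names standing in for
 * MIN_AUTOVAC_SLEEPTIME and the inline computation.
 */

```c
#include <assert.h>

/*
 * Standalone sketch of the schedule spacing: nelems databases are spread
 * evenly across the naptime (in seconds), but never scheduled closer
 * together than a minimum sleep, mirroring the MIN_AUTOVAC_SLEEPTIME
 * clamp above.  Names are illustrative, not the real GUC machinery.
 */
#define SKETCH_MIN_SLEEP_MS 100

static int
schedule_increment_ms(int naptime_secs, int nelems)
{
	int			millis = (int) (1000.0 * naptime_secs / nelems);

	if (millis <= SKETCH_MIN_SLEEP_MS)
		millis = (int) (SKETCH_MIN_SLEEP_MS * 1.1);
	return millis;
}
```

With a 60-second naptime and 3 databases, each database's next_worker time lands 20 seconds after the previous one; with many databases the floor kicks in instead.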

/* qsort comparator for avl_dbase, using adl_score */
static int
db_comparator(const void *a, const void *b)
{
	return pg_cmp_s32(((const avl_dbase *) a)->adl_score,
					  ((const avl_dbase *) b)->adl_score);
}
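/*
 * A note on the comparator: the naive qsort comparator idiom "return a - b"
 * overflows for widely separated int32 values, which is why pg_cmp_s32 is
 * used.  The sketch below (cmp_s32_sketch is an illustrative stand-in, not
 * the real pg_cmp_s32) shows an overflow-safe comparison with the same
 * contract.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Overflow-safe three-way comparison for int32 values: (a > b) - (a < b)
 * yields -1, 0, or 1 without ever computing a - b, so it is safe even for
 * INT32_MIN vs INT32_MAX, where subtraction would overflow.
 */
static int
cmp_s32_sketch(int32_t a, int32_t b)
{
	return (a > b) - (a < b);
}
```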

/*
 * do_start_worker
 *
 * Bare-bones procedure for starting an autovacuum worker from the launcher.
 * It determines what database to work on, sets up shared memory stuff and
 * signals postmaster to start the worker.  It fails gracefully if invoked
 * when autovacuum_workers are already active.
 *
 * Return value is the OID of the database that the worker is going to process,
 * or InvalidOid if no worker was actually started.
 */
static Oid
do_start_worker(void)
{
	List	   *dblist;
	ListCell   *cell;
	TransactionId xidForceLimit;
	MultiXactId multiForceLimit;
	bool		for_xid_wrap;
	bool		for_multi_wrap;
	avw_dbase  *avdb;
	TimestampTz current_time;
	bool		skipit = false;
	Oid			retval = InvalidOid;
	MemoryContext tmpcxt,
				oldcxt;

	/* return quickly when there are no free workers */
	LWLockAcquire(AutovacuumLock, LW_SHARED);
	if (dlist_is_empty(&AutoVacuumShmem->av_freeWorkers))
	{
		LWLockRelease(AutovacuumLock);
		return InvalidOid;
	}
	LWLockRelease(AutovacuumLock);

	/*
	 * Create and switch to a temporary context to avoid leaking the memory
	 * allocated for the database list.
	 */
	tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
								   "Autovacuum start worker (tmp)",
								   ALLOCSET_DEFAULT_SIZES);
	oldcxt = MemoryContextSwitchTo(tmpcxt);
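/*
 * The temporary-context idiom above -- allocate freely while the context is
 * current, then destroy the whole context instead of freeing each piece --
 * can be modeled with a minimal arena.  Arena, arena_alloc and arena_release
 * below are hypothetical names for illustration only, not the real
 * MemoryContext API.
 */

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Minimal arena sketch: every allocation is chained onto the arena, and a
 * single release call frees them all, so callers need not track and free
 * individual list cells.  Hypothetical names, not the MemoryContext API.
 */
typedef struct ArenaNode
{
	struct ArenaNode *next;
} ArenaNode;

typedef struct Arena
{
	ArenaNode  *head;
} Arena;

static void *
arena_alloc(Arena *arena, size_t size)
{
	/* header plus payload; the header links this chunk into the arena */
	ArenaNode  *node = malloc(sizeof(ArenaNode) + size);

	if (node == NULL)
		return NULL;
	node->next = arena->head;
	arena->head = node;
	return (char *) node + sizeof(ArenaNode);
}

static void
arena_release(Arena *arena)
{
	/* free every chunk ever handed out from this arena */
	while (arena->head != NULL)
	{
		ArenaNode  *node = arena->head;

		arena->head = node->next;
		free(node);
	}
}
```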

	/* Get a list of databases */
	dblist = get_database_list();

	/*
	 * Determine the oldest datfrozenxid/relfrozenxid that we will allow to
	 * pass without forcing a vacuum.  (This limit can be tightened for
	 * particular tables, but not loosened.)
	 */
	recentXid = ReadNextTransactionId();
	xidForceLimit = recentXid - autovacuum_freeze_max_age;
	/* ensure it's a "normal" XID, else TransactionIdPrecedes misbehaves */
	/* this can cause the limit to go backwards by 3, but that's OK */
	if (xidForceLimit < FirstNormalTransactionId)
		xidForceLimit -= FirstNormalTransactionId;
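/*
 * The "normal XID" adjustment above exists because XID comparison is
 * circular.  A hedged sketch of the comparison in the style of
 * TransactionIdPrecedes: a precedes b when the signed 32-bit modular
 * difference is negative.  SketchXid and xid_precedes_sketch are
 * illustrative names, not the real implementation.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Circular comparison over a 32-bit XID space: interpret the modular
 * difference as signed, so each XID "precedes" the 2^31 values ahead of
 * it on the circle.  This only makes sense for normal XIDs, which is why
 * xidForceLimit is forced back into the normal range above.
 */
typedef uint32_t SketchXid;

static bool
xid_precedes_sketch(SketchXid a, SketchXid b)
{
	int32_t		diff = (int32_t) (a - b);

	return diff < 0;
}
```

Note how an XID near the top of the 32-bit range still precedes a small XID that has wrapped past zero, which plain unsigned comparison would get backwards.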

	/* Also determine the oldest datminmxid we will consider. */
	recentMulti = ReadNextMultiXactId();
	multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();
	if (multiForceLimit < FirstMultiXactId)
		multiForceLimit -= FirstMultiXactId;

	/*
	 * Choose a database to connect to.  We pick the database that was least
	 * recently auto-vacuumed, or one that needs vacuuming to prevent Xid
	 * wraparound-related data loss.  If any db at risk of Xid wraparound is
	 * found, we pick the one with oldest datfrozenxid, independently of
	 * autovacuum times; similarly we pick the one with the oldest datminmxid
	 * if any is in MultiXactId wraparound.  Note that those in Xid wraparound
	 * danger are given more priority than those in multi wraparound danger.
	 *
	 * Note that a database with no stats entry is not considered, except for
	 * Xid wraparound purposes.  The theory is that if no one has ever
	 * connected to it since the stats were last initialized, it doesn't need
	 * vacuuming.
	 *
	 * XXX This could be improved if we had more info about whether it needs
	 * vacuuming before connecting to it.  Perhaps look through the pgstats
	 * data for the database's tables?  One idea is to keep track of the
	 * number of new and dead tuples per database in pgstats.  However it
	 * isn't clear how to construct a metric that measures that and not cause
	 * starvation for less busy databases.
	 */
	avdb = NULL;
	for_xid_wrap = false;
	for_multi_wrap = false;
	current_time = GetCurrentTimestamp();
	foreach(cell, dblist)
	{
		avw_dbase  *tmp = lfirst(cell);
		dlist_iter	iter;

		/* Check to see if this one is at risk of wraparound */
		if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
		{
			if (avdb == NULL ||
|
				TransactionIdPrecedes(tmp->adw_frozenxid,
									  avdb->adw_frozenxid))
				avdb = tmp;
			for_xid_wrap = true;
			continue;
		}
		else if (for_xid_wrap)
			continue;			/* ignore not-at-risk DBs */
		else if (MultiXactIdPrecedes(tmp->adw_minmulti, multiForceLimit))
|
|
|
{
|
|
|
|
if (avdb == NULL ||
|
2013-09-16 20:45:00 +02:00
|
|
|
MultiXactIdPrecedes(tmp->adw_minmulti, avdb->adw_minmulti))
|
				avdb = tmp;
			for_multi_wrap = true;
			continue;
		}
		else if (for_multi_wrap)
			continue;			/* ignore not-at-risk DBs */

		/* Find pgstat entry if any */
		tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);

		/*
		 * Skip a database with no pgstat entry; it means it hasn't seen any
		 * activity.
		 */
		if (!tmp->adw_entry)
			continue;

		/*
		 * Also, skip a database that appears on the database list as having
		 * been processed recently (less than autovacuum_naptime seconds ago).
		 * We do this so that we don't select a database which we just
		 * selected, but that pgstat hasn't gotten around to updating the
		 * last autovacuum time yet.
		 */
		skipit = false;

		dlist_reverse_foreach(iter, &DatabaseList)
		{
			avl_dbase  *dbp = dlist_container(avl_dbase, adl_node, iter.cur);

			if (dbp->adl_datid == tmp->adw_datid)
			{
				/*
				 * Skip this database if its next_worker value falls between
				 * the current time and the current time plus naptime.
				 */
				if (!TimestampDifferenceExceeds(dbp->adl_next_worker,
												current_time, 0) &&
					!TimestampDifferenceExceeds(current_time,
												dbp->adl_next_worker,
												autovacuum_naptime * 1000))
					skipit = true;

				break;
			}
		}
		if (skipit)
			continue;

		/*
		 * Remember the db with oldest autovac time.  (If we are here, both
		 * tmp->entry and db->entry must be non-null.)
		 */
		if (avdb == NULL ||
			tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
			avdb = tmp;
	}

	/* Found a database -- process it */
	if (avdb != NULL)
	{
		WorkerInfo	worker;
		dlist_node *wptr;

		LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

		/*
		 * Get a worker entry from the freelist.  We checked above, so there
		 * really should be a free slot.
		 */
		wptr = dlist_pop_head_node(&AutoVacuumShmem->av_freeWorkers);

		worker = dlist_container(WorkerInfoData, wi_links, wptr);
		worker->wi_dboid = avdb->adw_datid;
		worker->wi_proc = NULL;
		worker->wi_launchtime = GetCurrentTimestamp();

		AutoVacuumShmem->av_startingWorker = worker;

		LWLockRelease(AutovacuumLock);

		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER);

		retval = avdb->adw_datid;
	}
	else if (skipit)
	{
		/*
		 * If we skipped all databases on the list, rebuild it, because it
		 * probably contains a dropped database.
		 */
		rebuild_database_list(InvalidOid);
	}

	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(tmpcxt);

	return retval;
}

/*
 * launch_worker
 *
 * Wrapper for starting a worker from the launcher.  Besides actually
 * starting it, update the database list to reflect the next time that
 * another one will need to be started on the selected database.  The actual
 * database choice is left to do_start_worker.
 *
 * This routine is also expected to insert an entry into the database list
 * if the selected database was previously absent from the list.
 */
static void
launch_worker(TimestampTz now)
{
	Oid			dbid;
	dlist_iter	iter;

	dbid = do_start_worker();
	if (OidIsValid(dbid))
	{
		bool		found = false;

		/*
		 * Walk the database list and update the corresponding entry.  If
		 * the database is not on the list, we'll recreate the list.
		 */
		dlist_foreach(iter, &DatabaseList)
		{
			avl_dbase  *avdb = dlist_container(avl_dbase, adl_node, iter.cur);

			if (avdb->adl_datid == dbid)
			{
				found = true;

				/*
				 * add autovacuum_naptime seconds to the current time, and
				 * use that as the new "next_worker" field for this database.
				 */
				avdb->adl_next_worker =
					TimestampTzPlusMilliseconds(now, autovacuum_naptime * 1000);

				dlist_move_head(&DatabaseList, iter.cur);
				break;
			}
		}

		/*
		 * If the database was not present in the database list, we rebuild
		 * the list.  It's possible that the database does not get into the
		 * list anyway, for example if it's a database that doesn't have a
		 * pgstat entry, but this is not a problem because we don't want to
		 * schedule workers regularly into those in any case.
		 */
		if (!found)
			rebuild_database_list(dbid);
	}
}

/*
 * Called from postmaster to signal a failure to fork a process to become
 * worker.  The postmaster should kill(SIGUSR2) the launcher shortly after
 * calling this function.
 */
void
AutoVacWorkerFailed(void)
{
	AutoVacuumShmem->av_signal[AutoVacForkFailed] = true;
}

/* SIGUSR2: a worker is up and running, or just finished, or failed to fork */
static void
avl_sigusr2_handler(SIGNAL_ARGS)
{
	got_SIGUSR2 = true;
	SetLatch(MyLatch);
}

/********************************************************************
 *					  AUTOVACUUM WORKER CODE
 ********************************************************************/

#ifdef EXEC_BACKEND
/*
 * forkexec routines for the autovacuum worker.
 *
 * Format up the arglist, then fork and exec.
 */
static pid_t
avworker_forkexec(void)
{
	char	   *av[10];
	int			ac = 0;

	av[ac++] = "postgres";
	av[ac++] = "--forkavworker";
	av[ac++] = NULL;			/* filled in by postmaster_forkexec */
	av[ac] = NULL;

	Assert(ac < lengthof(av));

	return postmaster_forkexec(ac, av);
}
#endif

/*
 * Main entry point for autovacuum worker process.
 *
 * This code is heavily based on pgarch.c, q.v.
 */
int
StartAutoVacWorker(void)
{
	pid_t		worker_pid;

#ifdef EXEC_BACKEND
	switch ((worker_pid = avworker_forkexec()))
#else
	switch ((worker_pid = fork_process()))
#endif
	{
		case -1:
			ereport(LOG,
					(errmsg("could not fork autovacuum worker process: %m")));
			return 0;

#ifndef EXEC_BACKEND
		case 0:
			/* in postmaster child ... */
			InitPostmasterChild();

			/* Close the postmaster's sockets */
			ClosePostmasterPorts(false);

			AutoVacWorkerMain(0, NULL);
			break;
#endif
		default:
			return (int) worker_pid;
	}

	/* shouldn't get here */
	return 0;
}

/*
 * AutoVacWorkerMain
 */
NON_EXEC_STATIC void
AutoVacWorkerMain(int argc, char *argv[])
{
	sigjmp_buf	local_sigjmp_buf;
	Oid			dbid;

	am_autovacuum_worker = true;

	MyBackendType = B_AUTOVAC_WORKER;
	init_ps_display(NULL);

	SetProcessingMode(InitProcessing);

	/*
	 * Set up signal handlers.  We operate on databases much like a regular
	 * backend, so we use the same signal handling.  See equivalent code in
	 * tcop/postgres.c.
	 */
	pqsignal(SIGHUP, SignalHandlerForConfigReload);

	/*
	 * SIGINT is used to signal canceling the current table's vacuum; SIGTERM
	 * means abort and exit cleanly, and SIGQUIT means abandon ship.
	 */
	pqsignal(SIGINT, StatementCancelHandler);
	pqsignal(SIGTERM, die);

	/* SIGQUIT handler was already set up by InitPostmasterChild */
	InitializeTimeouts();		/* establishes SIGALRM handler */

	pqsignal(SIGPIPE, SIG_IGN);
	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
	pqsignal(SIGUSR2, SIG_IGN);
	pqsignal(SIGFPE, FloatExceptionHandler);
	pqsignal(SIGCHLD, SIG_DFL);

	/*
	 * Create a per-backend PGPROC struct in shared memory.  We must do this
	 * before we can use LWLocks or access any shared memory.
	 */
	InitProcess();

	/* Early initialization */
	BaseInit();

	/*
	 * If an exception is encountered, processing resumes here.
	 *
	 * Unlike most auxiliary processes, we don't attempt to continue
	 * processing after an error; we just clean up and exit.  The autovac
	 * launcher is responsible for spawning another worker later.
	 *
	 * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
	 * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
	 * signals other than SIGQUIT will be blocked until we exit.  It might
	 * seem that this policy makes the HOLD_INTERRUPTS() call redundant, but
	 * it is not since InterruptPending might be set already.
	 */
	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
	{
		/* since not using PG_TRY, must reset error stack by hand */
		error_context_stack = NULL;

		/* Prevents interrupts while cleaning up */
		HOLD_INTERRUPTS();

		/* Report the error to the server log */
		EmitErrorReport();

		/*
		 * We can now go away.  Note that because we called InitProcess, a
		 * callback was registered to do ProcKill, which will clean up
		 * necessary state.
		 */
		proc_exit(0);
	}

	/* We can now handle ereport(ERROR) */
	PG_exception_stack = &local_sigjmp_buf;

	sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
|
|
|
/*
|
|
|
|
* Set always-secure search path, so malicious users can't redirect user
|
|
|
|
* code (e.g. pg_index.indexprs). (That code runs in a
|
|
|
|
* SECURITY_RESTRICTED_OPERATION sandbox, so malicious users could not
|
|
|
|
* take control of the entire autovacuum worker in any case.)
|
|
|
|
*/
|
|
|
|
SetConfigOption("search_path", "", PGC_SUSET, PGC_S_OVERRIDE);
|
|
|
|
|
2006-03-07 18:32:22 +01:00
|
|
|
/*
|
|
|
|
* Force zero_damaged_pages OFF in the autovac process, even if it is set
|
|
|
|
* in postgresql.conf. We don't really want such a dangerous option being
|
|
|
|
* applied non-interactively.
|
|
|
|
*/
|
|
|
|
SetConfigOption("zero_damaged_pages", "false", PGC_SUSET, PGC_S_OVERRIDE);
|
|
|
|
|
2005-07-29 21:30:09 +02:00
|
|
|
/*
|
2016-06-15 16:52:53 +02:00
|
|
|
* Force settable timeouts off to avoid letting these settings prevent
|
|
|
|
* regular maintenance from being executed.
|
2005-07-29 21:30:09 +02:00
|
|
|
*/
|
2007-04-16 20:30:04 +02:00
|
|
|
SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
|
2024-02-15 22:34:11 +01:00
|
|
|
SetConfigOption("transaction_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
|
2013-03-17 04:22:17 +01:00
|
|
|
SetConfigOption("lock_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
|
2016-06-15 16:52:53 +02:00
|
|
|
SetConfigOption("idle_in_transaction_session_timeout", "0",
|
|
|
|
PGC_SUSET, PGC_S_OVERRIDE);
|
2005-07-29 21:30:09 +02:00
|
|
|
|
2011-11-30 04:39:16 +01:00
|
|
|
/*
|
|
|
|
* Force default_transaction_isolation to READ COMMITTED. We don't want
|
|
|
|
* to pay the overhead of serializable mode, nor add any risk of causing
|
|
|
|
* deadlocks or delaying other transactions.
|
|
|
|
*/
|
|
|
|
SetConfigOption("default_transaction_isolation", "read committed",
|
|
|
|
PGC_SUSET, PGC_S_OVERRIDE);
|
|
|
|
|
2011-03-06 23:49:16 +01:00
|
|
|
/*
|
|
|
|
* Force synchronous replication off to allow regular maintenance even if
|
|
|
|
* we are waiting for standbys to connect. This is important to ensure we
|
|
|
|
* aren't blocked from performing anti-wraparound tasks.
|
|
|
|
*/
|
2011-04-05 00:23:13 +02:00
|
|
|
if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
|
2011-11-30 04:39:16 +01:00
|
|
|
SetConfigOption("synchronous_commit", "local",
|
|
|
|
PGC_SUSET, PGC_S_OVERRIDE);
|
2011-03-06 23:49:16 +01:00
|
|
|
|
2022-04-07 06:29:46 +02:00
|
|
|
/*
|
|
|
|
* Even when the system is configured to use a different fetch consistency,
|
|
|
|
* for autovac we always want fresh stats.
|
|
|
|
*/
|
|
|
|
SetConfigOption("stats_fetch_consistency", "none", PGC_SUSET, PGC_S_OVERRIDE);
|
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
/*
|
|
|
|
* Get the info about the database we're going to work on.
|
|
|
|
*/
|
|
|
|
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
/*
|
2007-06-25 18:09:03 +02:00
|
|
|
* beware of startingWorker being INVALID; this should normally not
|
|
|
|
* happen, but if a worker fails after forking and before this, the
|
|
|
|
* launcher might have decided to remove it from the queue and start
|
|
|
|
* again.
|
2007-04-16 20:30:04 +02:00
|
|
|
*/
|
2008-11-02 22:24:52 +01:00
|
|
|
if (AutoVacuumShmem->av_startingWorker != NULL)
|
2007-05-02 17:47:14 +02:00
|
|
|
{
|
2008-11-02 22:24:52 +01:00
|
|
|
MyWorkerInfo = AutoVacuumShmem->av_startingWorker;
|
2007-05-02 17:47:14 +02:00
|
|
|
dbid = MyWorkerInfo->wi_dboid;
|
2007-10-24 21:08:25 +02:00
|
|
|
MyWorkerInfo->wi_proc = MyProc;
|
2006-11-05 23:42:10 +01:00
|
|
|
|
2007-05-02 17:47:14 +02:00
|
|
|
/* insert into the running list */
|
2012-10-16 22:36:30 +02:00
|
|
|
dlist_push_head(&AutoVacuumShmem->av_runningWorkers,
|
|
|
|
&MyWorkerInfo->wi_links);
|
2007-06-30 06:08:05 +02:00
|
|
|
|
2007-05-02 17:47:14 +02:00
|
|
|
/*
|
2007-05-04 04:06:13 +02:00
|
|
|
* clear the "starting" pointer, so that the launcher can start
|
|
|
|
* a new worker if required
|
2007-05-02 17:47:14 +02:00
|
|
|
*/
|
2008-11-02 22:24:52 +01:00
|
|
|
AutoVacuumShmem->av_startingWorker = NULL;
|
2007-05-02 17:47:14 +02:00
|
|
|
LWLockRelease(AutovacuumLock);
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2007-05-02 17:47:14 +02:00
|
|
|
on_shmem_exit(FreeWorkerInfo, 0);
|
|
|
|
|
|
|
|
/* wake up the launcher */
|
|
|
|
if (AutoVacuumShmem->av_launcherpid != 0)
|
2009-08-31 21:41:00 +02:00
|
|
|
kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);
|
2007-05-02 17:47:14 +02:00
|
|
|
}
|
|
|
|
else
|
2007-05-04 04:06:13 +02:00
|
|
|
{
|
2007-05-02 17:47:14 +02:00
|
|
|
/* no worker entry for me, go away */
|
2007-06-25 18:09:03 +02:00
|
|
|
elog(WARNING, "autovacuum worker started without a worker entry");
|
2007-05-04 04:06:13 +02:00
|
|
|
dbid = InvalidOid;
|
2007-05-02 17:47:14 +02:00
|
|
|
LWLockRelease(AutovacuumLock);
|
2007-05-04 04:06:13 +02:00
|
|
|
}
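The handshake above — claim av_startingWorker, link it into the running list, then clear the starting pointer so the launcher can schedule another worker — can be modeled with a toy single-threaded sketch (types and names here are hypothetical; the real code does all of this while holding AutovacuumLock in shared memory):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the shared-memory worker bookkeeping. */
typedef struct WorkerInfo
{
	unsigned	dboid;			/* database this worker should vacuum */
	struct WorkerInfo *next;	/* link in the running list */
} WorkerInfo;

static WorkerInfo *starting = NULL; /* slot the launcher parks a new worker in */
static WorkerInfo *running = NULL;	/* head of the running-workers list */

/*
 * A new worker claims the "starting" slot: push it onto the running list and
 * clear the slot so the launcher can start another worker.  Returns the
 * database OID to work on, or 0 (InvalidOid) if no entry was set up for us.
 */
static unsigned
claim_worker_slot(void)
{
	WorkerInfo *w = starting;

	if (w == NULL)
		return 0;				/* no worker entry for me, go away */
	w->next = running;			/* insert into the running list */
	running = w;
	starting = NULL;			/* launcher may now start a new worker */
	return w->dboid;
}
```

In the real code the same transition additionally records MyProc in the WorkerInfo and signals the launcher with SIGUSR2.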
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2007-02-16 00:23:23 +01:00
|
|
|
if (OidIsValid(dbid))
|
2005-07-14 07:13:45 +02:00
|
|
|
{
|
2009-08-12 22:53:31 +02:00
|
|
|
char dbname[NAMEDATALEN];
|
2007-02-16 00:23:23 +01:00
|
|
|
|
2005-08-15 18:25:19 +02:00
|
|
|
/*
|
2022-04-06 22:56:06 +02:00
|
|
|
* Report autovac startup to the cumulative stats system. We
|
2005-08-15 18:25:19 +02:00
|
|
|
* deliberately do this before InitPostgres, so that the
|
|
|
|
* last_autovac_time will get updated even if the connection attempt
|
|
|
|
* fails. This is to prevent autovac from getting "stuck" repeatedly
|
|
|
|
* selecting an unopenable database, rather than making any progress
|
|
|
|
* on stuff it can connect to.
|
|
|
|
*/
|
2007-02-16 00:23:23 +01:00
|
|
|
pgstat_report_autovac(dbid);
|
2005-08-15 18:25:19 +02:00
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/*
|
Process session_preload_libraries within InitPostgres's transaction.
Previously we did this after InitPostgres, at a somewhat randomly chosen
place within PostgresMain. However, since commit a0ffa885e doing this
outside a transaction can cause a crash, if we need to check permissions
while replacing a placeholder GUC. (Besides which, a preloaded library
could itself want to do database access within _PG_init.)
To avoid needing an additional transaction start/end in every session,
move the process_session_preload_libraries call to within InitPostgres's
transaction. That requires teaching the code not to call it when
InitPostgres is called from somewhere other than PostgresMain, since
we don't want session_preload_libraries to affect background workers.
The most future-proof solution here seems to be to add an additional
flag parameter to InitPostgres; fortunately, we're not yet very worried
about API stability for v15.
Doing this also exposed the fact that we're currently honoring
session_preload_libraries in walsenders, even those not connected to
any database. This seems, at minimum, a POLA violation: walsenders
are not interactive sessions. Let's stop doing that.
(All these comments also apply to local_preload_libraries, of course.)
Per report from Gurjeet Singh (thanks also to Nathan Bossart and Kyotaro
Horiguchi for review). Backpatch to v15 where a0ffa885e came in.
Discussion: https://postgr.es/m/CABwTF4VEpwTHhRQ+q5MiC5ucngN-whN-PdcKeufX7eLSoAfbZA@mail.gmail.com
2022-07-25 16:27:43 +02:00
|
|
|
* Connect to the selected database, specifying no particular user
|
2006-04-06 22:38:00 +02:00
|
|
|
*
|
|
|
|
* Note: if we have selected a just-deleted database (due to using
|
|
|
|
* stale stats info), we'll fail and exit here.
|
2005-07-14 07:13:45 +02:00
|
|
|
*/
|
2023-10-11 05:31:49 +02:00
|
|
|
InitPostgres(NULL, dbid, NULL, InvalidOid, 0, dbname);
|
2005-07-14 07:13:45 +02:00
|
|
|
SetProcessingMode(NormalProcessing);
|
2020-03-11 16:36:40 +01:00
|
|
|
set_ps_display(dbname);
|
2006-04-27 17:57:10 +02:00
|
|
|
ereport(DEBUG1,
|
2021-02-17 11:24:46 +01:00
|
|
|
(errmsg_internal("autovacuum: processing database \"%s\"", dbname)));
|
2005-08-11 23:11:50 +02:00
|
|
|
|
2007-10-24 22:55:36 +02:00
|
|
|
if (PostAuthDelay)
|
|
|
|
pg_usleep(PostAuthDelay * 1000000L);
|
|
|
|
|
2007-02-16 00:23:23 +01:00
|
|
|
/* And do an appropriate amount of work */
|
2021-02-15 01:03:10 +01:00
|
|
|
recentXid = ReadNextTransactionId();
|
Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many records as there
exist updated copies of the tuple.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
|
|
|
recentMulti = ReadNextMultiXactId();
|
2007-03-29 00:17:12 +02:00
|
|
|
do_autovacuum();
|
2005-07-14 07:13:45 +02:00
|
|
|
}
|
|
|
|
|
2007-02-16 00:23:23 +01:00
|
|
|
/*
|
2007-05-02 17:47:14 +02:00
|
|
|
* The launcher will be notified of my death in ProcKill, *if* we managed
|
|
|
|
* to get a worker slot at all
|
2007-02-16 00:23:23 +01:00
|
|
|
*/
|
|
|
|
|
|
|
|
/* All done, go away */
|
2005-07-14 07:13:45 +02:00
|
|
|
proc_exit(0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2007-05-04 04:06:13 +02:00
|
|
|
* Return a WorkerInfo to the free list
|
|
|
|
*/
|
2007-04-16 20:30:04 +02:00
|
|
|
static void
|
|
|
|
FreeWorkerInfo(int code, Datum arg)
|
|
|
|
{
|
|
|
|
if (MyWorkerInfo != NULL)
|
|
|
|
{
|
|
|
|
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
|
|
|
|
|
|
|
|
/*
|
2007-06-25 18:09:03 +02:00
|
|
|
* Wake the launcher up so that it can launch a new worker immediately
|
|
|
|
* if required. We only save the launcher's PID in local memory here;
|
|
|
|
* the actual signal will be sent when the PGPROC is recycled. Note
|
|
|
|
* that we always do this, so that the launcher can rebalance the cost
|
|
|
|
* limit setting of the remaining workers.
|
2007-04-16 20:30:04 +02:00
|
|
|
*
|
|
|
|
* We somewhat ignore the risk that the launcher changes its PID
|
2010-11-20 04:28:20 +01:00
|
|
|
* between us reading it and the actual kill; we expect ProcKill to be
|
2007-04-16 20:30:04 +02:00
|
|
|
* called shortly after us, and we assume that PIDs are not reused too
|
|
|
|
* quickly after a process exits.
|
|
|
|
*/
|
2007-06-25 18:09:03 +02:00
|
|
|
AutovacuumLauncherPid = AutoVacuumShmem->av_launcherpid;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2012-10-19 01:04:20 +02:00
|
|
|
dlist_delete(&MyWorkerInfo->wi_links);
|
2007-04-16 20:30:04 +02:00
|
|
|
MyWorkerInfo->wi_dboid = InvalidOid;
|
|
|
|
MyWorkerInfo->wi_tableoid = InvalidOid;
|
2016-05-10 21:23:54 +02:00
|
|
|
MyWorkerInfo->wi_sharedrel = false;
|
2007-10-24 21:08:25 +02:00
|
|
|
MyWorkerInfo->wi_proc = NULL;
|
2007-04-16 20:30:04 +02:00
|
|
|
MyWorkerInfo->wi_launchtime = 0;
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
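The balancing this commit message describes reduces, per worker, to dividing the configured cost limit by the number of workers participating in balancing, clamped so each worker keeps at least some budget. A minimal sketch under that assumption (the function name and exact clamping are hypothetical, not the actual PostgreSQL implementation):

```c
#include <assert.h>

/*
 * Hypothetical sketch: share base_limit evenly among the workers that are
 * vacuuming tables without cost-related storage parameters, never letting a
 * worker's share drop below 1.
 */
static int
balanced_cost_limit(int base_limit, int nworkers_for_balance)
{
	int			limit;

	if (nworkers_for_balance < 1)
		nworkers_for_balance = 1;	/* avoid division by zero */
	limit = base_limit / nworkers_for_balance;
	return (limit < 1) ? 1 : limit;
}
```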
|
|
|
pg_atomic_clear_flag(&MyWorkerInfo->wi_dobalance);
|
2012-10-19 01:04:20 +02:00
|
|
|
dlist_push_head(&AutoVacuumShmem->av_freeWorkers,
|
|
|
|
&MyWorkerInfo->wi_links);
|
2007-04-16 20:30:04 +02:00
|
|
|
/* not mine anymore */
|
|
|
|
MyWorkerInfo = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* now that we're inactive, cause a rebalancing of the surviving
|
|
|
|
* workers
|
|
|
|
*/
|
2007-06-25 18:09:03 +02:00
|
|
|
AutoVacuumShmem->av_signal[AutoVacRebalance] = true;
|
2007-04-16 20:30:04 +02:00
|
|
|
LWLockRelease(AutovacuumLock);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2023-04-07 00:54:53 +02:00
|
|
|
* Update vacuum cost-based delay-related parameters for autovacuum workers and
|
|
|
|
* backends executing VACUUM or ANALYZE using the value of relevant GUCs and
|
|
|
|
* global state. This must be called during setup for vacuum and after every
|
|
|
|
* config reload to ensure up-to-date values.
|
2007-04-16 20:30:04 +02:00
|
|
|
*/
|
|
|
|
void
|
2023-04-07 00:54:53 +02:00
|
|
|
VacuumUpdateCosts(void)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
|
|
|
if (MyWorkerInfo)
|
|
|
|
{
|
2023-04-07 01:00:21 +02:00
|
|
|
if (av_storage_param_cost_delay >= 0)
|
|
|
|
vacuum_cost_delay = av_storage_param_cost_delay;
|
|
|
|
else if (autovacuum_vac_cost_delay >= 0)
|
|
|
|
vacuum_cost_delay = autovacuum_vac_cost_delay;
|
|
|
|
else
|
|
|
|
/* fall back to VacuumCostDelay */
|
|
|
|
vacuum_cost_delay = VacuumCostDelay;
|
|
|
|
|
|
|
|
AutoVacuumUpdateCostLimit();
|
2023-04-07 00:54:53 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Must be explicit VACUUM or ANALYZE */
|
|
|
|
vacuum_cost_delay = VacuumCostDelay;
|
|
|
|
vacuum_cost_limit = VacuumCostLimit;
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
2023-04-07 01:00:21 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If configuration changes are allowed to impact VacuumCostActive, make
|
|
|
|
* sure it is updated.
|
|
|
|
*/
|
|
|
|
if (VacuumFailsafeActive)
|
|
|
|
Assert(!VacuumCostActive);
|
|
|
|
else if (vacuum_cost_delay > 0)
|
|
|
|
VacuumCostActive = true;
|
|
|
|
else
|
|
|
|
{
|
|
|
|
VacuumCostActive = false;
|
|
|
|
VacuumCostBalance = 0;
|
|
|
|
}
|
|
|
|
|
2023-04-20 15:45:44 +02:00
|
|
|
/*
|
|
|
|
* Since the cost logging requires a lock, avoid rendering the log message
|
|
|
|
* in case we are using a message level where the log wouldn't be emitted.
|
|
|
|
*/
|
|
|
|
if (MyWorkerInfo && message_level_is_interesting(DEBUG2))
|
2023-04-07 01:00:21 +02:00
|
|
|
{
|
|
|
|
Oid dboid,
|
|
|
|
tableoid;
|
|
|
|
|
|
|
|
Assert(!LWLockHeldByMe(AutovacuumLock));
|
|
|
|
|
|
|
|
LWLockAcquire(AutovacuumLock, LW_SHARED);
|
|
|
|
dboid = MyWorkerInfo->wi_dboid;
|
|
|
|
tableoid = MyWorkerInfo->wi_tableoid;
|
|
|
|
LWLockRelease(AutovacuumLock);
|
|
|
|
|
|
|
|
elog(DEBUG2,
|
|
|
|
"Autovacuum VacuumUpdateCosts(db=%u, rel=%u, dobalance=%s, cost_limit=%d, cost_delay=%g, active=%s, failsafe=%s)",
|
|
|
|
dboid, tableoid, pg_atomic_unlocked_test_flag(&MyWorkerInfo->wi_dobalance) ? "no" : "yes",
|
|
|
|
vacuum_cost_limit, vacuum_cost_delay,
|
|
|
|
vacuum_cost_delay > 0 ? "yes" : "no",
|
|
|
|
VacuumFailsafeActive ? "yes" : "no");
|
|
|
|
}
|
2007-04-16 20:30:04 +02:00
|
|
|
}
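The delay-selection logic in VacuumUpdateCosts follows a simple precedence chain: a per-table autovacuum_vacuum_cost_delay storage parameter wins, then the autovacuum_vacuum_cost_delay GUC, then plain vacuum_cost_delay, with a negative value meaning "not set". A standalone sketch of that chain (function name hypothetical):

```c
#include <assert.h>

/*
 * Sketch of the precedence used in VacuumUpdateCosts: per-table storage
 * parameter first, then the autovacuum GUC, then the generic vacuum GUC.
 * A negative argument means "not set"; zero is a valid (disabled) delay.
 */
static double
effective_cost_delay(double storage_param, double autovac_guc, double vacuum_guc)
{
	if (storage_param >= 0)
		return storage_param;
	if (autovac_guc >= 0)
		return autovac_guc;
	return vacuum_guc;			/* fall back to vacuum_cost_delay */
}
```

Note the asymmetry with the cost *limit*, where (as the comment in AutoVacuumUpdateCostLimit says) zero also means "use the value from elsewhere", because zero is not a valid limit.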
|
|
|
|
|
|
|
|
/*
|
2023-04-07 01:00:21 +02:00
|
|
|
* Update vacuum_cost_limit with the correct value for an autovacuum worker,
|
|
|
|
* given the value of other relevant cost limit parameters and the number of
|
|
|
|
* workers across which the limit must be balanced. Autovacuum workers must
|
|
|
|
* call this regularly in case av_nworkersForBalance has been updated by
|
|
|
|
* another worker or by the autovacuum launcher. They must also call it after a
|
|
|
|
* config reload.
|
2007-04-16 20:30:04 +02:00
|
|
|
*/
|
2023-04-07 01:00:21 +02:00
|
|
|
void
|
|
|
|
AutoVacuumUpdateCostLimit(void)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2023-04-07 01:00:21 +02:00
|
|
|
if (!MyWorkerInfo)
|
|
|
|
return;
|
|
|
|
|
2007-06-08 23:09:49 +02:00
|
|
|
/*
|
|
|
|
* note: in cost_limit, zero also means use value from elsewhere, because
|
|
|
|
* zero is not a valid value.
|
|
|
|
*/
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2023-04-07 01:00:21 +02:00
|
|
|
if (av_storage_param_cost_limit > 0)
|
|
|
|
vacuum_cost_limit = av_storage_param_cost_limit;
|
|
|
|
else
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
|
|
|
int nworkers_for_balance;
|
|
|
|
|
|
|
|
if (autovacuum_vac_cost_limit > 0)
|
|
|
|
vacuum_cost_limit = autovacuum_vac_cost_limit;
|
|
|
|
else
|
|
|
|
vacuum_cost_limit = VacuumCostLimit;
|
|
|
|
|
|
|
|
/* Only balance limit if no cost-related storage parameters specified */
|
|
|
|
if (pg_atomic_unlocked_test_flag(&MyWorkerInfo->wi_dobalance))
|
|
|
|
return;
|
2012-10-16 22:36:30 +02:00
|
|
|
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
|
|
|
Assert(vacuum_cost_limit > 0);
|
|
|
|
|
|
|
|
nworkers_for_balance = pg_atomic_read_u32(&AutoVacuumShmem->av_nworkersForBalance);
|
|
|
|
|
|
|
|
/* There is at least 1 autovac worker (this worker) */
|
|
|
|
if (nworkers_for_balance <= 0)
|
|
|
|
elog(ERROR, "nworkers_for_balance must be > 0");
|
|
|
|
|
|
|
|
vacuum_cost_limit = Max(vacuum_cost_limit / nworkers_for_balance, 1);
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
|
|
|
}
|
Don't balance vacuum cost delay when per-table settings are in effect
When there are cost-delay-related storage options set for a table,
trying to make that table participate in the autovacuum cost-limit
balancing algorithm produces undesirable results: instead of using the
configured values, the global values are always used,
as illustrated by Mark Kirkwood in
http://www.postgresql.org/message-id/52FACF15.8020507@catalyst.net.nz
Since the mechanism is already complicated, just disable it for those
cases rather than trying to make it cope. There are undesirable
side-effects from this too, namely that the total I/O impact on the
system will be higher whenever such tables are vacuumed. However, this
is seen as less harmful than slowing down vacuum, because that would
cause bloat to accumulate. Anyway, in the new system it is possible to
tweak options to get the precise behavior one wants, whereas with the
previous system one was simply hosed.
This has been broken forever, so backpatch to all supported branches.
This might affect systems where cost_limit and cost_delay have been set
for individual tables.
2014-10-03 18:01:27 +02:00
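The opt-out behavior described above can be condensed into one decision: a table with its own cost-related storage parameters uses them directly and never participates in balancing; only tables without them share the global limit. A hypothetical sketch (the name `effective_cost_limit` and its parameters are invented for illustration, not taken from the source):

```c
#include <assert.h>

static int
effective_cost_limit(int table_cost_limit,	/* per-table reloption; 0 = unset */
					 int global_cost_limit,
					 int nworkers_for_balance)
{
	/* A table with its own cost limit opts out of balancing entirely. */
	if (table_cost_limit > 0)
		return table_cost_limit;

	/* Otherwise the global limit is shared among participating workers. */
	{
		int			limit = global_cost_limit / nworkers_for_balance;

		return (limit > 0) ? limit : 1;
	}
}
```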

/*
 * autovac_recalculate_workers_for_balance
 *		Recalculate the number of workers to consider, given cost-related
 *		storage parameters and the current number of active workers.
 *
 * Caller must hold the AutovacuumLock in at least shared mode to access
 * worker->wi_proc.
 */
static void
autovac_recalculate_workers_for_balance(void)
{
	dlist_iter	iter;
	int			orig_nworkers_for_balance;
	int			nworkers_for_balance = 0;

	Assert(LWLockHeldByMe(AutovacuumLock));

	orig_nworkers_for_balance =
		pg_atomic_read_u32(&AutoVacuumShmem->av_nworkersForBalance);

	dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
	{
		WorkerInfo	worker = dlist_container(WorkerInfoData, wi_links, iter.cur);

		if (worker->wi_proc == NULL ||
			pg_atomic_unlocked_test_flag(&worker->wi_dobalance))
			continue;

		nworkers_for_balance++;
	}

	if (nworkers_for_balance != orig_nworkers_for_balance)
		pg_atomic_write_u32(&AutoVacuumShmem->av_nworkersForBalance,
							nworkers_for_balance);
}
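The loop above is a count-matching-elements pass followed by a write to shared memory only when the count actually changed (to avoid needless cache-line traffic on the atomic). A simplified, self-contained sketch of that counting step, using an invented `fake_worker` struct in place of `WorkerInfoData`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for WorkerInfoData: just the two fields we need. */
typedef struct
{
	bool		active;			/* wi_proc != NULL in the real code */
	bool		dobalance;		/* participates in cost-limit balancing */
} fake_worker;

static const fake_worker sample_workers[] = {
	{true, true},				/* running, balanced: counted */
	{true, false},				/* running, per-table settings: skipped */
	{false, true},				/* slot not running: skipped */
};

/* Count workers that are both running and participating in balancing. */
static int
count_workers_for_balance(const fake_worker *workers, size_t n)
{
	int			count = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (workers[i].active && workers[i].dobalance)
			count++;
	}
	return count;
}
```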

/*
 * get_database_list
 *		Return a list of all databases found in pg_database.
 *
 * The list and associated data is allocated in the caller's memory context,
 * which is in charge of ensuring that it's properly cleaned up afterwards.
 *
 * Note: this is the only function in which the autovacuum launcher uses a
 * transaction.  Although we aren't attached to any particular database and
 * therefore can't access most catalogs, we do have enough infrastructure
 * to do a seqscan on pg_database.
 */
static List *
get_database_list(void)
{
	List	   *dblist = NIL;
	Relation	rel;
tableam: Add and use scan APIs.
To allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends need
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AMs
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
also would have been needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
	TableScanDesc scan;
	HeapTuple	tup;
	MemoryContext resultcxt;

	/* This is the context that we will allocate our output data in */
	resultcxt = CurrentMemoryContext;

snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot look at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.
Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.
The trick this commit introduces is to delay computation of accurate horizons
until their use, and to use horizon boundaries to determine whether accurate
horizons need to be computed.
The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaced with new GlobalVisTest* functions. These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.
When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession. As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further tests can benefit from an earlier
computation of accurate horizons.
To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.
This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.
Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-13 01:03:49 +02:00

	/*
	 * Start a transaction so we can access pg_database, and get a snapshot.
	 * We don't have a use for the snapshot itself, but we're interested in
	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
	 * for anything that reads heap pages, because HOT may decide to prune
	 * them even if the process doesn't attempt to modify any tuples.)
	 *
	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
	 * e.g. be cleared when cache invalidations are processed).
	 */
	StartTransactionCommand();
	(void) GetTransactionSnapshot();

	rel = table_open(DatabaseRelationId, AccessShareLock);
	scan = table_beginscan_catalog(rel, 0, NULL);

	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
	{
		Form_pg_database pgdatabase = (Form_pg_database) GETSTRUCT(tup);
		avw_dbase  *avdb;
		MemoryContext oldcxt;

Handle DROP DATABASE getting interrupted
Until now, when DROP DATABASE got interrupted at the wrong moment, the removal
of the pg_database row would also roll back, even though some irreversible
steps had already been taken. E.g. DropDatabaseBuffers() might have thrown
out dirty buffers, or files could have been unlinked. But we continued to
allow connections to such a corrupted database.
To fix this, mark databases invalid with an in-place update, just before
starting to perform irreversible steps. As we can't add a new column in the
back branches, we use pg_database.datconnlimit = -2 for this purpose.
An invalid database cannot be connected to anymore, but can still be
dropped.
Unfortunately we can't easily add output to psql's \l to indicate that some
database is invalid, as it doesn't fit in any of the existing columns.
Add tests verifying that an interrupted DROP DATABASE is handled correctly in
the backend and in various tools.
Reported-by: Evgeny Morozov <postgresql3@realityexists.net>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20230509004637.cgvmfwrbht7xm7p6@awork3.anarazel.de
Discussion: https://postgr.es/m/20230314174521.74jl6ffqsee5mtug@awork3.anarazel.de
Backpatch: 11-, bug present in all supported versions
2023-07-13 22:03:28 +02:00
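The datconnlimit = -2 convention described above boils down to a simple predicate. The sketch below is illustrative, not PostgreSQL's code: the struct is a mock of the relevant pg_database field, and the function mirrors what a database_is_invalid_form()-style check has to do.

```c
#include <assert.h>
#include <stdbool.h>

#define MINI_DATCONNLIMIT_INVALID_DB (-2)   /* value the commit reserves */

/* mock of the one pg_database field this check needs */
typedef struct MiniPgDatabase
{
    int         datconnlimit;
} MiniPgDatabase;

/* an invalid database may not be connected to, but may still be dropped */
static bool
mini_database_is_invalid(const MiniPgDatabase *db)
{
    return db->datconnlimit == MINI_DATCONNLIMIT_INVALID_DB;
}
```

Overloading an existing column this way is what makes the fix back-patchable: no catalog change is needed in released branches.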
|
|
|
/*
|
|
|
|
* If database has partially been dropped, we can't, nor need to,
|
|
|
|
* vacuum it.
|
|
|
|
*/
|
|
|
|
if (database_is_invalid_form(pgdatabase))
|
|
|
|
{
|
|
|
|
elog(DEBUG2,
|
|
|
|
"autovacuum: skipping invalid database \"%s\"",
|
|
|
|
NameStr(pgdatabase->datname));
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2010-11-08 22:35:42 +01:00
|
|
|
/*
|
|
|
|
* Allocate our results in the caller's context, not the
|
|
|
|
* transaction's. We do this inside the loop, and restore the original
|
|
|
|
* context at the end, so that leaky things like heap_getnext() are
|
|
|
|
* not called in a potentially long-lived context.
|
|
|
|
*/
|
|
|
|
oldcxt = MemoryContextSwitchTo(resultcxt);
|
2005-07-14 07:13:45 +02:00
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
avdb = (avw_dbase *) palloc(sizeof(avw_dbase));
|
2005-07-14 07:13:45 +02:00
|
|
|
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
was already painful for the existing code, and the upcoming work aiming
to make table storage pluggable would have required expanding and
duplicating that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used); only oids assigned later will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide the oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to merge this
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
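The practical upshot of the change above for call sites like the one below (avdb->adw_datid = pgdatabase->oid) can be sketched with mocked types: the oid becomes an ordinary leading column of the catalog struct, read directly instead of through the removed HeapTupleGetOid() header accessor. The struct here is a toy, not the real Form_pg_database.

```c
#include <assert.h>

typedef unsigned int Oid;

/* mock of a post-change catalog row layout: oid is just the first column */
typedef struct MiniFormPgDatabase
{
    Oid         oid;            /* lived in the tuple header before the change */
    int         datconnlimit;
} MiniFormPgDatabase;

static Oid
mini_get_db_oid(const MiniFormPgDatabase *form)
{
    return form->oid;           /* plain member access, no special accessor */
}
```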
|
|
|
avdb->adw_datid = pgdatabase->oid;
|
2009-08-31 21:41:00 +02:00
|
|
|
avdb->adw_name = pstrdup(NameStr(pgdatabase->datname));
|
|
|
|
avdb->adw_frozenxid = pgdatabase->datfrozenxid;
|
2013-09-16 20:45:00 +02:00
|
|
|
avdb->adw_minmulti = pgdatabase->datminmxid;
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
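The per-table anti-wraparound rule the commit describes reduces to an age comparison. This is a simplified sketch, not PostgreSQL's code: XID arithmetic here uses plain unsigned integers (whose wraparound subtraction happens to behave sensibly), whereas the real code goes through TransactionId helpers.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int MiniTransactionId;

/* force a vacuum once the table's relfrozenxid has aged past the limit */
static bool
mini_force_antiwraparound_vacuum(MiniTransactionId relfrozenxid,
                                 MiniTransactionId next_xid,
                                 unsigned int freeze_max_age)
{
    /* unsigned subtraction yields the age even across wraparound */
    return next_xid - relfrozenxid > freeze_max_age;
}
```

Doing this per table, rather than per database, is what lets autovacuum freeze only the relations that actually need it.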
|
|
|
/* this gets set later: */
|
2007-04-16 20:30:04 +02:00
|
|
|
avdb->adw_entry = NULL;
|
2005-07-14 07:13:45 +02:00
|
|
|
|
2007-03-27 22:36:03 +02:00
|
|
|
dblist = lappend(dblist, avdb);
|
2010-11-08 22:35:42 +01:00
|
|
|
MemoryContextSwitchTo(oldcxt);
|
2005-07-14 07:13:45 +02:00
|
|
|
}
|
|
|
|
|
tableam: Add and use scan APIs.
To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends needs
to be set up without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore, that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AM's
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
would also have needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will do so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
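The "base class" descriptor plus callback-routine layout that the commit message describes for TableScanDesc can be shown in a self-contained miniature. Every name below is invented (MiniScanDesc, MiniAmRoutine, an array-backed "AM"); only the shape matches the real design: the AM subclasses the common descriptor by embedding it, and AM-independent code drives the scan purely through function pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct MiniScanDesc;                    /* AM-independent "base class" */

/* stand-in for the scan_begin/scan_getnextslot/scan_end callbacks */
typedef struct MiniAmRoutine
{
    struct MiniScanDesc *(*scan_begin) (const int *data, int ntuples);
    bool        (*scan_getnextslot) (struct MiniScanDesc *scan, int *slot);
    void        (*scan_end) (struct MiniScanDesc *scan);
} MiniAmRoutine;

typedef struct MiniScanDesc
{
    const MiniAmRoutine *routine;       /* dispatch table, as in the real API */
} MiniScanDesc;

/* a toy AM over an int array, "subclassing" by embedding the base first */
typedef struct ArrayScanDesc
{
    MiniScanDesc base;
    const int  *data;
    int         ntuples;
    int         cursor;
} ArrayScanDesc;

static MiniScanDesc *array_scan_begin(const int *data, int ntuples);

static bool
array_scan_getnextslot(MiniScanDesc *scan, int *slot)
{
    ArrayScanDesc *ascan = (ArrayScanDesc *) scan;

    if (ascan->cursor >= ascan->ntuples)
        return false;
    *slot = ascan->data[ascan->cursor++];
    return true;
}

static void
array_scan_end(MiniScanDesc *scan)
{
    free(scan);
}

static const MiniAmRoutine array_am = {
    array_scan_begin, array_scan_getnextslot, array_scan_end
};

static MiniScanDesc *
array_scan_begin(const int *data, int ntuples)
{
    ArrayScanDesc *scan = malloc(sizeof(ArrayScanDesc));

    scan->base.routine = &array_am;
    scan->data = data;
    scan->ntuples = ntuples;
    scan->cursor = 0;
    return &scan->base;
}

/* AM-independent caller, in the spirit of table_scan_getnextslot() users */
static int
sum_all_tuples(const MiniAmRoutine *am, const int *data, int n)
{
    MiniScanDesc *scan = am->scan_begin(data, n);
    int         slot;
    int         sum = 0;

    while (scan->routine->scan_getnextslot(scan, &slot))
        sum += slot;
    scan->routine->scan_end(scan);
    return sum;
}
```

Note how sum_all_tuples() never mentions ArrayScanDesc; a second AM could be swapped in by providing a different routine struct, which is the whole point of the abstraction.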
|
|
|
table_endscan(scan);
|
2019-01-21 19:32:19 +01:00
|
|
|
table_close(rel, AccessShareLock);
|
2009-08-31 21:41:00 +02:00
|
|
|
|
|
|
|
CommitTransactionCommand();
|
2005-07-14 07:13:45 +02:00
|
|
|
|
2022-08-31 22:23:20 +02:00
|
|
|
/* Be sure to restore caller's memory context */
|
|
|
|
MemoryContextSwitchTo(resultcxt);
|
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
return dblist;
|
|
|
|
}
|
|
|
|
|
2005-08-11 23:11:50 +02:00
|
|
|
/*
|
|
|
|
* Process a database table-by-table
|
2005-07-29 21:30:09 +02:00
|
|
|
*
|
2005-07-14 07:13:45 +02:00
|
|
|
* Note that CHECK_FOR_INTERRUPTS is supposed to be used in certain spots in
|
|
|
|
* order not to ignore shutdown commands for too long.
|
|
|
|
*/
|
|
|
|
static void
|
2007-03-29 00:17:12 +02:00
|
|
|
do_autovacuum(void)
|
2005-07-14 07:13:45 +02:00
|
|
|
{
|
2009-02-09 21:57:59 +01:00
|
|
|
Relation classRel;
|
2005-07-14 07:13:45 +02:00
|
|
|
HeapTuple tuple;
|
2019-03-11 20:46:41 +01:00
|
|
|
TableScanDesc relScan;
|
2006-11-05 23:42:10 +01:00
|
|
|
Form_pg_database dbForm;
|
2007-03-29 00:17:12 +02:00
|
|
|
List *table_oids = NIL;
|
autovacuum: Drop orphan temp tables more quickly but with more caution.
Previously, we only dropped an orphan temp table when it became old
enough to threaten wraparound; now we drop it immediately. The
only value of waiting is that someone might be able to examine the
contents of the orphan temp table for forensic purposes, but it's
pretty difficult to actually do that and few users will wish to do so.
On the flip side, not performing the drop immediately generates log
spam and bloats pg_class.
In addition, per a report from Grigory Smolkin, if a temporary schema
contains a very large number of temporary tables, a backend attempting
to clear the temporary schema might fail due to lock table exhaustion.
It's helpful for autovacuum to clean up after such cases, and we don't
want it to wait for wraparound to threaten before doing so. To
prevent autovacuum from failing in the same manner as a backend trying
to drop an entire temp schema, remove orphan temp tables in batches of
50, committing after each batch, so that we don't accumulate an
unbounded number of locks. If a drop fails, retry other orphan tables
that need to be dropped up to 10 times before giving up. With this
system, if a backend does fail to clean a temporary schema due to
lock table exhaustion, autovacuum should hopefully put things right
the next time it processes the database.
Discussion: CAB7nPqSbYT6dRwsXVgiKmBdL_ARemfDZMPA+RPeC_ge0GK70hA@mail.gmail.com
Michael Paquier, with a bunch of comment changes by me.
2016-11-21 18:54:19 +01:00
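The batching-with-retry strategy above (batches of 50, commit after each, up to 10 retry passes) can be sketched generically. This is not the autovacuum code itself: the drop function is a stand-in, and the "commit" is just a comment marking where the real worker would call CommitTransactionCommand() to release accumulated locks.

```c
#include <assert.h>
#include <stdbool.h>

#define MINI_DROP_BATCH_SIZE 50     /* batch size quoted in the message */
#define MINI_MAX_DROP_RETRIES 10    /* retry limit quoted in the message */

/* returns the number of items successfully "dropped" */
static int
mini_drop_in_batches(int nitems, bool *done, bool (*try_drop) (int item))
{
    int         dropped = 0;

    for (int attempt = 0;
         attempt < MINI_MAX_DROP_RETRIES && dropped < nitems;
         attempt++)
    {
        int         in_batch = 0;

        for (int i = 0; i < nitems; i++)
        {
            if (done[i])
                continue;
            if (try_drop(i))
            {
                done[i] = true;
                dropped++;
            }
            /* real worker commits the transaction here, capping lock usage */
            if (++in_batch == MINI_DROP_BATCH_SIZE)
                in_batch = 0;
        }
    }
    return dropped;
}

/* stand-in drop that fails on its first attempt per item, exercising retry */
static int  mini_drop_attempts[16];

static bool
mini_flaky_drop(int item)
{
    return mini_drop_attempts[item]++ > 0;
}
```

The bounded lock footprint per batch is what prevents autovacuum from hitting the same lock-table exhaustion that defeated the backend's own cleanup.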
|
|
|
List *orphan_oids = NIL;
|
2008-08-13 02:07:50 +02:00
|
|
|
HASHCTL ctl;
|
|
|
|
HTAB *table_toast_map;
|
2007-07-01 04:20:59 +02:00
|
|
|
ListCell *volatile cell;
|
2007-05-30 22:12:03 +02:00
|
|
|
BufferAccessStrategy bstrategy;
|
2008-08-13 02:07:50 +02:00
|
|
|
ScanKeyData key;
|
2009-02-09 21:57:59 +01:00
|
|
|
TupleDesc pg_class_desc;
|
2015-05-08 18:09:14 +02:00
|
|
|
int effective_multixact_freeze_max_age;
|
2017-01-20 21:55:45 +01:00
|
|
|
bool did_vacuum = false;
|
|
|
|
bool found_concurrent_worker = false;
|
2017-08-15 23:14:07 +02:00
|
|
|
int i;
|
2007-03-27 22:36:03 +02:00
|
|
|
|
2007-06-30 06:08:05 +02:00
|
|
|
/*
|
|
|
|
* StartTransactionCommand and CommitTransactionCommand will automatically
|
|
|
|
* switch to other contexts. We need this one to keep the list of
|
|
|
|
* relations to vacuum/analyze across transactions.
|
|
|
|
*/
|
|
|
|
AutovacMemCxt = AllocSetContextCreate(TopMemoryContext,
|
2021-11-22 16:55:36 +01:00
|
|
|
"Autovacuum worker",
|
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
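The macro trick the commit introduces, folding an error-prone triple of sizing parameters into one token, can be shown in miniature. The context type and macro below are mocks echoing ALLOCSET_DEFAULT_SIZES, not PostgreSQL's definitions; the point is that a macro expanding to multiple arguments removes the copy-and-paste surface where the original typos crept in.

```c
#include <assert.h>
#include <stddef.h>

typedef struct MiniContext
{
    size_t      min_size;
    size_t      init_size;
    size_t      max_size;
} MiniContext;

/* expands to the three trailing arguments, like ALLOCSET_DEFAULT_SIZES */
#define MINI_DEFAULT_SIZES  0, (8 * 1024), (8 * 1024 * 1024)

static MiniContext
mini_context_create(const char *name,
                    size_t min_size, size_t init_size, size_t max_size)
{
    (void) name;                /* a real implementation would store this */
    MiniContext cxt = {min_size, init_size, max_size};

    return cxt;
}
```

A call then reads mini_context_create("Autovacuum worker", MINI_DEFAULT_SIZES), with no individual numbers to mistype at the call site.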
|
|
|
ALLOCSET_DEFAULT_SIZES);
|
2007-06-30 06:08:05 +02:00
|
|
|
MemoryContextSwitchTo(AutovacMemCxt);
|
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/* Start a transaction so our commands have one to play into. */
|
|
|
|
StartTransactionCommand();
|
|
|
|
|
2015-05-08 18:09:14 +02:00
|
|
|
/*
|
|
|
|
* Compute the multixact age for which freezing is urgent. This is
|
|
|
|
* normally autovacuum_multixact_freeze_max_age, but may be less if we are
|
|
|
|
* short of multixact member space.
|
|
|
|
*/
|
|
|
|
effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
|
|
|
|
|
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
2009-01-16 14:27:24 +01:00
|
|
|
* Find the pg_database entry and select the default freeze ages. We use
|
2006-11-05 23:42:10 +01:00
|
|
|
* zero in template and nonconnectable databases, else the system-wide
|
|
|
|
* default.
|
|
|
|
*/
|
2010-02-14 19:42:19 +01:00
|
|
|
tuple = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
|
2006-11-05 23:42:10 +01:00
|
|
|
if (!HeapTupleIsValid(tuple))
|
|
|
|
elog(ERROR, "cache lookup failed for database %u", MyDatabaseId);
|
|
|
|
dbForm = (Form_pg_database) GETSTRUCT(tuple);
|
|
|
|
|
|
|
|
if (dbForm->datistemplate || !dbForm->datallowconn)
|
2009-01-16 14:27:24 +01:00
|
|
|
{
|
2006-11-05 23:42:10 +01:00
|
|
|
default_freeze_min_age = 0;
|
2009-01-16 14:27:24 +01:00
|
|
|
default_freeze_table_age = 0;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
the two reaches the 150 million mark first will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when to force emergency,
uninterruptible whole-table scans (default to 400 million, double
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
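How the three multixact thresholds above relate can be sketched as a tiered decision, using the default values quoted in the message (5 million, 150 million, 400 million). The enum and function are invented for illustration; the real logic is spread across vacuum and autovacuum rather than living in one function.

```c
#include <assert.h>

enum MiniFreezeAction
{
    MINI_FREEZE_NONE,
    MINI_FREEZE_TUPLES,     /* vacuum_multixact_freeze_min_age reached */
    MINI_FREEZE_TABLE,      /* vacuum_multixact_freeze_table_age reached */
    MINI_FREEZE_EMERGENCY   /* autovacuum_multixact_freeze_max_age reached */
};

static enum MiniFreezeAction
mini_multixact_freeze_action(unsigned int multixact_age)
{
    if (multixact_age > 400000000u)     /* default max age: 400 million */
        return MINI_FREEZE_EMERGENCY;
    if (multixact_age > 150000000u)     /* default table age: 150 million */
        return MINI_FREEZE_TABLE;
    if (multixact_age > 5000000u)       /* default min age: 5 million */
        return MINI_FREEZE_TUPLES;
    return MINI_FREEZE_NONE;
}
```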
|
|
|
default_multixact_freeze_min_age = 0;
|
|
|
|
default_multixact_freeze_table_age = 0;
|
2009-01-16 14:27:24 +01:00
|
|
|
}
|
2006-11-05 23:42:10 +01:00
|
|
|
else
|
2009-01-16 14:27:24 +01:00
|
|
|
{
|
2006-11-05 23:42:10 +01:00
|
|
|
default_freeze_min_age = vacuum_freeze_min_age;
|
2009-01-16 14:27:24 +01:00
|
|
|
default_freeze_table_age = vacuum_freeze_table_age;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids). Whichever of
the two reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when to force emergency,
uninterruptible whole-table scans (default to 400 million, double
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
|
|
|
default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
|
|
|
|
default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
|
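The assignments above capture the GUC values as per-worker defaults; individual tables can override them via their autovacuum reloptions when they are examined later. A minimal sketch of that fallback rule (hypothetical helper name, not the actual autovacuum code):

```c
#include <assert.h>

/*
 * Toy illustration of the reloptions-over-GUC fallback: autovacuum
 * reloptions use -1 to mean "not set for this table", in which case
 * the default captured from the GUC at worker start applies.
 */
static int
effective_freeze_age(int reloption_value, int guc_default)
{
    return (reloption_value >= 0) ? reloption_value : guc_default;
}
```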
2009-01-16 14:27:24 +01:00
|
|
|
}
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios.
2006-11-05 23:42:10 +01:00
|
|
|
|
|
|
|
ReleaseSysCache(tuple);
|
|
|
|
|
2007-06-30 06:08:05 +02:00
|
|
|
/* StartTransactionCommand changed elsewhere */
|
2005-07-14 07:13:45 +02:00
|
|
|
MemoryContextSwitchTo(AutovacMemCxt);
|
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
classRel = table_open(RelationRelationId, AccessShareLock);
|
2009-02-09 21:57:59 +01:00
|
|
|
|
|
|
|
/* create a copy so we can use it after closing pg_class */
|
|
|
|
pg_class_desc = CreateTupleDescCopy(RelationGetDescr(classRel));
|
2005-07-14 07:13:45 +02:00
|
|
|
|
2008-08-13 02:07:50 +02:00
|
|
|
/* create hash table for toast <-> main relid mapping */
|
|
|
|
ctl.keysize = sizeof(Oid);
|
2009-02-09 21:57:59 +01:00
|
|
|
ctl.entrysize = sizeof(av_relation);
|
2008-08-13 02:07:50 +02:00
|
|
|
|
|
|
|
table_toast_map = hash_create("TOAST to main relid map",
|
|
|
|
100,
|
|
|
|
&ctl,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash, which is tedious and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS);
|
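The HASH_BLOBS flag above tells dynahash the key (here an Oid) is plain binary data, letting hash_create() substitute an optimized function for 4-byte keys in place of a generic byte-wise hash. A self-contained toy contrast of the two kinds of hashing (illustrative only, not PostgreSQL's actual implementations):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic byte-wise hash (FNV-1a) for binary keys of arbitrary size:
 * the kind of fallback used when the key length isn't special-cased. */
static uint32_t
hash_bytes(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint32_t    h = 2166136261u;

    for (size_t i = 0; i < len; i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Specialized hash for exactly-4-byte keys (e.g. Oid): one integer mix
 * instead of a byte loop -- the kind of optimization HASH_BLOBS enables. */
static uint32_t
hash_uint32_like(uint32_t k)
{
    k ^= k >> 16;
    k *= 0x85ebca6bu;
    k ^= k >> 13;
    k *= 0xc2b2ae35u;
    k ^= k >> 16;
    return k;
}
```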
2008-08-13 02:07:50 +02:00
|
|
|
|
2005-08-15 18:25:19 +02:00
|
|
|
/*
|
2008-08-13 02:07:50 +02:00
|
|
|
* Scan pg_class to determine which tables to vacuum.
|
2005-08-15 18:25:19 +02:00
|
|
|
*
|
2021-08-16 23:27:52 +02:00
|
|
|
* We do this in two passes: on the first one we collect the list of plain
|
|
|
|
* relations and materialized views, and on the second one we collect
|
|
|
|
* TOAST tables. The reason for doing the second pass is that during it we
|
|
|
|
* want to use the main relation's pg_class.reloptions entry if the TOAST
|
|
|
|
* table does not have any, and we cannot obtain it unless we know
|
|
|
|
* beforehand what the main table OID is.
|
2005-08-15 18:25:19 +02:00
|
|
|
*
|
2008-08-13 02:07:50 +02:00
|
|
|
* We need to check TOAST tables separately because in cases with short,
|
|
|
|
* wide tables there might be proportionally much more activity in the
|
|
|
|
* TOAST table than in its parent.
|
2005-08-15 18:25:19 +02:00
|
|
|
*/
|
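The two-pass scheme described in the comment can be sketched with toy data (hypothetical types, with a linear scan standing in for the pg_class scan and the TOAST-to-main hash table):

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    int         oid;
    int         toastrelid;     /* 0 if the relation has no TOAST table */
    int         relopts;        /* -1 if the relation has no reloptions */
    int         is_toast;
} ToyRel;

/*
 * Pass 1: remember each main rel's reloptions keyed by its TOAST oid.
 * Pass 2: a TOAST table without reloptions of its own inherits its
 * owner's entry -- the fallback the comment above describes.
 */
static int
toast_effective_opts(const ToyRel *rels, size_t n, int toast_oid)
{
    int         inherited = -1;

    for (size_t i = 0; i < n; i++)      /* pass 1: build the mapping */
        if (!rels[i].is_toast && rels[i].toastrelid == toast_oid)
            inherited = rels[i].relopts;

    for (size_t i = 0; i < n; i++)      /* pass 2: resolve the TOAST rel */
        if (rels[i].is_toast && rels[i].oid == toast_oid)
            return (rels[i].relopts >= 0) ? rels[i].relopts : inherited;
    return -1;
}
```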
tableam: Add and use scan APIs.
To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends need
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AMs
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
would also have needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
|
|
|
relScan = table_beginscan_catalog(classRel, 0, NULL);
|
2005-07-14 07:13:45 +02:00
|
|
|
|
2008-08-13 02:07:50 +02:00
|
|
|
/*
|
2021-08-16 23:27:52 +02:00
|
|
|
* On the first pass, we collect main tables to vacuum, and also the main
|
2008-08-13 02:07:50 +02:00
|
|
|
* table relid to TOAST relid mapping.
|
|
|
|
*/
|
2005-08-11 23:11:50 +02:00
|
|
|
while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
|
|
|
|
{
|
|
|
|
Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
|
|
|
|
PgStat_StatTabEntry *tabentry;
|
2009-02-09 21:57:59 +01:00
|
|
|
AutoVacOpts *relopts;
|
2005-08-11 23:11:50 +02:00
|
|
|
Oid relid;
|
2008-07-01 04:09:34 +02:00
|
|
|
bool dovacuum;
|
|
|
|
bool doanalyze;
|
|
|
|
bool wraparound;
|
2005-08-11 23:11:50 +02:00
|
|
|
|
2013-03-04 01:23:31 +01:00
|
|
|
if (classForm->relkind != RELKIND_RELATION &&
|
2021-08-16 23:27:52 +02:00
|
|
|
classForm->relkind != RELKIND_MATVIEW)
|
2013-03-04 01:23:31 +01:00
|
|
|
continue;
|
|
|
|
|
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing code, but the upcoming work aiming
to make table storage pluggable would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used); only oids assigned later will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide the oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to merge this
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
|
|
|
relid = classForm->oid;
|
2005-07-14 07:13:45 +02:00
|
|
|
|
2008-07-01 04:09:34 +02:00
|
|
|
/*
|
|
|
|
* Check if it is a temp table (presumably, of some other backend's).
|
|
|
|
* We cannot safely process other backends' temp tables.
|
|
|
|
*/
|
2010-12-13 18:34:26 +01:00
|
|
|
if (classForm->relpersistence == RELPERSISTENCE_TEMP)
|
2008-07-01 04:09:34 +02:00
|
|
|
{
|
Make autovacuum more aggressive to remove orphaned temp tables
Commit dafa084, added in 10, made the removal of temporary orphaned
tables more aggressive. This commit takes that aggressiveness a step
further by adding a flag in each backend's MyProc which tracks
down any temporary namespace currently in use. The flag is set when the
namespace gets created and can be reset if the temporary namespace has
been created in a transaction or sub-transaction which is aborted. The
flag value assignment is assumed to be atomic, so this can be done in a
lock-less fashion like other flags already present in PGPROC, such as
databaseId or backendId. Still, the fact that the temporary namespace and
the tables created in it remain locked until the creating transaction
commits acts as a barrier for other backends.
Autovacuum uses this new flag to discard orphaned tables more
aggressively, additionally checking which database a backend is
connected to and whether its temporary namespace is in use; this removes
orphaned temporary relations even if a backend reuses the same slot as
one which created temporary relations in a past session.
The base idea of this patch comes from Robert Haas, has been written in
its first version by Tsunakawa Takayuki, then heavily reviewed by me.
Author: Tsunakawa Takayuki
Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Andres Freund
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05
Backpatch: 11-, as PGPROC gains a new flag and we don't want silent ABI
breakages on already released versions.
2018-08-13 11:49:04 +02:00
|
|
|
/*
|
|
|
|
* We just ignore it if the owning backend is still active and
|
Avoid failure if autovacuum tries to access a just-dropped temp namespace.
Such an access became possible when commit 246a6c8f7 added more
aggressive cleanup of orphaned temp relations by autovacuum.
Since autovacuum's snapshot might be slightly stale, it could
attempt to access an already-dropped temp namespace, resulting in
an assertion failure or null-pointer dereference. (In practice,
since we don't drop temp namespaces automatically but merely
recycle them, this situation could only arise if a superuser does
a manual drop of a temp namespace. Still, that should be allowed.)
The core of the bug, IMO, is that isTempNamespaceInUse and its callers
failed to think hard about whether to treat "temp namespace isn't there"
differently from "temp namespace isn't in use". In hopes of forestalling
future mistakes of the same ilk, replace that function with a new one
checkTempNamespaceStatus, which makes the same tests but returns a
three-way enum rather than just a bool. isTempNamespaceInUse is gone
entirely in HEAD; but just in case some external code is relying on it,
keep it in the back branches, as a bug-compatible wrapper around the
new function.
Per report originally from Prabhat Kumar Sahu, investigated by Mahendra
Singh and Michael Paquier; the final form of the patch is my fault.
This replaces the failed fix attempt in a052f6cbb.
Backpatch as far as v11, as 246a6c8f7 was.
Discussion: https://postgr.es/m/CAKYtNAr9Zq=1-ww4etHo-VCC-k120YxZy5OS01VkaLPaDbv2tg@mail.gmail.com
2020-02-29 02:28:34 +01:00
|
|
|
* using the temporary schema. Also, for safety, ignore it if the
|
|
|
|
* namespace doesn't exist or isn't a temp namespace after all.
|
Make autovacuum more aggressive to remove orphaned temp tables
2018-08-13 11:49:04 +02:00
|
|
|
*/
|
Avoid failure if autovacuum tries to access a just-dropped temp namespace.
2020-02-29 02:28:34 +01:00
|
|
|
if (checkTempNamespaceStatus(classForm->relnamespace) == TEMP_NAMESPACE_IDLE)
|
2008-07-01 04:09:34 +02:00
|
|
|
{
|
|
|
|
/*
|
Code review for early drop of orphaned temp relations in autovacuum.
Commit a734fd5d1 exposed some race conditions that existed previously
in the autovac code, but were basically harmless because autovac would
not try to delete orphaned relations immediately. Specifically, the test
for orphaned-ness was made on a pg_class tuple that might be dead by now,
allowing autovac to try to remove a table that the owning backend had just
finished deleting. This resulted in a hard crash due to inadequate caution
about accessing the table's catalog entries without any lock. We must take
a relation lock and then recheck whether the table is still present and
still looks deletable before we do anything.
Also, it seemed to me that deleting multiple tables per transaction, and
trying to continue after errors, represented unjustifiable complexity.
We do not expect this code path to be taken often in the field, nor even
during testing, which means that prioritizing performance over correctness
is a bad tradeoff. Rip all that out in favor of just starting a new
transaction after each successful temp table deletion. If we're unlucky
enough to get an error, which shouldn't happen anyway now that we're being
more cautious, let the autovacuum worker fail as it normally would.
In passing, improve the order of operations in the initial scan loop.
Now that we don't care about whether a temp table is a wraparound hazard,
there's no need to perform extract_autovac_opts, get_pgstat_tabentry_relid,
or relation_needs_vacanalyze for temp tables.
Also, if GetTempNamespaceBackendId returns InvalidBackendId (indicating
it doesn't recognize the schema as temp), treat that as meaning it's NOT
an orphaned temp table, not that it IS one, which is what happened before
because BackendIdGetProc necessarily failed. The case really shouldn't
come up for a table that has RELPERSISTENCE_TEMP, but the consequences
if it did seem undesirable. (This might represent a back-patchable bug
fix; not sure if it's worth the trouble.)
Discussion: https://postgr.es/m/21299.1480272347@sss.pgh.pa.us
2016-11-28 03:23:39 +01:00
|
|
|
* The table seems to be orphaned -- although it might be that
|
|
|
|
* the owning backend has already deleted it and exited; our
|
|
|
|
* pg_class scan snapshot is not necessarily up-to-date
|
|
|
|
* anymore, so we could be looking at a committed-dead entry.
|
|
|
|
* Remember it so we can try to delete it later.
|
2008-07-01 04:09:34 +02:00
|
|
|
*/
|
autovacuum: Drop orphan temp tables more quickly but with more caution.
Previously, we only dropped an orphan temp table when it became old
enough to threaten wraparound; now we do it immediately. The
only value of waiting is that someone might be able to examine the
contents of the orphan temp table for forensic purposes, but it's
pretty difficult to actually do that and few users will wish to do so.
On the flip side, not performing the drop immediately generates log
spam and bloats pg_class.
In addition, per a report from Grigory Smolkin, if a temporary schema
contains a very large number of temporary tables, a backend attempting
to clear the temporary schema might fail due to lock table exhaustion.
It's helpful for autovacuum to clean up after such cases, and we don't
want it to wait for wraparound to threaten before doing so. To
prevent autovacuum from failing in the same manner as a backend trying
to drop an entire temp schema, remove orphan temp tables in batches of
50, committing after each batch, so that we don't accumulate an
unbounded number of locks. If a drop fails, retry other orphan tables
that need to be dropped up to 10 times before giving up. With this
system, if a backend does fail to clean a temporary schema due to
lock table exhaustion, autovacuum should hopefully put things right
the next time it processes the database.
Discussion: CAB7nPqSbYT6dRwsXVgiKmBdL_ARemfDZMPA+RPeC_ge0GK70hA@mail.gmail.com
Michael Paquier, with a bunch of comment changes by me.
2016-11-21 18:54:19 +01:00
|
|
|
orphan_oids = lappend_oid(orphan_oids, relid);
|
2008-07-01 04:09:34 +02:00
|
|
|
}
|
Code review for early drop of orphaned temp relations in autovacuum.
2016-11-28 03:23:39 +01:00
|
|
|
continue;
|
2008-07-01 04:09:34 +02:00
|
|
|
}
|
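The branch above treats only an idle temp namespace as holding orphaned tables; the three-way status described in the commit message keeps "isn't a temp namespace (or doesn't exist)" and "in use by a live backend" distinct from "idle". A toy sketch of that decision (enum member names follow the code above; the helper is hypothetical):

```c
#include <assert.h>

/* Three-way status of a relation's namespace, as seen by autovacuum. */
typedef enum
{
    TEMP_NAMESPACE_NOT_TEMP,    /* not a temp namespace (or gone) */
    TEMP_NAMESPACE_IDLE,        /* temp, but no backend is using it */
    TEMP_NAMESPACE_IN_USE       /* temp and owned by a live backend */
} ToyTempNamespaceStatus;

static int
is_orphaned_temp(ToyTempNamespaceStatus status)
{
    /* "not temp / missing" and "in use" are both left alone */
    return status == TEMP_NAMESPACE_IDLE;
}
```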
2008-08-13 02:07:50 +02:00
|
|
|
|
Code review for early drop of orphaned temp relations in autovacuum.
2016-11-28 03:23:39 +01:00
|
|
|
/* Fetch reloptions and the pgstat entry for this table */
|
|
|
|
relopts = extract_autovac_opts(tuple, pg_class_desc);
|
2022-04-07 06:29:46 +02:00
|
|
|
tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared,
|
|
|
|
relid);
|
Code review for early drop of orphaned temp relations in autovacuum.
2016-11-28 03:23:39 +01:00
|
|
|
|
|
|
|
        /* Check if it needs vacuum or analyze */
        relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                  effective_multixact_freeze_max_age,
                                  &dovacuum, &doanalyze, &wraparound);

        /* Relations that need work are added to table_oids */
        if (dovacuum || doanalyze)
            table_oids = lappend_oid(table_oids, relid);
        /*
         * Remember TOAST associations for the second pass.  Note: we must do
         * this whether or not the table is going to be vacuumed, because we
         * don't automatically vacuum toast tables along the parent table.
         */
        if (OidIsValid(classForm->reltoastrelid))
        {
            av_relation *hentry;
            bool        found;
            hentry = hash_search(table_toast_map,
                                 &classForm->reltoastrelid,
                                 HASH_ENTER, &found);
            if (!found)
            {
                /* hash_search already filled in the key */
                hentry->ar_relid = relid;
                hentry->ar_hasrelopts = false;
                if (relopts != NULL)
                {
                    hentry->ar_hasrelopts = true;
                    memcpy(&hentry->ar_reloptions, relopts,
                           sizeof(AutoVacOpts));
                }
            }
        }
    }
    table_endscan(relScan);

    /* second pass: check TOAST tables */
    ScanKeyInit(&key,
                Anum_pg_class_relkind,
                BTEqualStrategyNumber, F_CHAREQ,
                CharGetDatum(RELKIND_TOASTVALUE));
    relScan = table_beginscan_catalog(classRel, 1, &key);
    while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
        PgStat_StatTabEntry *tabentry;
        Oid         relid;
        AutoVacOpts *relopts = NULL;
        bool        dovacuum;
        bool        doanalyze;
        bool        wraparound;

        /*
         * We cannot safely process other backends' temp tables, so skip 'em.
         */
        if (classForm->relpersistence == RELPERSISTENCE_TEMP)
            continue;
        relid = classForm->oid;

        /*
         * fetch reloptions -- if this toast table does not have them, try the
         * main rel
         */
        relopts = extract_autovac_opts(tuple, pg_class_desc);
        if (relopts == NULL)
        {
            av_relation *hentry;
            bool        found;

            hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
            if (found && hentry->ar_hasrelopts)
                relopts = &hentry->ar_reloptions;
        }
        /* Fetch the pgstat entry for this table */
        tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared,
                                                  relid);

        relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                  effective_multixact_freeze_max_age,
                                  &dovacuum, &doanalyze, &wraparound);

        /* ignore analyze for toast tables */
        if (dovacuum)
            table_oids = lappend_oid(table_oids, relid);
    }
    table_endscan(relScan);
    table_close(classRel, AccessShareLock);
/*
|
Code review for early drop of orphaned temp relations in autovacuum.
Commit a734fd5d1 exposed some race conditions that existed previously
in the autovac code, but were basically harmless because autovac would
not try to delete orphaned relations immediately. Specifically, the test
for orphaned-ness was made on a pg_class tuple that might be dead by now,
allowing autovac to try to remove a table that the owning backend had just
finished deleting. This resulted in a hard crash due to inadequate caution
about accessing the table's catalog entries without any lock. We must take
a relation lock and then recheck whether the table is still present and
still looks deletable before we do anything.
Also, it seemed to me that deleting multiple tables per transaction, and
trying to continue after errors, represented unjustifiable complexity.
We do not expect this code path to be taken often in the field, nor even
during testing, which means that prioritizing performance over correctness
is a bad tradeoff. Rip all that out in favor of just starting a new
transaction after each successful temp table deletion. If we're unlucky
enough to get an error, which shouldn't happen anyway now that we're being
more cautious, let the autovacuum worker fail as it normally would.
In passing, improve the order of operations in the initial scan loop.
Now that we don't care about whether a temp table is a wraparound hazard,
there's no need to perform extract_autovac_opts, get_pgstat_tabentry_relid,
or relation_needs_vacanalyze for temp tables.
Also, if GetTempNamespaceBackendId returns InvalidBackendId (indicating
it doesn't recognize the schema as temp), treat that as meaning it's NOT
an orphaned temp table, not that it IS one, which is what happened before
because BackendIdGetProc necessarily failed. The case really shouldn't
come up for a table that has RELPERSISTENCE_TEMP, but the consequences
if it did seem undesirable. (This might represent a back-patchable bug
fix; not sure if it's worth the trouble.)
Discussion: https://postgr.es/m/21299.1480272347@sss.pgh.pa.us
2016-11-28 03:23:39 +01:00
|
|
|
* Recheck orphan temporary tables, and if they still seem orphaned, drop
|
|
|
|
* them. We'll eat a transaction per dropped table, which might seem
|
|
|
|
* excessive, but we should only need to do anything as a result of a
|
|
|
|
* previous backend crash, so this should not happen often enough to
|
|
|
|
* justify "optimizing". Using separate transactions ensures that we
|
|
|
|
* don't bloat the lock table if there are many temp tables to be dropped,
|
|
|
|
* and it ensures that we don't lose work if a deletion attempt fails.
|
autovacuum: Drop orphan temp tables more quickly but with more caution.
Previously, we only dropped an orphan temp table when it became old
enough to threaten wraparound; instead, doing it immediately. The
only value of waiting is that someone might be able to examine the
contents of the orphan temp table for forensic purposes, but it's
pretty difficult to actually do that and few users will wish to do so.
On the flip side, not performing the drop immediately generates log
spam and bloats pg_class.
In addition, per a report from Grigory Smolkin, if a temporary schema
contains a very large number of temporary tables, a backend attempting
to clear the temporary schema might fail due to lock table exhaustion.
It's helpful for autovacuum to clean up after such cases, and we don't
want it to wait for wraparound to threaten before doing so. To
prevent autovacuum from failing in the same manner as a backend trying
to drop an entire temp schema, remove orphan temp tables in batches of
50, committing after each batch, so that we don't accumulate an
unbounded number of locks. If a drop fails, retry other orphan tables
that need to be dropped up to 10 times before giving up. With this
system, if a backend does fail to clean a temporary schema due to
lock table exhaustion, autovacuum should hopefully put things right
the next time it processes the database.
Discussion: CAB7nPqSbYT6dRwsXVgiKmBdL_ARemfDZMPA+RPeC_ge0GK70hA@mail.gmail.com
Michael Paquier, with a bunch of comment changes by me.
2016-11-21 18:54:19 +01:00
*/
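The batch-and-retry scheme this commit message describes (batches of 50 with a commit after each batch, up to 10 retry passes) can be sketched in standalone C. Everything below is a hypothetical stand-in, not PostgreSQL code: `flaky_drop` simulates a drop that can fail and later succeed, and the commit point is only marked by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BATCH_SIZE	50			/* commit after this many drops */
#define MAX_RETRIES 10			/* give up after this many passes */
#define MAX_TABLES	256

static int	drop_attempts[MAX_TABLES];

/*
 * Stand-in for dropping one orphan table: fails on the first attempt for
 * every third relid, succeeds when retried.
 */
static bool
flaky_drop(int relid)
{
	drop_attempts[relid]++;
	return !(relid % 3 == 0 && drop_attempts[relid] == 1);
}

/*
 * Drop orphans in batches of BATCH_SIZE, retrying failed drops for up to
 * MAX_RETRIES passes; returns the number of tables dropped.
 */
static int
drop_orphans_batched(int *relids, int n)
{
	int			dropped = 0;

	for (int pass = 0; pass < MAX_RETRIES; pass++)
	{
		int			failed = 0;
		int			in_batch = 0;

		for (int i = 0; i < n; i++)
		{
			if (relids[i] < 0)
				continue;		/* dropped in an earlier pass */
			if (flaky_drop(relids[i]))
			{
				relids[i] = -1;
				dropped++;
			}
			else
				failed++;
			if (++in_batch == BATCH_SIZE)
				in_batch = 0;	/* commit point: accumulated locks released */
		}
		if (failed == 0)
			break;				/* nothing left to retry */
	}
	return dropped;
}

static int
demo(void)
{
	int			relids[10];

	memset(drop_attempts, 0, sizeof(drop_attempts));
	for (int i = 0; i < 10; i++)
		relids[i] = i + 1;		/* relids 3, 6, 9 fail once, then succeed */
	return drop_orphans_batched(relids, 10);
}
```

The point of committing per batch is that locks taken on already-dropped tables are released at each commit, so the lock table never holds more than one batch's worth of entries at a time.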
Code review for early drop of orphaned temp relations in autovacuum.
Commit a734fd5d1 exposed some race conditions that existed previously
in the autovac code, but were basically harmless because autovac would
not try to delete orphaned relations immediately. Specifically, the test
for orphaned-ness was made on a pg_class tuple that might be dead by now,
allowing autovac to try to remove a table that the owning backend had just
finished deleting. This resulted in a hard crash due to inadequate caution
about accessing the table's catalog entries without any lock. We must take
a relation lock and then recheck whether the table is still present and
still looks deletable before we do anything.
Also, it seemed to me that deleting multiple tables per transaction, and
trying to continue after errors, represented unjustifiable complexity.
We do not expect this code path to be taken often in the field, nor even
during testing, which means that prioritizing performance over correctness
is a bad tradeoff. Rip all that out in favor of just starting a new
transaction after each successful temp table deletion. If we're unlucky
enough to get an error, which shouldn't happen anyway now that we're being
more cautious, let the autovacuum worker fail as it normally would.
In passing, improve the order of operations in the initial scan loop.
Now that we don't care about whether a temp table is a wraparound hazard,
there's no need to perform extract_autovac_opts, get_pgstat_tabentry_relid,
or relation_needs_vacanalyze for temp tables.
Also, if GetTempNamespaceBackendId returns InvalidBackendId (indicating
it doesn't recognize the schema as temp), treat that as meaning it's NOT
an orphaned temp table, not that it IS one, which is what happened before
because BackendIdGetProc necessarily failed. The case really shouldn't
come up for a table that has RELPERSISTENCE_TEMP, but the consequences
if it did seem undesirable. (This might represent a back-patchable bug
fix; not sure if it's worth the trouble.)
Discussion: https://postgr.es/m/21299.1480272347@sss.pgh.pa.us
2016-11-28 03:23:39 +01:00
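The corrected orphan test this message describes (InvalidBackendId now means NOT orphaned) condenses into a small predicate. This is a hypothetical sketch, with plain `char` and `int` standing in for PostgreSQL's relpersistence and BackendId types and a `backend_alive` flag standing in for the BackendIdGetProc lookup.

```c
#include <assert.h>
#include <stdbool.h>

#define RELPERSISTENCE_TEMP 't' /* value stored in pg_class.relpersistence */
#define INVALID_BACKEND_ID	(-1)

/*
 * Sketch of the fixed test: a relation counts as an orphaned temp table only
 * if it is temp, its schema maps to a recognized backend slot, and that
 * backend is no longer alive.  InvalidBackendId now yields "not orphaned",
 * where the old code effectively concluded the opposite because
 * BackendIdGetProc necessarily failed.
 */
static bool
is_orphan_temp(char relpersistence, int backend_id, bool backend_alive)
{
	if (relpersistence != RELPERSISTENCE_TEMP)
		return false;			/* not a temp table at all */
	if (backend_id == INVALID_BACKEND_ID)
		return false;			/* schema not recognized as temp: NOT orphaned */
	return !backend_alive;		/* orphaned only if the owner is gone */
}
```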
	foreach(cell, orphan_oids)
	{
		Oid			relid = lfirst_oid(cell);
		Form_pg_class classForm;
		ObjectAddress object;

		/* Check for user-requested abort. */
		CHECK_FOR_INTERRUPTS();

		/*
		 * Try to lock the table.  If we can't get the lock immediately,
		 * somebody else is using (or dropping) the table, so it's not our
		 * concern anymore.  Having the lock prevents race conditions below.
		 */
		if (!ConditionalLockRelationOid(relid, AccessExclusiveLock))
			continue;

		/*
		 * Re-fetch the pg_class tuple and re-check whether it still seems to
		 * be an orphaned temp table.  If it's not there or no longer the same
		 * relation, ignore it.
		 */
		tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
		if (!HeapTupleIsValid(tuple))
		{
			/* be sure to drop useless lock so we don't bloat lock table */
			UnlockRelationOid(relid, AccessExclusiveLock);
			continue;
		}
		classForm = (Form_pg_class) GETSTRUCT(tuple);
|
autovacuum: Drop orphan temp tables more quickly but with more caution.
Previously, we only dropped an orphan temp table when it became old
enough to threaten wraparound; instead, doing it immediately. The
only value of waiting is that someone might be able to examine the
contents of the orphan temp table for forensic purposes, but it's
pretty difficult to actually do that and few users will wish to do so.
On the flip side, not performing the drop immediately generates log
spam and bloats pg_class.
In addition, per a report from Grigory Smolkin, if a temporary schema
contains a very large number of temporary tables, a backend attempting
to clear the temporary schema might fail due to lock table exhaustion.
It's helpful for autovacuum to clean up after such cases, and we don't
want it to wait for wraparound to threaten before doing so. To
prevent autovacuum from failing in the same manner as a backend trying
to drop an entire temp schema, remove orphan temp tables in batches of
50, committing after each batch, so that we don't accumulate an
unbounded number of locks. If a drop fails, retry other orphan tables
that need to be dropped up to 10 times before giving up. With this
system, if a backend does fail to clean a temporary schema due to
lock table exhaustion, autovacuum should hopefully put things right
the next time it processes the database.
Discussion: CAB7nPqSbYT6dRwsXVgiKmBdL_ARemfDZMPA+RPeC_ge0GK70hA@mail.gmail.com
Michael Paquier, with a bunch of comment changes by me.
2016-11-21 18:54:19 +01:00
|
|
|
|
|
|
|
/*
|
Code review for early drop of orphaned temp relations in autovacuum.
Commit a734fd5d1 exposed some race conditions that existed previously
in the autovac code, but were basically harmless because autovac would
not try to delete orphaned relations immediately. Specifically, the test
for orphaned-ness was made on a pg_class tuple that might be dead by now,
allowing autovac to try to remove a table that the owning backend had just
finished deleting. This resulted in a hard crash due to inadequate caution
about accessing the table's catalog entries without any lock. We must take
a relation lock and then recheck whether the table is still present and
still looks deletable before we do anything.
Also, it seemed to me that deleting multiple tables per transaction, and
trying to continue after errors, represented unjustifiable complexity.
We do not expect this code path to be taken often in the field, nor even
during testing, which means that prioritizing performance over correctness
is a bad tradeoff. Rip all that out in favor of just starting a new
transaction after each successful temp table deletion. If we're unlucky
enough to get an error, which shouldn't happen anyway now that we're being
more cautious, let the autovacuum worker fail as it normally would.
In passing, improve the order of operations in the initial scan loop.
Now that we don't care about whether a temp table is a wraparound hazard,
there's no need to perform extract_autovac_opts, get_pgstat_tabentry_relid,
or relation_needs_vacanalyze for temp tables.
Also, if GetTempNamespaceBackendId returns InvalidBackendId (indicating
it doesn't recognize the schema as temp), treat that as meaning it's NOT
an orphaned temp table, not that it IS one, which is what happened before
because BackendIdGetProc necessarily failed. The case really shouldn't
come up for a table that has RELPERSISTENCE_TEMP, but the consequences
if it did seem undesirable. (This might represent a back-patchable bug
fix; not sure if it's worth the trouble.)
Discussion: https://postgr.es/m/21299.1480272347@sss.pgh.pa.us
2016-11-28 03:23:39 +01:00
|
|
|
* Make all the same tests made in the loop above. In event of OID
|
|
|
|
* counter wraparound, the pg_class entry we have now might be
|
|
|
|
* completely unrelated to the one we saw before.
|
autovacuum: Drop orphan temp tables more quickly but with more caution.
Previously, we only dropped an orphan temp table when it became old
enough to threaten wraparound; instead, doing it immediately. The
only value of waiting is that someone might be able to examine the
contents of the orphan temp table for forensic purposes, but it's
pretty difficult to actually do that and few users will wish to do so.
On the flip side, not performing the drop immediately generates log
spam and bloats pg_class.
In addition, per a report from Grigory Smolkin, if a temporary schema
contains a very large number of temporary tables, a backend attempting
to clear the temporary schema might fail due to lock table exhaustion.
It's helpful for autovacuum to clean up after such cases, and we don't
want it to wait for wraparound to threaten before doing so. To
prevent autovacuum from failing in the same manner as a backend trying
to drop an entire temp schema, remove orphan temp tables in batches of
50, committing after each batch, so that we don't accumulate an
unbounded number of locks. If a drop fails, retry other orphan tables
that need to be dropped up to 10 times before giving up. With this
system, if a backend does fail to clean a temporary schema due to
lock table exhaustion, autovacuum should hopefully put things right
the next time it processes the database.
Discussion: CAB7nPqSbYT6dRwsXVgiKmBdL_ARemfDZMPA+RPeC_ge0GK70hA@mail.gmail.com
Michael Paquier, with a bunch of comment changes by me.
2016-11-21 18:54:19 +01:00
|
|
|
*/
|
Code review for early drop of orphaned temp relations in autovacuum.
Commit a734fd5d1 exposed some race conditions that existed previously
in the autovac code, but were basically harmless because autovac would
not try to delete orphaned relations immediately. Specifically, the test
for orphaned-ness was made on a pg_class tuple that might be dead by now,
allowing autovac to try to remove a table that the owning backend had just
finished deleting. This resulted in a hard crash due to inadequate caution
about accessing the table's catalog entries without any lock. We must take
a relation lock and then recheck whether the table is still present and
still looks deletable before we do anything.
Also, it seemed to me that deleting multiple tables per transaction, and
trying to continue after errors, represented unjustifiable complexity.
We do not expect this code path to be taken often in the field, nor even
during testing, which means that prioritizing performance over correctness
is a bad tradeoff. Rip all that out in favor of just starting a new
transaction after each successful temp table deletion. If we're unlucky
enough to get an error, which shouldn't happen anyway now that we're being
more cautious, let the autovacuum worker fail as it normally would.
In passing, improve the order of operations in the initial scan loop.
Now that we don't care about whether a temp table is a wraparound hazard,
there's no need to perform extract_autovac_opts, get_pgstat_tabentry_relid,
or relation_needs_vacanalyze for temp tables.
Also, if GetTempNamespaceBackendId returns InvalidBackendId (indicating
it doesn't recognize the schema as temp), treat that as meaning it's NOT
an orphaned temp table, not that it IS one, which is what happened before
because BackendIdGetProc necessarily failed. The case really shouldn't
come up for a table that has RELPERSISTENCE_TEMP, but the consequences
if it did seem undesirable. (This might represent a back-patchable bug
fix; not sure if it's worth the trouble.)
Discussion: https://postgr.es/m/21299.1480272347@sss.pgh.pa.us
2016-11-28 03:23:39 +01:00
if (!((classForm->relkind == RELKIND_RELATION ||
classForm->relkind == RELKIND_MATVIEW) &&
classForm->relpersistence == RELPERSISTENCE_TEMP))
{
UnlockRelationOid(relid, AccessExclusiveLock);
continue;
}
Make autovacuum more aggressive to remove orphaned temp tables
Commit dafa084, added in 10, made the removal of orphaned temporary
tables more aggressive. This commit goes one step further by adding a
flag to each backend's MyProc which tracks any temporary namespace
currently in use. The flag is set when the namespace gets created, and
can be reset if the temporary namespace was created in a transaction or
sub-transaction which is aborted. The flag value assignment is assumed
to be atomic, so this can be done in a lock-less fashion like other
flags already present in PGPROC such as databaseId or backendId; still,
the fact that the temporary namespace and the tables created in it
remain locked until the creating transaction commits acts as a barrier
for other backends.
This new flag is used by autovacuum to discard orphaned tables more
aggressively, additionally checking which database a backend is
connected to as well as its temporary namespace in use, removing
orphaned temporary relations even if a backend reuses the same slot as
one which created temporary relations in a past session.
The base idea of this patch comes from Robert Haas, has been written in
its first version by Tsunakawa Takayuki, then heavily reviewed by me.
Author: Tsunakawa Takayuki
Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Andres Freund
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05
Backpatch: 11-, as PGPROC gains a new flag and we don't want silent ABI
breakages on already released versions.
2018-08-13 11:49:04 +02:00
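The lock-less flag described in the commit message above can be sketched in miniature. Note this is a hypothetical simplification: `MockProc`, `set_temp_namespace`, and `namespace_in_use` are invented names, not PostgreSQL's actual PGPROC layout or API; the real code stores the namespace in PGPROC and scans the live proc array.

```c
#include <assert.h>
#include <stdatomic.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical stand-in for the PGPROC flag described above: a single
 * word read and written without a lock, relying on atomic assignment. */
typedef struct MockProc
{
    _Atomic Oid tempNamespaceId;    /* temp namespace in use, or InvalidOid */
} MockProc;

/* Backend side: advertise the namespace once it has been created... */
static void
set_temp_namespace(MockProc *proc, Oid nspid)
{
    atomic_store(&proc->tempNamespaceId, nspid);
}

/* ...and clear it again if the creating (sub)transaction aborts. */
static void
reset_temp_namespace(MockProc *proc)
{
    atomic_store(&proc->tempNamespaceId, InvalidOid);
}

/* Autovacuum side: a temp table counts as orphaned only if no live
 * backend currently advertises its namespace, even a backend that
 * reuses the same slot as one from a past session. */
static int
namespace_in_use(MockProc *procs, int nprocs, Oid nspid)
{
    for (int i = 0; i < nprocs; i++)
        if (atomic_load(&procs[i].tempNamespaceId) == nspid)
            return 1;
    return 0;
}
```

The per-word atomicity is what lets readers skip the lock; the namespace-level heavyweight locks mentioned above are what keep the check from racing with a concurrent creation.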
Avoid failure if autovacuum tries to access a just-dropped temp namespace.
Such an access became possible when commit 246a6c8f7 added more
aggressive cleanup of orphaned temp relations by autovacuum.
Since autovacuum's snapshot might be slightly stale, it could
attempt to access an already-dropped temp namespace, resulting in
an assertion failure or null-pointer dereference. (In practice,
since we don't drop temp namespaces automatically but merely
recycle them, this situation could only arise if a superuser does
a manual drop of a temp namespace. Still, that should be allowed.)
The core of the bug, IMO, is that isTempNamespaceInUse and its callers
failed to think hard about whether to treat "temp namespace isn't there"
differently from "temp namespace isn't in use". In hopes of forestalling
future mistakes of the same ilk, replace that function with a new one
checkTempNamespaceStatus, which makes the same tests but returns a
three-way enum rather than just a bool. isTempNamespaceInUse is gone
entirely in HEAD; but just in case some external code is relying on it,
keep it in the back branches, as a bug-compatible wrapper around the
new function.
Per report originally from Prabhat Kumar Sahu, investigated by Mahendra
Singh and Michael Paquier; the final form of the patch is my fault.
This replaces the failed fix attempt in a052f6cbb.
Backpatch as far as v11, as 246a6c8f7 was.
Discussion: https://postgr.es/m/CAKYtNAr9Zq=1-ww4etHo-VCC-k120YxZy5OS01VkaLPaDbv2tg@mail.gmail.com
2020-02-29 02:28:34 +01:00
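The three-way distinction this commit introduces can be sketched against a toy proc array. The enum and `check_temp_ns_status` below are illustrative stand-ins for the real checkTempNamespaceStatus; the point is that a bool cannot distinguish "namespace is gone" from "namespace exists but is idle".

```c
#include <assert.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Illustrative three-way status, mirroring the checkTempNamespaceStatus
 * idea described above. */
typedef enum
{
    TEMP_NS_NOT_TEMP,   /* not a (still-existing) temp namespace */
    TEMP_NS_IDLE,       /* temp namespace exists but its owner is gone */
    TEMP_NS_IN_USE      /* owning backend is alive and using it */
} TempNsStatus;

/* Hypothetical lookup: "advertised" holds the temp namespace each live
 * backend has published, or InvalidOid. */
static TempNsStatus
check_temp_ns_status(Oid nspid, const Oid *advertised, int nbackends,
                     int ns_exists)
{
    if (!ns_exists)
        return TEMP_NS_NOT_TEMP;
    for (int i = 0; i < nbackends; i++)
        if (advertised[i] == nspid)
            return TEMP_NS_IN_USE;
    return TEMP_NS_IDLE;
}
```

Only TEMP_NS_IDLE makes a table droppable, which is exactly how the code below uses it: a namespace that has vanished no longer looks like an orphan, so autovacuum leaves it alone instead of crashing.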
if (checkTempNamespaceStatus(classForm->relnamespace) != TEMP_NAMESPACE_IDLE)
{
UnlockRelationOid(relid, AccessExclusiveLock);
continue;
}
/* OK, let's delete it */
ereport(LOG,
(errmsg("autovacuum: dropping orphan temp table \"%s.%s.%s\"",
get_database_name(MyDatabaseId),
get_namespace_name(classForm->relnamespace),
NameStr(classForm->relname))));
object.classId = RelationRelationId;
object.objectId = relid;
object.objectSubId = 0;
Delete deleteWhatDependsOn() in favor of more performDeletion() flag bits.
deleteWhatDependsOn() had grown an uncomfortably large number of
assumptions about what it's used for. There are actually only two minor
differences between what it does and what a regular performDeletion() call
can do, so let's invent additional bits in performDeletion's existing flags
argument that specify those behaviors, and get rid of deleteWhatDependsOn()
as such. (We'd probably have done it this way from the start, except that
performDeletion didn't originally have a flags argument, IIRC.)
Also, add a SKIP_EXTENSIONS flag bit that prevents ever recursing to an
extension, and use that when dropping temporary objects at session end.
This provides a more general solution to the problem addressed in a hacky
way in commit 08dd23cec: if an extension script creates temp objects and
forgets to remove them again, the whole extension went away when its
contained temp objects were deleted. The previous solution only covered
temp relations, but this solves it for all object types.
These changes require minor additions in dependency.c to pass the flags
to subroutines that previously didn't get them, but it's still a net
savings of code, and it seems cleaner than before.
Having done this, revert the special-case code added in 08dd23cec that
prevented addition of pg_depend records for temp table extension
membership, because that caused its own oddities: dropping an extension
that had created such a table didn't automatically remove the table,
leading to a failure if the table had another dependency on the extension
(such as use of an extension data type), or to a duplicate-name failure if
you then tried to recreate the extension. But we keep the part that
prevents the pg_temp_nnn schema from becoming an extension member; we never
want that to happen. Add a regression test case covering these behaviors.
Although this fixes some arguable bugs, we've heard few field complaints,
and any such problems are easily worked around by explicitly dropping temp
objects at the end of extension scripts (which seems like good practice
anyway). So I won't risk a back-patch.
Discussion: https://postgr.es/m/e51f4311-f483-4dd0-1ccc-abec3c405110@BlueTreble.com
2016-12-02 20:57:35 +01:00
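The flag-bits pattern this commit describes can be illustrated with a minimal sketch. The `DEL_*` names, values, and `perform_delete` are hypothetical, loosely mirroring performDeletion's PERFORM_DELETION_* bits; the design point is that behavior variants become OR-able bits on one generic routine instead of a separate specialized function like deleteWhatDependsOn().

```c
#include <assert.h>

/* Illustrative flag bits in the style of performDeletion(); names and
 * values are hypothetical, not PostgreSQL's actual definitions. */
#define DEL_INTERNAL        (1 << 0)   /* deletion is internally driven */
#define DEL_QUIETLY         (1 << 1)   /* suppress per-object notices */
#define DEL_SKIP_EXTENSIONS (1 << 2)   /* never recurse into an extension */

struct del_result
{
    int notices_emitted;
    int extensions_dropped;
};

/* One generic deletion routine; callers select behavior by OR-ing flag
 * bits rather than calling a purpose-built variant. */
static struct del_result
perform_delete(int nobjects, int ext_members, int flags)
{
    struct del_result r;

    r.notices_emitted = (flags & DEL_QUIETLY) ? 0 : nobjects;
    r.extensions_dropped = (flags & DEL_SKIP_EXTENSIONS) ? 0 : ext_members;
    return r;
}
```

The SKIP_EXTENSIONS bit is what keeps session-end temp cleanup from cascading into an extension that happened to create temp objects, which is the scenario the commit message discusses.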
performDeletion(&object, DROP_CASCADE,
PERFORM_DELETION_INTERNAL |
PERFORM_DELETION_QUIETLY |
PERFORM_DELETION_SKIP_EXTENSIONS);
/*
* To commit the deletion, end current transaction and start a new
* one. Note this also releases the lock we took.
*/
CommitTransactionCommand();
StartTransactionCommand();
/* StartTransactionCommand changed current memory context */
MemoryContextSwitchTo(AutovacMemCxt);
}
/*
* Optionally, create a buffer access strategy object for VACUUM to use.
* We use the same BufferAccessStrategy object for all tables VACUUMed by
* this worker to prevent autovacuum from blowing out shared buffers.
*
* VacuumBufferUsageLimit being set to 0 results in
* GetAccessStrategyWithSize returning NULL, effectively meaning we can
* use up to all of shared buffers.
*
* If we later enter failsafe mode on any of the tables being vacuumed, we
* will cease use of the BufferAccessStrategy only for that table.
*
* XXX should we consider adding code to adjust the size of this if
* VacuumBufferUsageLimit changes?
*/
bstrategy = GetAccessStrategyWithSize(BAS_VACUUM, VacuumBufferUsageLimit);
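The ring idea behind a BufferAccessStrategy can be sketched in miniature. `BufRing` and `ring_next` below are invented names for illustration; the one behavior taken from the comment above is that a size of 0 stands for "no strategy" (NULL), meaning the caller may use the whole buffer pool.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy access-strategy ring: a bulk operation reuses the same nbuffers
 * slots round-robin, so it cannot evict the rest of the cache. */
typedef struct BufRing
{
    int nbuffers;   /* ring size */
    int next;       /* next slot to hand out */
    int *slots;     /* buffer ids assigned to this ring (-1 = unassigned) */
} BufRing;

static BufRing *
ring_create(int nbuffers)
{
    BufRing *r;

    if (nbuffers == 0)
        return NULL;            /* no strategy: unrestricted buffer use */
    r = malloc(sizeof(BufRing));
    r->nbuffers = nbuffers;
    r->next = 0;
    r->slots = malloc(nbuffers * sizeof(int));
    for (int i = 0; i < nbuffers; i++)
        r->slots[i] = -1;
    return r;
}

/* Hand out the next slot in the ring, wrapping around so the same small
 * set of buffers is recycled. */
static int
ring_next(BufRing *r)
{
    int s = r->next;

    r->next = (r->next + 1) % r->nbuffers;
    return s;
}
```

Sharing one such ring across all tables a worker vacuums is what keeps an autovacuum pass from churning through shared buffers table after table.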
/*
* create a memory context to act as fake PortalContext, so that the
* contexts created in the vacuum code are cleaned up for each table.
*/
PortalContext = AllocSetContextCreate(AutovacMemCxt,
"Autovacuum Portal",
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
ALLOCSET_DEFAULT_SIZES);
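The ALLOCSET_DEFAULT_SIZES argument in the call above is an instance of the macros the commit message describes: one token expands to the whole parameter triple, so call sites cannot mistype one of the three sizes. A minimal sketch of the pattern, with invented `Ctx`/`ctx_create`/`CTX_*` names (the 0 / 8kB / 8MB values mirror the default allocation parameters but are illustrative here):

```c
#include <assert.h>
#include <stddef.h>

/* A context descriptor with the three sizing parameters that used to be
 * spelled out (and occasionally mistyped) at every call site. */
typedef struct Ctx
{
    size_t minsize;     /* minimum retained size */
    size_t initsize;    /* size of first block */
    size_t maxsize;     /* cap on block growth */
} Ctx;

static Ctx
ctx_create(size_t minsize, size_t initsize, size_t maxsize)
{
    Ctx c = { minsize, initsize, maxsize };
    return c;
}

/* Each macro expands to a complete, known-good triple, in the spirit of
 * ALLOCSET_DEFAULT_SIZES / ALLOCSET_SMALL_SIZES. */
#define CTX_DEFAULT_SIZES  0, (8 * 1024), (8 * 1024 * 1024)
#define CTX_SMALL_SIZES    0, (1 * 1024), (8 * 1024)
```

Callers then write `ctx_create(CTX_DEFAULT_SIZES)` instead of three literals, which is the copy-paste hazard the commit set out to remove.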
/*
* Perform operations on collected tables.
*/
foreach(cell, table_oids)
{
Oid relid = lfirst_oid(cell);
HeapTuple classTup;
autovac_table *tab;
bool isshared;
bool skipit;
dlist_iter iter;
CHECK_FOR_INTERRUPTS();
/*
* Check for config changes before processing each collected table.
*/
if (ConfigReloadPending)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
/*
* You might be tempted to bail out if we see autovacuum is now
* disabled. Must resist that temptation -- this might be a
* for-wraparound emergency worker, in which case that would be
* entirely inappropriate.
*/
}
|
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
/*
|
2018-03-13 17:28:15 +01:00
|
|
|
* Find out whether the table is shared or not. (It's slightly
|
|
|
|
* annoying to fetch the syscache entry just for this, but in typical
|
|
|
|
* cases it adds little cost because table_recheck_autovac would
|
|
|
|
* refetch the entry anyway. We could buy that back by copying the
|
|
|
|
* tuple here and passing it to table_recheck_autovac, but that
|
|
|
|
* increases the odds of that function working with stale data.)
|
|
|
|
*/
|
|
|
|
classTup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
|
|
|
|
if (!HeapTupleIsValid(classTup))
|
|
|
|
continue; /* somebody deleted the rel, forget it */
|
|
|
|
isshared = ((Form_pg_class) GETSTRUCT(classTup))->relisshared;
|
|
|
|
ReleaseSysCache(classTup);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Hold schedule lock from here until we've claimed the table. We
|
|
|
|
* also need the AutovacuumLock to walk the worker array, but that one
|
|
|
|
* can just be a shared lock.
|
2007-04-16 20:30:04 +02:00
|
|
|
*/
|
|
|
|
LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
|
|
|
|
LWLockAcquire(AutovacuumLock, LW_SHARED);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether the table is being vacuumed concurrently by another
|
|
|
|
* worker.
|
|
|
|
*/
|
|
|
|
skipit = false;
|
2012-10-16 22:36:30 +02:00
|
|
|
dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
|
2007-04-16 20:30:04 +02:00
|
|
|
{
|
2012-10-16 22:36:30 +02:00
|
|
|
WorkerInfo worker = dlist_container(WorkerInfoData, wi_links, iter.cur);
|
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
/* ignore myself */
|
|
|
|
if (worker == MyWorkerInfo)
|
2012-10-16 22:36:30 +02:00
|
|
|
continue;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2016-05-10 21:23:54 +02:00
|
|
|
/* ignore workers in other databases (unless table is shared) */
|
|
|
|
if (!worker->wi_sharedrel && worker->wi_dboid != MyDatabaseId)
|
2012-10-16 22:36:30 +02:00
|
|
|
continue;
|
2007-04-16 20:30:04 +02:00
|
|
|
|
|
|
|
if (worker->wi_tableoid == relid)
|
|
|
|
{
|
|
|
|
skipit = true;
|
2017-01-20 21:55:45 +01:00
|
|
|
found_concurrent_worker = true;
|
2007-04-16 20:30:04 +02:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
LWLockRelease(AutovacuumLock);
|
|
|
|
if (skipit)
|
|
|
|
{
|
|
|
|
LWLockRelease(AutovacuumScheduleLock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2018-03-13 17:28:15 +01:00
|
|
|
/*
|
|
|
|
* Store the table's OID in shared memory before releasing the
|
|
|
|
* schedule lock, so that other workers don't try to vacuum it
|
|
|
|
* concurrently. (We claim it here so as not to hold
|
|
|
|
* AutovacuumScheduleLock while rechecking the stats.)
|
|
|
|
*/
|
|
|
|
MyWorkerInfo->wi_tableoid = relid;
|
|
|
|
MyWorkerInfo->wi_sharedrel = isshared;
|
|
|
|
LWLockRelease(AutovacuumScheduleLock);
|
|
|
|
|
2005-08-15 18:25:19 +02:00
|
|
|
/*
|
2007-03-29 00:17:12 +02:00
|
|
|
* Check whether pgstat data still says we need to vacuum this table.
|
|
|
|
* It could have changed if something else processed the table while
|
2022-04-07 06:29:46 +02:00
|
|
|
* we weren't looking. This doesn't entirely close the race condition,
|
|
|
|
 * but the window is very small.
|
2005-08-15 18:25:19 +02:00
|
|
|
*/
|
2007-06-30 06:08:05 +02:00
|
|
|
MemoryContextSwitchTo(AutovacMemCxt);
|
2015-05-08 18:09:14 +02:00
|
|
|
tab = table_recheck_autovac(relid, table_toast_map, pg_class_desc,
|
|
|
|
effective_multixact_freeze_max_age);
|
2007-03-29 00:17:12 +02:00
|
|
|
if (tab == NULL)
|
2005-08-15 18:25:19 +02:00
|
|
|
{
|
2009-02-09 21:57:59 +01:00
|
|
|
/* someone else vacuumed the table, or it went away */
|
2018-03-13 17:28:15 +01:00
|
|
|
LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
|
|
|
|
MyWorkerInfo->wi_tableoid = InvalidOid;
|
|
|
|
MyWorkerInfo->wi_sharedrel = false;
|
2007-04-16 20:30:04 +02:00
|
|
|
LWLockRelease(AutovacuumScheduleLock);
|
2005-08-15 18:25:19 +02:00
|
|
|
continue;
|
2007-03-29 00:17:12 +02:00
|
|
|
}
|
2005-08-15 18:25:19 +02:00
|
|
|
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
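The commit message above says the reload check now happens roughly once per block, at the point where the worker already decides whether to sleep for cost-based delay. A minimal sketch of that control flow (the names `config_reload_pending`, `reload_config`, and `vacuum_delay_point_sketch` are illustrative stand-ins, not the real PostgreSQL symbols):

```c
#include <assert.h>
#include <stdbool.h>

static bool config_reload_pending = false;  /* set by a SIGHUP handler */
static int  reloads_done = 0;

/* stand-in for ProcessConfigFile + VacuumUpdateCosts */
static void
reload_config(void)
{
    reloads_done++;
}

/* Called roughly once per block processed.  Honoring a pending reload
 * here lets new cost delay/limit values take effect mid-table instead
 * of only between tables. */
static void
vacuum_delay_point_sketch(void)
{
    if (config_reload_pending)
    {
        config_reload_pending = false;
        reload_config();
    }
    /* ... then sleep if the accumulated cost balance demands it ... */
}
```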
|
|
|
/*
|
|
|
|
* Save the cost-related storage parameter values in global variables
|
|
|
|
* for reference when updating vacuum_cost_delay and vacuum_cost_limit
|
|
|
|
* during vacuuming this table.
|
|
|
|
*/
|
|
|
|
av_storage_param_cost_delay = tab->at_storage_param_vac_cost_delay;
|
|
|
|
av_storage_param_cost_limit = tab->at_storage_param_vac_cost_limit;
|
2007-10-24 21:08:25 +02:00
|
|
|
|
2023-04-07 01:00:21 +02:00
|
|
|
/*
|
|
|
|
* We only expect this worker to ever set the flag, so don't bother
|
|
|
|
* checking the return value. We shouldn't have to retry.
|
|
|
|
*/
|
|
|
|
if (tab->at_dobalance)
|
|
|
|
pg_atomic_test_set_flag(&MyWorkerInfo->wi_dobalance);
|
|
|
|
else
|
|
|
|
pg_atomic_clear_flag(&MyWorkerInfo->wi_dobalance);
|
2007-10-24 21:08:25 +02:00
|
|
|
|
2023-04-07 01:00:21 +02:00
|
|
|
LWLockAcquire(AutovacuumLock, LW_SHARED);
|
|
|
|
autovac_recalculate_workers_for_balance();
|
|
|
|
LWLockRelease(AutovacuumLock);
|
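The recalculation above feeds the balancing rule described in the "Refresh cost-based delay params" commit message: workers vacuuming tables with no cost-related storage parameters share one global cost limit equally, so only their count is needed. A hedged sketch of that division (function name and clamping detail are illustrative, not the exact vacuum.c arithmetic):

```c
#include <assert.h>

/* Split the global vacuum cost limit evenly among the workers that
 * participate in balancing; workers with table-specific cost settings
 * are excluded and simply use their own values. */
static int
balanced_cost_limit(int global_limit, int nworkers_for_balance)
{
    int         limit;

    if (nworkers_for_balance <= 0)
        return global_limit;    /* nothing to share with */

    limit = global_limit / nworkers_for_balance;
    return (limit < 1) ? 1 : limit; /* keep a usable minimum */
}
```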
2007-10-24 21:08:25 +02:00
|
|
|
|
2023-04-07 01:00:21 +02:00
|
|
|
/*
|
|
|
|
* We wait until this point to update cost delay and cost limit
|
|
|
|
* values, even though we reloaded the configuration file above, so
|
|
|
|
* that we can take into account the cost-related storage parameters.
|
|
|
|
*/
|
2023-04-07 00:54:53 +02:00
|
|
|
VacuumUpdateCosts();
|
2010-11-20 04:28:20 +01:00
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
|
2007-06-30 06:08:05 +02:00
|
|
|
/* clean up memory before each iteration */
|
2023-11-15 20:42:30 +01:00
|
|
|
MemoryContextReset(PortalContext);
|
2007-06-30 06:08:05 +02:00
|
|
|
|
2007-10-25 16:45:55 +02:00
|
|
|
/*
|
|
|
|
* Save the relation name for a possible error message, to avoid a
|
2008-07-17 23:02:31 +02:00
|
|
|
* catalog lookup in case of an error. If any of these return NULL,
|
|
|
|
* then the relation has been dropped since last we checked; skip it.
|
|
|
|
* Note: they must live in a long-lived memory context because we call
|
|
|
|
* vacuum and analyze in different transactions.
|
2007-10-25 16:45:55 +02:00
|
|
|
*/
|
2008-07-17 23:02:31 +02:00
|
|
|
|
|
|
|
tab->at_relname = get_rel_name(tab->at_relid);
|
|
|
|
tab->at_nspname = get_namespace_name(get_rel_namespace(tab->at_relid));
|
|
|
|
tab->at_datname = get_database_name(MyDatabaseId);
|
|
|
|
if (!tab->at_relname || !tab->at_nspname || !tab->at_datname)
|
|
|
|
goto deleted;
|
2007-10-25 16:45:55 +02:00
|
|
|
|
2007-06-29 19:07:39 +02:00
|
|
|
/*
|
2007-10-24 21:08:25 +02:00
|
|
|
* We will abort vacuuming the current table if something errors out,
|
|
|
|
* and continue with the next one in schedule; in particular, this
|
|
|
|
* happens if we are interrupted with SIGINT.
|
2007-06-29 19:07:39 +02:00
|
|
|
*/
|
|
|
|
PG_TRY();
|
|
|
|
{
|
2017-09-23 19:28:16 +02:00
|
|
|
/* Use PortalContext for any per-table allocations */
|
|
|
|
MemoryContextSwitchTo(PortalContext);
|
|
|
|
|
2007-06-29 19:07:39 +02:00
|
|
|
/* have at it */
|
2008-07-17 23:02:31 +02:00
|
|
|
autovacuum_do_vac_analyze(tab, bstrategy);
|
2007-10-26 22:45:10 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Clear a possible query-cancel signal, to avoid a late reaction
|
|
|
|
* to an automatically-sent signal because of vacuuming the
|
|
|
|
* current table (we're done with it, so it would make no sense to
|
|
|
|
* cancel at this point.)
|
|
|
|
*/
|
|
|
|
QueryCancelPending = false;
|
2007-06-29 19:07:39 +02:00
|
|
|
}
|
|
|
|
PG_CATCH();
|
|
|
|
{
|
|
|
|
/*
|
2007-10-24 21:08:25 +02:00
|
|
|
* Abort the transaction, start a new one, and proceed with the
|
|
|
|
* next table in our list.
|
2007-06-29 19:07:39 +02:00
|
|
|
*/
|
2007-10-24 21:08:25 +02:00
|
|
|
HOLD_INTERRUPTS();
|
2019-03-18 18:57:33 +01:00
|
|
|
if (tab->at_params.options & VACOPT_VACUUM)
|
2007-10-24 21:08:25 +02:00
|
|
|
errcontext("automatic vacuum of table \"%s.%s.%s\"",
|
2008-07-17 23:02:31 +02:00
|
|
|
tab->at_datname, tab->at_nspname, tab->at_relname);
|
2007-06-29 19:07:39 +02:00
|
|
|
else
|
2007-10-24 21:08:25 +02:00
|
|
|
errcontext("automatic analyze of table \"%s.%s.%s\"",
|
2008-07-17 23:02:31 +02:00
|
|
|
tab->at_datname, tab->at_nspname, tab->at_relname);
|
2007-10-24 21:08:25 +02:00
|
|
|
EmitErrorReport();
|
|
|
|
|
2020-11-16 23:42:55 +01:00
|
|
|
/* this resets ProcGlobal->statusFlags[i] too */
|
2007-10-24 21:08:25 +02:00
|
|
|
AbortOutOfAnyTransaction();
|
|
|
|
FlushErrorState();
|
2023-11-15 20:42:30 +01:00
|
|
|
MemoryContextReset(PortalContext);
|
2007-10-24 21:08:25 +02:00
|
|
|
|
|
|
|
/* restart our transaction for the following operations */
|
|
|
|
StartTransactionCommand();
|
|
|
|
RESUME_INTERRUPTS();
|
2007-06-29 19:07:39 +02:00
|
|
|
}
|
|
|
|
PG_END_TRY();
|
|
|
|
|
2017-09-23 19:28:16 +02:00
|
|
|
/* Make sure we're back in AutovacMemCxt */
|
|
|
|
MemoryContextSwitchTo(AutovacMemCxt);
|
|
|
|
|
2017-01-20 21:55:45 +01:00
|
|
|
did_vacuum = true;
|
|
|
|
|
2020-11-16 23:42:55 +01:00
|
|
|
/* ProcGlobal->statusFlags[i] are reset at the next end of xact */
|
2007-10-24 22:55:36 +02:00
|
|
|
|
2007-03-29 00:17:12 +02:00
|
|
|
/* be tidy */
|
2008-07-17 23:02:31 +02:00
|
|
|
deleted:
|
|
|
|
if (tab->at_datname != NULL)
|
|
|
|
pfree(tab->at_datname);
|
|
|
|
if (tab->at_nspname != NULL)
|
|
|
|
pfree(tab->at_nspname);
|
|
|
|
if (tab->at_relname != NULL)
|
|
|
|
pfree(tab->at_relname);
|
2007-03-29 00:17:12 +02:00
|
|
|
pfree(tab);
|
2007-10-24 21:08:25 +02:00
|
|
|
|
2010-11-20 04:28:20 +01:00
|
|
|
/*
|
2023-04-07 01:00:21 +02:00
|
|
|
* Remove my info from shared memory. We set wi_dobalance on the
|
|
|
|
* assumption that we are more likely than not to vacuum a table with
|
|
|
|
* no cost-related storage parameters next, so we want to claim our
|
|
|
|
* share of I/O as soon as possible to avoid thrashing the global
|
|
|
|
* balance.
|
2010-11-20 04:28:20 +01:00
|
|
|
*/
|
2018-03-13 17:28:15 +01:00
|
|
|
LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
|
2007-10-24 21:08:25 +02:00
|
|
|
MyWorkerInfo->wi_tableoid = InvalidOid;
|
2016-05-10 21:23:54 +02:00
|
|
|
MyWorkerInfo->wi_sharedrel = false;
|
2018-03-13 17:28:15 +01:00
|
|
|
LWLockRelease(AutovacuumScheduleLock);
|
2023-04-07 01:00:21 +02:00
|
|
|
pg_atomic_test_set_flag(&MyWorkerInfo->wi_dobalance);
|
2005-07-29 21:30:09 +02:00
|
|
|
}
|
2005-07-14 07:13:45 +02:00
|
|
|
|
BRIN auto-summarization
Previously, only VACUUM would cause a page range to get initially
summarized by BRIN indexes, which for some use cases takes too much time
since the inserts occur. To avoid the delay, have brininsert request a
summarization run for the previous range as soon as the first tuple is
inserted into the first page of the next range. Autovacuum is in charge
of processing these requests, after doing all the regular vacuuming/
analyzing work on tables.
This doesn't impose any new tasks on autovacuum, because autovacuum was
already in charge of doing summarizations. The only actual effect is to
change the timing, i.e. that it occurs earlier. For this reason, we
don't go any great lengths to record these requests very robustly; if
they are lost because of a server crash or restart, they will happen at
a later time anyway.
Most of the new code here is in autovacuum, which can now be told about
"work items" to process. This can be used for other things such as GIN
pending list cleaning, perhaps visibility map bit setting, both of which
are currently invoked during vacuum, but do not really depend on vacuum
taking place.
The requests are at the page range level, a granularity for which we did
not have SQL-level access; we only had index-level summarization
requests via brin_summarize_new_values(). It seems reasonable to add
SQL-level access to range-level summarization too, so add a function
brin_summarize_range() to do that.
Authors: Álvaro Herrera, based on sketch from Simon Riggs.
Reviewed-by: Thomas Munro.
Discussion: https://postgr.es/m/20170301045823.vneqdqkmsd4as4ds@alvherre.pgsql
2017-04-01 19:00:53 +02:00
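The work-item mechanism described above is a fixed-size array in shared memory: requesters (such as brininsert) fill a free slot, and a worker later claims pending slots for its database after regular vacuuming. A simplified single-process sketch of the claim protocol (field and function names here are illustrative; the real code guards the array with AutovacuumLock and drops requests non-robustly, as the commit message notes):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_WORKITEMS 8

typedef struct
{
    bool        used;       /* slot holds a pending request */
    bool        active;     /* a worker is processing it */
    int         database;   /* database the request belongs to */
} WorkItemSketch;

static WorkItemSketch items[NUM_WORKITEMS];

/* Requester side: record a request in a free slot.  Returns false if
 * the array is full; the request is then simply dropped, since a lost
 * request will be satisfied by a later vacuum anyway. */
static bool
request_work_item(int database)
{
    for (int i = 0; i < NUM_WORKITEMS; i++)
    {
        if (!items[i].used)
        {
            items[i].used = true;
            items[i].active = false;
            items[i].database = database;
            return true;
        }
    }
    return false;
}

/* Worker side: claim the next pending item for my database, marking it
 * active so no other worker takes it.  Returns the slot index, or -1. */
static int
claim_work_item(int my_database)
{
    for (int i = 0; i < NUM_WORKITEMS; i++)
    {
        if (items[i].used && !items[i].active &&
            items[i].database == my_database)
        {
            items[i].active = true;
            return i;
        }
    }
    return -1;
}
```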
|
|
|
/*
|
|
|
|
* Perform additional work items, as requested by backends.
|
|
|
|
*/
|
2017-08-15 23:14:07 +02:00
|
|
|
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
|
|
|
|
for (i = 0; i < NUM_WORKITEMS; i++)
|
2017-04-01 19:00:53 +02:00
|
|
|
{
|
2017-08-15 23:14:07 +02:00
|
|
|
AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];
|
2017-04-01 19:00:53 +02:00
|
|
|
|
2017-08-15 23:14:07 +02:00
|
|
|
if (!workitem->avw_used)
|
|
|
|
continue;
|
|
|
|
if (workitem->avw_active)
|
|
|
|
continue;
|
2017-10-30 15:52:02 +01:00
|
|
|
if (workitem->avw_database != MyDatabaseId)
|
|
|
|
continue;
|
2017-08-15 23:14:07 +02:00
|
|
|
|
|
|
|
/* claim this one, and release lock while performing it */
|
|
|
|
workitem->avw_active = true;
|
|
|
|
LWLockRelease(AutovacuumLock);
|
|
|
|
|
|
|
|
perform_work_item(workitem);
|
2017-04-01 19:00:53 +02:00
|
|
|
|
|
|
|
/*
|
2017-09-23 19:28:16 +02:00
|
|
|
* Check for config changes before acquiring lock for further jobs.
|
2017-04-01 19:00:53 +02:00
|
|
|
*/
|
2017-08-15 23:14:07 +02:00
|
|
|
CHECK_FOR_INTERRUPTS();
|
2019-12-17 19:03:57 +01:00
|
|
|
if (ConfigReloadPending)
|
2017-04-01 19:00:53 +02:00
|
|
|
{
|
2019-12-17 19:03:57 +01:00
|
|
|
ConfigReloadPending = false;
|
2017-08-15 23:14:07 +02:00
|
|
|
ProcessConfigFile(PGC_SIGHUP);
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
|
|
|
VacuumUpdateCosts();
|
2017-04-01 19:00:53 +02:00
|
|
|
}
|
|
|
|
|
2017-08-15 23:14:07 +02:00
|
|
|
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
|
|
|
|
|
|
|
|
/* and mark it done */
|
|
|
|
workitem->avw_active = false;
|
|
|
|
workitem->avw_used = false;
|
2017-04-01 19:00:53 +02:00
|
|
|
}
|
2017-08-15 23:14:07 +02:00
|
|
|
LWLockRelease(AutovacuumLock);
|
2017-04-01 19:00:53 +02:00
|
|
|
|
2008-08-13 02:07:50 +02:00
|
|
|
/*
|
|
|
|
* We leak table_toast_map here (among other things), but since we're
|
|
|
|
* going away soon, it's not a problem.
|
|
|
|
*/
|
|
|
|
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
/*
|
2017-03-17 14:46:58 +01:00
|
|
|
* Update pg_database.datfrozenxid, and truncate pg_xact if possible. We
|
2006-11-05 23:42:10 +01:00
|
|
|
* only need to do this once, not after each table.
|
2017-01-20 21:55:45 +01:00
|
|
|
*
|
|
|
|
* Even if we didn't vacuum anything, it may still be important to do
|
|
|
|
* this, because one indirect effect of vac_update_datfrozenxid() is to
|
2023-12-08 08:47:15 +01:00
|
|
|
* update TransamVariables->xidVacLimit. That might need to be done even
|
|
|
|
* if we haven't vacuumed anything, because relations with older
|
2017-01-20 21:55:45 +01:00
|
|
|
* relfrozenxid values or other databases with older datfrozenxid values
|
|
|
|
* might have been dropped, allowing xidVacLimit to advance.
|
|
|
|
*
|
|
|
|
* However, it's also important not to do this blindly in all cases,
|
|
|
|
* because when autovacuum=off this will restart the autovacuum launcher.
|
|
|
|
* If we're not careful, an infinite loop can result, where workers find
|
|
|
|
* no work to do and restart the launcher, which starts another worker in
|
|
|
|
* the same database that finds no work to do. To prevent that, we skip
|
|
|
|
* this if (1) we found no work to do and (2) we skipped at least one
|
|
|
|
* table due to concurrent autovacuum activity. In that case, the other
|
|
|
|
* worker has already done it, or will do so when it finishes.
|
2006-11-05 23:42:10 +01:00
|
|
|
*/
|
2017-01-20 21:55:45 +01:00
|
|
|
if (did_vacuum || !found_concurrent_worker)
|
|
|
|
vac_update_datfrozenxid();
|
2006-11-05 23:42:10 +01:00
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/* Finally close out the last transaction. */
|
|
|
|
CommitTransactionCommand();
|
|
|
|
}
|
|
|
|
|
2017-04-01 19:00:53 +02:00
|
|
|
/*
|
|
|
|
* Execute a previously registered work item.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
perform_work_item(AutoVacuumWorkItem *workitem)
|
|
|
|
{
|
|
|
|
char *cur_datname = NULL;
|
|
|
|
char *cur_nspname = NULL;
|
|
|
|
char *cur_relname = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note we do not store table info in MyWorkerInfo, since this is not
|
|
|
|
* vacuuming proper.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Save the relation name for a possible error message, to avoid a catalog
|
|
|
|
* lookup in case of an error. If any of these return NULL, then the
|
2017-10-30 15:52:02 +01:00
|
|
|
* relation has been dropped since last we checked; skip it.
|
2017-04-01 19:00:53 +02:00
|
|
|
*/
|
2017-09-23 19:28:16 +02:00
|
|
|
Assert(CurrentMemoryContext == AutovacMemCxt);
|
2017-04-01 19:00:53 +02:00
|
|
|
|
|
|
|
cur_relname = get_rel_name(workitem->avw_relation);
|
|
|
|
cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation));
|
|
|
|
cur_datname = get_database_name(MyDatabaseId);
|
|
|
|
if (!cur_relname || !cur_nspname || !cur_datname)
|
|
|
|
goto deleted2;
|
|
|
|
|
2019-02-22 17:00:16 +01:00
|
|
|
autovac_report_workitem(workitem, cur_nspname, cur_relname);
|
2017-04-01 19:00:53 +02:00
|
|
|
|
2017-09-23 19:28:16 +02:00
|
|
|
/* clean up memory before each work item */
|
2023-11-15 20:42:30 +01:00
|
|
|
MemoryContextReset(PortalContext);
|
2017-09-23 19:28:16 +02:00
|
|
|
|
2017-04-01 19:00:53 +02:00
|
|
|
/*
|
|
|
|
* We will abort the current work item if something errors out, and
|
|
|
|
* continue with the next one; in particular, this happens if we are
|
|
|
|
* interrupted with SIGINT. Note that this means that the work item list
|
|
|
|
* can be lossy.
|
|
|
|
*/
|
|
|
|
PG_TRY();
|
|
|
|
{
|
2017-09-23 19:28:16 +02:00
|
|
|
/* Use PortalContext for any per-work-item allocations */
|
|
|
|
MemoryContextSwitchTo(PortalContext);
|
2017-04-01 19:00:53 +02:00
|
|
|
|
2023-03-25 21:00:27 +01:00
|
|
|
/*
|
|
|
|
* Have at it. Functions called here are responsible for any required
|
|
|
|
* user switch and sandbox.
|
|
|
|
*/
|
2017-04-01 19:00:53 +02:00
|
|
|
switch (workitem->avw_type)
|
|
|
|
{
|
|
|
|
case AVW_BRINSummarizeRange:
|
|
|
|
DirectFunctionCall2(brin_summarize_range,
|
|
|
|
ObjectIdGetDatum(workitem->avw_relation),
|
|
|
|
Int64GetDatum((int64) workitem->avw_blockNumber));
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
elog(WARNING, "unrecognized work item found: type %d",
|
|
|
|
workitem->avw_type);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clear a possible query-cancel signal, to avoid a late reaction to
|
|
|
|
* an automatically-sent signal because of vacuuming the current table
|
|
|
|
* (we're done with it, so it would make no sense to cancel at this
|
|
|
|
* point.)
|
|
|
|
*/
|
|
|
|
QueryCancelPending = false;
|
|
|
|
}
|
|
|
|
PG_CATCH();
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Abort the transaction, start a new one, and proceed with the next
|
|
|
|
* table in our list.
|
|
|
|
*/
|
|
|
|
HOLD_INTERRUPTS();
|
|
|
|
errcontext("processing work entry for relation \"%s.%s.%s\"",
|
|
|
|
cur_datname, cur_nspname, cur_relname);
|
|
|
|
EmitErrorReport();
|
|
|
|
|
2020-11-16 23:42:55 +01:00
|
|
|
/* this resets ProcGlobal->statusFlags[i] too */
|
2017-04-01 19:00:53 +02:00
|
|
|
AbortOutOfAnyTransaction();
|
|
|
|
FlushErrorState();
|
2023-11-15 20:42:30 +01:00
|
|
|
MemoryContextReset(PortalContext);
|
2017-04-01 19:00:53 +02:00
|
|
|
|
|
|
|
/* restart our transaction for the following operations */
|
|
|
|
StartTransactionCommand();
|
|
|
|
RESUME_INTERRUPTS();
|
|
|
|
}
|
|
|
|
PG_END_TRY();
|
|
|
|
|
2017-09-23 19:28:16 +02:00
|
|
|
/* Make sure we're back in AutovacMemCxt */
|
|
|
|
MemoryContextSwitchTo(AutovacMemCxt);
|
|
|
|
|
2017-04-01 19:00:53 +02:00
|
|
|
/* We intentionally do not set did_vacuum here */
|
|
|
|
|
|
|
|
/* be tidy */
|
|
|
|
deleted2:
|
|
|
|
if (cur_datname)
|
|
|
|
pfree(cur_datname);
|
|
|
|
if (cur_nspname)
|
|
|
|
pfree(cur_nspname);
|
|
|
|
if (cur_relname)
|
|
|
|
pfree(cur_relname);
|
|
|
|
}
|
|
|
|
|
2007-03-23 22:23:13 +01:00
|
|
|
/*
|
2009-02-09 21:57:59 +01:00
|
|
|
* extract_autovac_opts
|
2008-08-13 02:07:50 +02:00
|
|
|
*
|
2009-02-09 21:57:59 +01:00
|
|
|
* Given a relation's pg_class tuple, return the AutoVacOpts portion of
|
|
|
|
* reloptions, if set; otherwise, return NULL.
|
2021-04-22 00:36:12 +02:00
|
|
|
*
|
|
|
|
* Note: callers do not have a relation lock on the table at this point,
|
|
|
|
* so the table could have been dropped, and its catalog rows gone, after
|
|
|
|
* we acquired the pg_class row. If pg_class had a TOAST table, this would
|
|
|
|
* be a risk; fortunately, it doesn't.
|
2007-03-23 22:23:13 +01:00
|
|
|
*/
|
2009-06-12 18:17:29 +02:00
|
|
|
static AutoVacOpts *
|
2009-02-09 21:57:59 +01:00
|
|
|
extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
|
2007-03-23 22:23:13 +01:00
|
|
|
{
|
2009-02-09 21:57:59 +01:00
|
|
|
bytea *relopts;
|
|
|
|
AutoVacOpts *av;
|
2007-03-23 22:23:13 +01:00
|
|
|
|
2009-02-09 21:57:59 +01:00
|
|
|
Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
|
2013-03-04 01:23:31 +01:00
|
|
|
((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
|
2009-02-09 21:57:59 +01:00
|
|
|
((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
|
2007-03-23 22:23:13 +01:00
|
|
|
|
Restructure index access method API to hide most of it at the C level.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
2016-01-18 01:36:59 +01:00
|
|
|
relopts = extractRelOptions(tup, pg_class_desc, NULL);
|
2009-02-09 21:57:59 +01:00
|
|
|
if (relopts == NULL)
|
|
|
|
return NULL;
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2009-02-09 21:57:59 +01:00
|
|
|
av = palloc(sizeof(AutoVacOpts));
|
|
|
|
memcpy(av, &(((StdRdOptions *) relopts)->autovacuum), sizeof(AutoVacOpts));
|
|
|
|
pfree(relopts);
|
2008-08-13 02:07:50 +02:00
|
|
|
|
2009-02-09 21:57:59 +01:00
|
|
|
return av;
|
2007-03-23 22:23:13 +01:00
|
|
|
}
|
|
|
|
|
2007-03-27 22:36:03 +02:00
|
|
|
|
2007-03-29 00:17:12 +02:00
|
|
|
/*
|
|
|
|
* table_recheck_autovac
|
|
|
|
*
|
2008-08-13 02:07:50 +02:00
|
|
|
* Recheck whether a table still needs vacuum or analyze. Return value is a
|
|
|
|
* valid autovac_table pointer if it does, NULL otherwise.
|
2008-07-17 23:02:31 +02:00
|
|
|
*
|
|
|
|
* Note that the returned autovac_table does not have the name fields set.
|
2007-03-29 00:17:12 +02:00
|
|
|
*/
|
|
|
|
static autovac_table *
|
2009-02-09 21:57:59 +01:00
|
|
|
table_recheck_autovac(Oid relid, HTAB *table_toast_map,
|
2015-05-08 18:09:14 +02:00
|
|
|
TupleDesc pg_class_desc,
|
|
|
|
int effective_multixact_freeze_max_age)
|
2007-03-29 00:17:12 +02:00
|
|
|
{
|
|
|
|
Form_pg_class classForm;
|
|
|
|
HeapTuple classTup;
|
|
|
|
bool dovacuum;
|
|
|
|
bool doanalyze;
|
|
|
|
autovac_table *tab = NULL;
|
2008-08-13 02:07:50 +02:00
|
|
|
bool wraparound;
|
2009-02-09 21:57:59 +01:00
|
|
|
AutoVacOpts *avopts;
|
2007-03-29 00:17:12 +02:00
|
|
|
|
|
|
|
/* fetch the relation's relcache entry */
|
2010-02-14 19:42:19 +01:00
|
|
|
classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
|
2007-03-29 00:17:12 +02:00
|
|
|
if (!HeapTupleIsValid(classTup))
|
|
|
|
return NULL;
|
|
|
|
classForm = (Form_pg_class) GETSTRUCT(classTup);
|
|
|
|
|
2009-02-09 21:57:59 +01:00
|
|
|
/*
|
|
|
|
* Get the applicable reloptions. If it is a TOAST table, try to get the
|
|
|
|
* main table reloptions if the toast table itself doesn't have any.
|
2008-08-13 02:07:50 +02:00
|
|
|
*/
|
2009-02-09 21:57:59 +01:00
|
|
|
avopts = extract_autovac_opts(classTup, pg_class_desc);
|
|
|
|
if (classForm->relkind == RELKIND_TOASTVALUE &&
|
|
|
|
avopts == NULL && table_toast_map != NULL)
|
|
|
|
{
|
|
|
|
av_relation *hentry;
|
|
|
|
bool found;
|
2008-08-13 02:07:50 +02:00
|
|
|
|
2009-02-09 21:57:59 +01:00
|
|
|
hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
|
|
|
|
if (found && hentry->ar_hasrelopts)
|
|
|
|
avopts = &hentry->ar_reloptions;
|
|
|
|
}
|
2007-03-29 00:17:12 +02:00
|
|
|
|
Speed up rechecking if a relation needs to be vacuumed or analyzed in autovacuum.
After autovacuum collects the relations to vacuum or analyze, it rechecks
whether each relation still needs to be vacuumed or analyzed before actually
doing that. Previously this recheck could be a significant overhead
especially when there were a very large number of relations. This was
because each recheck forced the statistics to be refreshed, and the refresh
of the statistics for a very large number of relations could cause heavy
overhead. There was a report that this issue caused autovacuum workers
to get "stuck" in a tight loop of table_recheck_autovac(), which
rechecks whether a relation needs to be vacuumed or analyzed.
This commit speeds up the recheck by making the autovacuum worker reuse
the previously-read statistics for the recheck if possible. Then, if those
"stale" statistics say that a relation still needs to be vacuumed or analyzed,
autovacuum refreshes the statistics and does the recheck again.
The benchmark shows that the more relations exist and the more autovacuum
workers run concurrently, the more this change reduces the autovacuum
execution time. For example, when there are 20,000 tables and 10 autovacuum
workers are running, the benchmark showed that the change improved
the performance of autovacuum more than three times. On the other hand,
even when there are only 1000 tables and only a single autovacuum worker
is running, the benchmark didn't show any big performance regression from
the change.
A first POC patch was proposed by Jim Nasby. As a result of the discussion,
we used Tatsuhito Kasahara's version of the patch, using the approach
suggested by Tom Lane.
Reported-by: Jim Nasby
Author: Tatsuhito Kasahara
Reviewed-by: Masahiko Sawada, Fujii Masao
Discussion: https://postgr.es/m/3FC6C2F2-8A47-44C0-B997-28830B5716D0@amazon.com
2020-12-08 15:59:39 +01:00
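The two-step recheck described in this commit message — consult the cached ("stale") statistics first, and pay for a refresh only when they still indicate work — can be sketched in miniature. This is a hedged toy model, not the real code: `ToyStats` stands in for `PgStat_StatTabEntry`, and `needs_work()` stands in for `relation_needs_vacanalyze()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stats entry standing in for PgStat_StatTabEntry (assumption). */
typedef struct ToyStats
{
    long dead_tuples;
} ToyStats;

/* Stand-in for the real vacuum/analyze decision. */
bool
needs_work(const ToyStats *s, long threshold)
{
    return s != NULL && s->dead_tuples > threshold;
}

/*
 * Sketch of the recheck: if the cached statistics already say "no work",
 * skip the expensive refresh entirely; otherwise refresh and decide from
 * the fresh numbers.  "refreshes" counts how often the refresh happened.
 */
bool
recheck_with_stale_stats(const ToyStats *cached, const ToyStats *fresh,
                         long threshold, int *refreshes)
{
    if (!needs_work(cached, threshold))
        return false;           /* stale stats say no: done, no refresh */
    (*refreshes)++;             /* expensive stats refresh happens here */
    return needs_work(fresh, threshold);
}
```

The saving comes from the first branch: when most collected relations no longer need work, the refresh cost is avoided for all of them.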
|
|
|
recheck_relation_needs_vacanalyze(relid, avopts, classForm,
|
|
|
|
effective_multixact_freeze_max_age,
|
|
|
|
&dovacuum, &doanalyze, &wraparound);
|
2007-03-29 00:17:12 +02:00
|
|
|
|
2008-08-13 02:07:50 +02:00
|
|
|
/* OK, it needs something done */
|
|
|
|
if (doanalyze || dovacuum)
|
2007-03-29 00:17:12 +02:00
|
|
|
{
|
|
|
|
int freeze_min_age;
|
2009-01-16 14:27:24 +01:00
|
|
|
int freeze_table_age;
|
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (defaults to 150 million, same as for Xids). Whichever of
the two reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when to force emergency,
uninterruptible whole-table scans (defaults to 400 million, double
that for Xids). This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
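The "freeze_max_age" style of trigger these parameters control — force a vacuum once the table's oldest unfrozen ID falls too far behind the current counter — can be sketched as a simple age check. This is a hedged toy: real PostgreSQL XID comparisons use modulo-2^32 arithmetic via TransactionIdPrecedes(); this sketch assumes the counter itself has not wrapped.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int ToyXid;    /* stand-in for TransactionId/MultiXactId */

/*
 * Sketch of the anti-wraparound trigger: vacuum is forced when the
 * table's relfrozenxid (or relminmxid, for multixacts) is more than
 * freeze_max_age IDs behind the next ID to be assigned.
 */
bool
force_wraparound_vacuum(ToyXid next_id, ToyXid oldest_unfrozen,
                        unsigned int freeze_max_age)
{
    return next_id - oldest_unfrozen > freeze_max_age;
}
```

The same check applies twice, once with the Xid-based limit and once with the multixact-based limit; either one exceeding its threshold forces the vacuum.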
|
|
|
int multixact_freeze_min_age;
|
|
|
|
int multixact_freeze_table_age;
|
2015-04-03 16:55:50 +02:00
|
|
|
int log_min_duration;
|
2007-03-29 00:17:12 +02:00
|
|
|
|
|
|
|
/*
|
2009-02-09 21:57:59 +01:00
|
|
|
* Calculate the vacuum cost parameters and the freeze ages. If there
|
|
|
|
* are options set in pg_class.reloptions, use them; in the case of a
|
|
|
|
* toast table, try the main table too. Otherwise use the GUC
|
|
|
|
* defaults, autovacuum's own first and plain vacuum second.
|
2007-03-29 00:17:12 +02:00
|
|
|
*/
|
2009-08-27 19:18:44 +02:00
|
|
|
|
2015-04-03 16:55:50 +02:00
|
|
|
/* -1 in autovac setting means use log_autovacuum_min_duration */
|
|
|
|
log_min_duration = (avopts && avopts->log_min_duration >= 0)
|
|
|
|
? avopts->log_min_duration
|
|
|
|
: Log_autovacuum_min_duration;
|
|
|
|
|
2009-08-27 19:18:44 +02:00
|
|
|
/* these do not have autovacuum-specific settings */
|
|
|
|
freeze_min_age = (avopts && avopts->freeze_min_age >= 0)
|
|
|
|
? avopts->freeze_min_age
|
|
|
|
: default_freeze_min_age;
|
|
|
|
|
|
|
|
freeze_table_age = (avopts && avopts->freeze_table_age >= 0)
|
|
|
|
? avopts->freeze_table_age
|
|
|
|
: default_freeze_table_age;
|
2007-03-29 00:17:12 +02:00
|
|
|
|
2014-02-13 23:30:30 +01:00
|
|
|
multixact_freeze_min_age = (avopts &&
|
|
|
|
avopts->multixact_freeze_min_age >= 0)
|
|
|
|
? avopts->multixact_freeze_min_age
|
|
|
|
: default_multixact_freeze_min_age;
|
|
|
|
|
|
|
|
multixact_freeze_table_age = (avopts &&
|
|
|
|
avopts->multixact_freeze_table_age >= 0)
|
|
|
|
? avopts->multixact_freeze_table_age
|
|
|
|
: default_multixact_freeze_table_age;
|
|
|
|
|
2007-03-29 00:17:12 +02:00
|
|
|
tab = palloc(sizeof(autovac_table));
|
|
|
|
tab->at_relid = relid;
|
2016-05-10 21:23:54 +02:00
|
|
|
tab->at_sharedrel = classForm->relisshared;
|
2021-02-09 06:13:57 +01:00
|
|
|
|
2023-01-06 20:17:25 +01:00
|
|
|
/*
|
|
|
|
* Select VACUUM options. Note we don't say VACOPT_PROCESS_TOAST, so
|
|
|
|
* that vacuum() skips toast relations. Also note we tell vacuum() to
|
|
|
|
* skip vac_update_datfrozenxid(); we'll do that separately.
|
|
|
|
*/
|
|
|
|
tab->at_params.options =
|
2023-03-06 08:41:05 +01:00
|
|
|
(dovacuum ? (VACOPT_VACUUM |
|
|
|
|
VACOPT_PROCESS_MAIN |
|
|
|
|
VACOPT_SKIP_DATABASE_STATS) : 0) |
|
2015-03-18 15:52:33 +01:00
|
|
|
(doanalyze ? VACOPT_ANALYZE : 0) |
|
2018-07-12 07:28:28 +02:00
|
|
|
(!wraparound ? VACOPT_SKIP_LOCKED : 0);
|
2021-06-19 05:04:07 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* index_cleanup and truncate are unspecified at first in autovacuum.
|
|
|
|
* They will be filled in with usable values using their reloptions
|
|
|
|
* (or reloption defaults) later.
|
|
|
|
*/
|
|
|
|
tab->at_params.index_cleanup = VACOPTVALUE_UNSPECIFIED;
|
|
|
|
tab->at_params.truncate = VACOPTVALUE_UNSPECIFIED;
|
2020-01-20 03:27:49 +01:00
|
|
|
/* As of now, we don't support parallel vacuum for autovacuum */
|
|
|
|
tab->at_params.nworkers = -1;
|
2015-03-18 15:52:33 +01:00
|
|
|
tab->at_params.freeze_min_age = freeze_min_age;
|
|
|
|
tab->at_params.freeze_table_age = freeze_table_age;
|
|
|
|
tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
|
|
|
|
tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
|
|
|
|
tab->at_params.is_wraparound = wraparound;
|
2015-04-03 16:55:50 +02:00
|
|
|
tab->at_params.log_min_duration = log_min_duration;
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
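The balancing rule this commit message arrives at — the configured cost limit shared evenly among just the workers vacuuming tables without per-table cost settings — can be sketched as below. This is a hedged illustration under that stated assumption; the function name is not PostgreSQL's.

```c
#include <assert.h>

/*
 * Sketch: split the global vacuum_cost_limit among the workers that
 * participate in balancing.  Workers vacuuming tables with cost-related
 * storage parameters are excluded from the count (they keep their own
 * settings and do not participate).
 */
int
balanced_cost_limit(int vacuum_cost_limit, int nworkers_for_balance)
{
    if (nworkers_for_balance <= 0)
        return vacuum_cost_limit;   /* nothing to balance against */

    int         limit = vacuum_cost_limit / nworkers_for_balance;

    return limit > 0 ? limit : 1;   /* never round down to zero */
}
```

Because only a worker count is needed, the per-worker cost limit and delay no longer have to live in shared memory, which is exactly the simplification the commit describes.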
|
|
|
tab->at_storage_param_vac_cost_limit = avopts ?
|
|
|
|
avopts->vacuum_cost_limit : 0;
|
|
|
|
tab->at_storage_param_vac_cost_delay = avopts ?
|
|
|
|
avopts->vacuum_cost_delay : -1;
|
2008-07-17 23:02:31 +02:00
|
|
|
tab->at_relname = NULL;
|
|
|
|
tab->at_nspname = NULL;
|
|
|
|
tab->at_datname = NULL;
|
Don't balance vacuum cost delay when per-table settings are in effect
When there are cost-delay-related storage options set for a table,
trying to make that table participate in the autovacuum cost-limit
balancing algorithm produces undesirable results: instead of using the
configured values, the global values are always used,
as illustrated by Mark Kirkwood in
http://www.postgresql.org/message-id/52FACF15.8020507@catalyst.net.nz
Since the mechanism is already complicated, just disable it for those
cases rather than trying to make it cope. There are undesirable
side-effects from this too, namely that the total I/O impact on the
system will be higher whenever such tables are vacuumed. However, this
is seen as less harmful than slowing down vacuum, because that would
cause bloat to accumulate. Anyway, in the new system it is possible to
tweak options to get the precise behavior one wants, whereas with the
previous system one was simply hosed.
This has been broken forever, so backpatch to all supported branches.
This might affect systems where cost_limit and cost_delay have been set
for individual tables.
2014-10-03 18:01:27 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If any of the cost delay parameters has been set individually for
|
|
|
|
* this table, disable the balancing algorithm.
|
|
|
|
*/
|
|
|
|
tab->at_dobalance =
|
|
|
|
!(avopts && (avopts->vacuum_cost_limit > 0 ||
|
2023-04-25 13:54:10 +02:00
|
|
|
avopts->vacuum_cost_delay >= 0));
|
2007-03-29 00:17:12 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
heap_freetuple(classTup);
|
|
|
|
return tab;
|
|
|
|
}
|
|
|
|
|
2020-12-08 15:59:39 +01:00
|
|
|
/*
|
|
|
|
* recheck_relation_needs_vacanalyze
|
|
|
|
*
|
|
|
|
* Subroutine for table_recheck_autovac.
|
|
|
|
*
|
|
|
|
* Fetch the pgstat of a relation and recheck whether a relation
|
|
|
|
* needs to be vacuumed or analyzed.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
recheck_relation_needs_vacanalyze(Oid relid,
|
|
|
|
AutoVacOpts *avopts,
|
|
|
|
Form_pg_class classForm,
|
|
|
|
int effective_multixact_freeze_max_age,
|
|
|
|
bool *dovacuum,
|
|
|
|
bool *doanalyze,
|
|
|
|
bool *wraparound)
|
|
|
|
{
|
|
|
|
PgStat_StatTabEntry *tabentry;
|
|
|
|
|
|
|
|
/* fetch the pgstat table entry */
|
2022-04-07 06:29:46 +02:00
|
|
|
tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared,
|
|
|
|
relid);
|
2020-12-08 15:59:39 +01:00
|
|
|
|
|
|
|
relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
|
|
|
|
effective_multixact_freeze_max_age,
|
|
|
|
dovacuum, doanalyze, wraparound);
|
|
|
|
|
|
|
|
/* ignore ANALYZE for toast tables */
|
|
|
|
if (classForm->relkind == RELKIND_TOASTVALUE)
|
|
|
|
*doanalyze = false;
|
|
|
|
}
|
|
|
|
|
2007-03-29 00:17:12 +02:00
|
|
|
/*
|
|
|
|
* relation_needs_vacanalyze
|
|
|
|
*
|
|
|
|
* Check whether a relation needs to be vacuumed or analyzed; return each into
|
2007-10-24 22:55:36 +02:00
|
|
|
* "dovacuum" and "doanalyze", respectively. Also return whether the vacuum is
|
2014-02-13 23:30:30 +01:00
|
|
|
* being forced because of Xid or multixact wraparound.
|
2009-02-09 21:57:59 +01:00
|
|
|
*
|
|
|
|
* relopts is a pointer to the AutoVacOpts options (either for itself in the
|
|
|
|
* case of a plain table, or for either itself or its parent table in the case
|
|
|
|
* of a TOAST table), NULL if none; tabentry is the pgstats entry, which can be
|
|
|
|
* NULL.
|
2005-07-14 07:13:45 +02:00
|
|
|
*
|
|
|
|
* A table needs to be vacuumed if the number of dead tuples exceeds a
|
|
|
|
* threshold. This threshold is calculated as
|
|
|
|
*
|
|
|
|
* threshold = vac_base_thresh + vac_scale_factor * reltuples
|
|
|
|
*
|
|
|
|
* For analyze, the analysis done is that the number of tuples inserted,
|
|
|
|
* deleted and updated since the last analyze exceeds a threshold calculated
|
2022-04-06 22:56:06 +02:00
|
|
|
* in the same fashion as above. Note that the cumulative stats system stores
|
2005-07-14 07:13:45 +02:00
|
|
|
* the number of tuples (both live and dead) that there were as of the last
|
|
|
|
* analyze. This is asymmetric to the VACUUM case.
|
|
|
|
*
|
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
|
|
|
* We also force vacuum if the table's relfrozenxid is more than freeze_max_age
|
2014-02-13 23:30:30 +01:00
|
|
|
* transactions back, and if its relminmxid is more than
|
|
|
|
* multixact_freeze_max_age multixacts back.
|
2006-11-05 23:42:10 +01:00
|
|
|
*
|
2009-02-09 21:57:59 +01:00
|
|
|
* A table whose autovacuum_enabled option is false is
|
|
|
|
* automatically skipped (unless we have to vacuum it due to freeze_max_age).
|
2022-04-06 22:56:06 +02:00
|
|
|
* Thus autovacuum can be disabled for specific tables. Also, when the cumulative
|
|
|
|
* stats system does not have data about a table, it will be skipped.
|
2005-07-14 07:13:45 +02:00
|
|
|
*
|
2009-02-09 21:57:59 +01:00
|
|
|
* A table whose vac_base_thresh value is < 0 takes the base value from the
|
2005-07-14 07:13:45 +02:00
|
|
|
* autovacuum_vacuum_threshold GUC variable. Similarly, a vac_scale_factor
|
2009-02-09 21:57:59 +01:00
|
|
|
* value < 0 is substituted with the value of
|
2005-07-14 07:13:45 +02:00
|
|
|
* autovacuum_vacuum_scale_factor GUC variable. Ditto for analyze.
|
|
|
|
*/
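The threshold formula in the comment above lends itself to a small worked example. This is a hedged sketch of the arithmetic only, with the stock defaults (base threshold 50, scale factor 0.2) plugged in by the test; it is not the real decision code, which also consults reloptions and wraparound state.

```c
#include <assert.h>

/* threshold = vac_base_thresh + vac_scale_factor * reltuples */
float
vacuum_threshold(int vac_base_thresh, float vac_scale_factor, float reltuples)
{
    return vac_base_thresh + vac_scale_factor * reltuples;
}

/* Vacuum is needed once the dead-tuple count exceeds the threshold. */
int
needs_vacuum(long dead_tuples, int vac_base_thresh,
             float vac_scale_factor, float reltuples)
{
    return (float) dead_tuples >
        vacuum_threshold(vac_base_thresh, vac_scale_factor, reltuples);
}
```

For a table with 10,000 rows and the defaults, the threshold is 50 + 0.2 * 10000 = 2050 dead tuples; the 2051st dead tuple tips the check.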
|
|
|
|
static void
|
2007-03-29 00:17:12 +02:00
|
|
|
relation_needs_vacanalyze(Oid relid,
|
2009-02-09 21:57:59 +01:00
|
|
|
AutoVacOpts *relopts,
|
2007-03-29 00:17:12 +02:00
|
|
|
Form_pg_class classForm,
|
|
|
|
PgStat_StatTabEntry *tabentry,
|
2015-05-08 18:09:14 +02:00
|
|
|
int effective_multixact_freeze_max_age,
|
2007-03-29 00:17:12 +02:00
|
|
|
/* output params below */
|
|
|
|
bool *dovacuum,
|
2007-10-24 22:55:36 +02:00
|
|
|
bool *doanalyze,
|
|
|
|
bool *wraparound)
|
2005-07-14 07:13:45 +02:00
|
|
|
{
|
2006-11-05 23:42:10 +01:00
|
|
|
bool force_vacuum;
|
2009-02-09 21:57:59 +01:00
|
|
|
bool av_enabled;
|
2005-07-14 07:13:45 +02:00
|
|
|
float4 reltuples; /* pg_class.reltuples */
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2009-02-09 21:57:59 +01:00
|
|
|
/* constants from reloptions or GUC variables */
|
2005-07-14 07:13:45 +02:00
|
|
|
int vac_base_thresh,
|
Trigger autovacuum based on number of INSERTs
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
2020-03-28 07:20:12 +01:00
|
|
|
vac_ins_base_thresh,
|
2005-07-14 07:13:45 +02:00
|
|
|
anl_base_thresh;
|
|
|
|
float4 vac_scale_factor,
|
Trigger autovacuum based on number of INSERTs
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
2020-03-28 07:20:12 +01:00
|
|
|
vac_ins_scale_factor,
|
2005-07-14 07:13:45 +02:00
|
|
|
anl_scale_factor;
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/* thresholds calculated from above constants */
|
|
|
|
float4 vacthresh,
|
Trigger autovacuum based on number of INSERTs
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
2020-03-28 07:20:12 +01:00
|
|
|
vacinsthresh,
|
2005-07-14 07:13:45 +02:00
|
|
|
anlthresh;
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2005-07-14 07:13:45 +02:00
|
|
|
/* number of vacuum (resp. analyze) tuples at this time */
|
|
|
|
float4 vactuples,
|
Trigger autovacuum based on number of INSERTs
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
2020-03-28 07:20:12 +01:00
|
|
|
instuples,
|
2005-07-14 07:13:45 +02:00
|
|
|
anltuples;
|
2007-11-15 22:14:46 +01:00
|
|
|
|

    /* freeze parameters */
    int         freeze_max_age;
    int         multixact_freeze_max_age;
    TransactionId xidForceLimit;
    MultiXactId multiForceLimit;

    Assert(classForm != NULL);
    Assert(OidIsValid(relid));

    /*
     * Determine vacuum/analyze equation parameters.  We have two possible
     * sources: the passed reloptions (which could be a main table or a toast
     * table), or the autovacuum GUC variables.
     */

    /* -1 in autovac setting means use plain vacuum_scale_factor */
    vac_scale_factor = (relopts && relopts->vacuum_scale_factor >= 0)
        ? relopts->vacuum_scale_factor
        : autovacuum_vac_scale;

    vac_base_thresh = (relopts && relopts->vacuum_threshold >= 0)
        ? relopts->vacuum_threshold
        : autovacuum_vac_thresh;

    vac_ins_scale_factor = (relopts && relopts->vacuum_ins_scale_factor >= 0)
        ? relopts->vacuum_ins_scale_factor
        : autovacuum_vac_ins_scale;

    /* -1 is used to disable insert vacuums */
    vac_ins_base_thresh = (relopts && relopts->vacuum_ins_threshold >= -1)
        ? relopts->vacuum_ins_threshold
        : autovacuum_vac_ins_thresh;

    anl_scale_factor = (relopts && relopts->analyze_scale_factor >= 0)
        ? relopts->analyze_scale_factor
        : autovacuum_anl_scale;

    anl_base_thresh = (relopts && relopts->analyze_threshold >= 0)
        ? relopts->analyze_threshold
        : autovacuum_anl_thresh;

    freeze_max_age = (relopts && relopts->freeze_max_age >= 0)
        ? Min(relopts->freeze_max_age, autovacuum_freeze_max_age)
        : autovacuum_freeze_max_age;

    multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age >= 0)
        ? Min(relopts->multixact_freeze_max_age, effective_multixact_freeze_max_age)
        : effective_multixact_freeze_max_age;

    av_enabled = (relopts ? relopts->enabled : true);

    /* Force vacuum if table is at risk of wraparound */
    xidForceLimit = recentXid - freeze_max_age;
    if (xidForceLimit < FirstNormalTransactionId)
        xidForceLimit -= FirstNormalTransactionId;
    force_vacuum = (TransactionIdIsNormal(classForm->relfrozenxid) &&
                    TransactionIdPrecedes(classForm->relfrozenxid,
                                          xidForceLimit));
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
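The conflict semantics this message describes can be summarized as a small matrix (an illustrative encoding of the documented row-lock conflict table, not PostgreSQL's internal representation):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative encoding of the four tuple lock modes, weakest first. */
typedef enum
{
	LOCK_KEY_SHARE,			/* SELECT FOR KEY SHARE */
	LOCK_SHARE,				/* SELECT FOR SHARE */
	LOCK_NO_KEY_UPDATE,		/* SELECT FOR NO KEY UPDATE */
	LOCK_UPDATE				/* SELECT FOR UPDATE */
} TupleLockMode;

/* Documented conflict matrix: KEY SHARE conflicts only with FOR UPDATE,
 * so it passes freely by NO KEY UPDATE — the point of the patch. */
static const bool conflicts[4][4] = {
	/*                 KS     SHARE  NKU    UPD  */
	/* KS    */ {false, false, false, true},
	/* SHARE */ {false, false, true,  true},
	/* NKU   */ {false, true,  true,  true},
	/* UPD   */ {true,  true,  true,  true},
};
```

An UPDATE that leaves key columns alone takes NO KEY UPDATE, so foreign-key triggers holding KEY SHARE no longer block it.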
	if (!force_vacuum)
	{
Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.
Therefore, we now have multixact-specific freezing parameters:
vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)
vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids).  Whichever of
the two reaches the 150 million mark earlier will cause a whole-table
scan.
autovacuum_multixact_freeze_max_age: when to force emergency,
uninterruptible whole-table scans (default to 400 million, double
that for Xids).  This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.
Backpatch to 9.3 where multixacts were made to persist enough to require
freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.
Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 23:30:30 +01:00
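A minimal sketch of the "whichever reaches the mark earlier" rule for whole-table scans, using hypothetical helper and parameter names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A whole-table (aggressive) scan is forced as soon as either the Xid
 * age or the multixact age crosses its own table-age limit; the two
 * counters advance at independent rates, hence the separate knobs. */
static bool
needs_whole_table_scan(uint32_t xid_age, uint32_t mxid_age,
					   uint32_t freeze_table_age,
					   uint32_t multixact_freeze_table_age)
{
	return xid_age >= freeze_table_age ||
		mxid_age >= multixact_freeze_table_age;
}
```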
		multiForceLimit = recentMulti - multixact_freeze_max_age;
		if (multiForceLimit < FirstMultiXactId)
			multiForceLimit -= FirstMultiXactId;
		force_vacuum = MultiXactIdIsValid(classForm->relminmxid) &&
			MultiXactIdPrecedes(classForm->relminmxid, multiForceLimit);
	}

	*wraparound = force_vacuum;
	/* User disabled it in pg_class.reloptions?  (But ignore if at risk) */
	if (!av_enabled && !force_vacuum)
	{
		*doanalyze = false;
		*dovacuum = false;
Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 23:42:10 +01:00
		return;
	}

	/*
	 * If we found stats for the table, and autovacuum is currently enabled,
	 * make a threshold-based decision whether to vacuum and/or analyze.  If
	 * autovacuum is currently disabled, we must be here for anti-wraparound
	 * vacuuming only, so don't vacuum (or analyze) anything that's not being
	 * forced.
	 */
	if (PointerIsValid(tabentry) && AutoVacuumingActive())
	{
		reltuples = classForm->reltuples;
		vactuples = tabentry->dead_tuples;
		instuples = tabentry->ins_since_vacuum;
		anltuples = tabentry->mod_since_analyze;
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
Historically, we've considered the state with relpages and reltuples
both zero as indicating that we do not know the table's tuple density.
This is problematic because it's impossible to distinguish "never yet
vacuumed" from "vacuumed and seen to be empty". In particular, a user
cannot use VACUUM or ANALYZE to override the planner's normal heuristic
that an empty table should not be believed to be empty because it is
probably about to get populated. That heuristic is a good safety
measure, so I don't care to abandon it, but there should be a way to
override it if the table is indeed intended to stay empty.
Hence, represent the initial state of ignorance by setting reltuples
to -1 (relpages is still set to zero), and apply the minimum-ten-pages
heuristic only when reltuples is still -1. If the table is empty,
VACUUM or ANALYZE (but not CREATE INDEX) will override that to
reltuples = relpages = 0, and then we'll plan on that basis.
This requires a bunch of fiddly little changes, but we can get rid of
some ugly kluges that were formerly needed to maintain the old definition.
One notable point is that FDWs' GetForeignRelSize methods will see
baserel->tuples = -1 when no ANALYZE has been done on the foreign table.
That seems like a net improvement, since those methods were formerly
also in the dark about what baserel->tuples = 0 really meant. Still,
it is an API change.
I bumped catversion because code predating this change would get confused
by seeing reltuples = -1.
Discussion: https://postgr.es/m/F02298E0-6EF4-49A1-BCB6-C484794D9ACC@thebuild.com
2020-08-30 18:21:51 +02:00
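The sentinel convention can be illustrated with a tiny helper (hypothetical; it simply mirrors the clamp the autovacuum code applies):

```c
#include <assert.h>

/* Under the new definition, reltuples == -1 means "never vacuumed or
 * analyzed", while 0 means "vacuumed and seen to be empty".  Consumers
 * that need a usable estimate clamp the sentinel to zero. */
static double
effective_reltuples(double reltuples)
{
	return (reltuples < 0) ? 0 : reltuples;
}
```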
		/* If the table hasn't yet been vacuumed, take reltuples as zero */
		if (reltuples < 0)
			reltuples = 0;
		vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
Trigger autovacuum based on number of INSERTs
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
2020-03-28 07:20:12 +01:00
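The insert-driven trigger described above can be sketched in isolation (hypothetical function name; the default values 1000 and 0.2 are the ones named in the message):

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold logic for insert-driven autovacuum: fire once the inserts
 * since the last vacuum exceed base_thresh + scale_factor * reltuples;
 * a negative base threshold disables this path entirely. */
static bool
needs_insert_vacuum(double ins_since_vacuum, double reltuples,
					double ins_base_thresh, double ins_scale_factor)
{
	if (ins_base_thresh < 0)
		return false;			/* autovacuum_vacuum_insert_threshold = -1 */
	return ins_since_vacuum > ins_base_thresh + ins_scale_factor * reltuples;
}
```

For a 10000-tuple table at the defaults, the trigger point is 1000 + 0.2 * 10000 = 3000 inserts.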
		vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * reltuples;
		anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;

		/*
		 * Note that we don't need to take special consideration for stat
		 * reset, because if that happens, the last vacuum and analyze counts
		 * will be reset too.
		 */
		if (vac_ins_base_thresh >= 0)
			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
				 NameStr(classForm->relname),
				 vactuples, vacthresh, instuples, vacinsthresh, anltuples, anlthresh);
		else
			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: (disabled), anl: %.0f (threshold %.0f)",
				 NameStr(classForm->relname),
				 vactuples, vacthresh, anltuples, anlthresh);
		/* Determine if this table needs vacuum or analyze. */
		*dovacuum = force_vacuum || (vactuples > vacthresh) ||
			(vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
		*doanalyze = (anltuples > anlthresh);
	}
	else
	{
		/*
		 * Skip a table not found in stat hash, unless we have to force vacuum
		 * for anti-wrap purposes.  If it's not acted upon, there's no need to
		 * vacuum it.
		 */
		*dovacuum = force_vacuum;
		*doanalyze = false;
|
|
|
}
|
2005-08-11 23:11:50 +02:00
|
|
|
|
2017-01-25 20:35:31 +01:00
|
|
|
/* ANALYZE refuses to work with pg_statistic */
|
2005-08-11 23:11:50 +02:00
|
|
|
if (relid == StatisticRelationId)
|
2007-03-29 00:17:12 +02:00
|
|
|
*doanalyze = false;
|
2005-07-14 07:13:45 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
 * autovacuum_do_vac_analyze
 *		Vacuum and/or analyze the specified table
 *
 * We expect the caller to have switched into a memory context that won't
 * disappear at transaction commit.
 */
static void
autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy)
{
	RangeVar   *rangevar;
	VacuumRelation *rel;
	List	   *rel_list;
	MemoryContext vac_context;

	/* Let pgstat know what we're doing */
	autovac_report_activity(tab);

	/* Set up one VacuumRelation target, identified by OID, for vacuum() */
	rangevar = makeRangeVar(tab->at_nspname, tab->at_relname, -1);
	rel = makeVacuumRelation(rangevar, tab->at_relid, NIL);
	rel_list = list_make1(rel);

	vac_context = AllocSetContextCreate(CurrentMemoryContext,
										"Vacuum",
										ALLOCSET_DEFAULT_SIZES);

	vacuum(rel_list, &tab->at_params, bstrategy, vac_context, true);

	MemoryContextDelete(vac_context);
}

/*
 * autovac_report_activity
 *		Report to pgstat what autovacuum is doing
 *
 * We send a SQL string corresponding to what the user would see if the
 * equivalent command was to be issued manually.
 *
 * Note we assume that we are going to report the next command as soon as we're
 * done with the current one, and exit right after the last one, so we don't
 * bother to report "<IDLE>" or some such.
 */
static void
autovac_report_activity(autovac_table *tab)
{
#define MAX_AUTOVAC_ACTIV_LEN (NAMEDATALEN * 2 + 56)
	char		activity[MAX_AUTOVAC_ACTIV_LEN];
	int			len;

	/* Report the command and possible options */
	if (tab->at_params.options & VACOPT_VACUUM)
		snprintf(activity, MAX_AUTOVAC_ACTIV_LEN,
				 "autovacuum: VACUUM%s",
				 tab->at_params.options & VACOPT_ANALYZE ? " ANALYZE" : "");
	else
		snprintf(activity, MAX_AUTOVAC_ACTIV_LEN,
				 "autovacuum: ANALYZE");

	/*
	 * Report the qualified name of the relation.
	 */
	len = strlen(activity);

	snprintf(activity + len, MAX_AUTOVAC_ACTIV_LEN - len,
			 " %s.%s%s", tab->at_nspname, tab->at_relname,
			 tab->at_params.is_wraparound ? " (to prevent wraparound)" : "");

	/* Set statement_timestamp() to current time for pg_stat_activity */
	SetCurrentStatementStartTimestamp();

	pgstat_report_activity(STATE_RUNNING, activity);
}

/*
 * autovac_report_workitem
 *		Report to pgstat that autovacuum is processing a work item
 */
static void
autovac_report_workitem(AutoVacuumWorkItem *workitem,
						const char *nspname, const char *relname)
{
	char		activity[MAX_AUTOVAC_ACTIV_LEN + 12 + 2];
	char		blk[12 + 2];
	int			len;

	switch (workitem->avw_type)
	{
		case AVW_BRINSummarizeRange:
			snprintf(activity, MAX_AUTOVAC_ACTIV_LEN,
					 "autovacuum: BRIN summarize");
			break;
	}

	/*
	 * Report the qualified name of the relation, and the block number if any
	 */
	len = strlen(activity);

	if (BlockNumberIsValid(workitem->avw_blockNumber))
		snprintf(blk, sizeof(blk), " %u", workitem->avw_blockNumber);
	else
		blk[0] = '\0';

	snprintf(activity + len, MAX_AUTOVAC_ACTIV_LEN - len,
			 " %s.%s%s", nspname, relname, blk);

	/* Set statement_timestamp() to current time for pg_stat_activity */
	SetCurrentStatementStartTimestamp();

	pgstat_report_activity(STATE_RUNNING, activity);
}

/*
 * AutoVacuumingActive
 *		Check GUC vars and report whether the autovacuum process should be
 *		running.
 */
bool
AutoVacuumingActive(void)
{
	if (!autovacuum_start_daemon || !pgstat_track_counts)
		return false;
	return true;
}

/*
 * Request one work item for the next autovacuum run processing our database.
 * Return false if the request can't be recorded.
 */
bool
AutoVacuumRequestWork(AutoVacuumWorkItemType type, Oid relationId,
					  BlockNumber blkno)
{
	int			i;
	bool		result = false;

	LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

	/*
	 * Locate an unused work item and fill it with the given data.
	 */
	for (i = 0; i < NUM_WORKITEMS; i++)
	{
		AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];

		if (workitem->avw_used)
			continue;

		workitem->avw_used = true;
		workitem->avw_active = false;
		workitem->avw_type = type;
		workitem->avw_database = MyDatabaseId;
		workitem->avw_relation = relationId;
		workitem->avw_blockNumber = blkno;
		result = true;

		/* done */
		break;
	}

	LWLockRelease(AutovacuumLock);

	return result;
}

/*
 * autovac_init
 *		This is called at postmaster initialization.
 *
 * All we do here is annoy the user if he got it wrong.
 */
void
autovac_init(void)
{
	if (autovacuum_start_daemon && !pgstat_track_counts)
		ereport(WARNING,
				(errmsg("autovacuum not started because of misconfiguration"),
				 errhint("Enable the \"track_counts\" option.")));
}

/*
 * IsAutoVacuum functions
 *		Return whether this is either a launcher autovacuum process or a worker
 *		process.
 */
bool
IsAutoVacuumLauncherProcess(void)
{
	return am_autovacuum_launcher;
}

bool
IsAutoVacuumWorkerProcess(void)
{
	return am_autovacuum_worker;
}

/*
 * AutoVacuumShmemSize
 *		Compute space needed for autovacuum-related shared memory
 */
Size
AutoVacuumShmemSize(void)
{
	Size		size;

	/*
	 * Need the fixed struct and the array of WorkerInfoData.
	 */
	size = sizeof(AutoVacuumShmemStruct);
	size = MAXALIGN(size);
	size = add_size(size, mul_size(autovacuum_max_workers,
								   sizeof(WorkerInfoData)));
	return size;
}

/*
 * AutoVacuumShmemInit
 *		Allocate and initialize autovacuum-related shared memory
 */
void
AutoVacuumShmemInit(void)
{
	bool		found;

	AutoVacuumShmem = (AutoVacuumShmemStruct *)
		ShmemInitStruct("AutoVacuum Data",
						AutoVacuumShmemSize(),
						&found);

	if (!IsUnderPostmaster)
	{
		WorkerInfo	worker;
		int			i;

		Assert(!found);

		AutoVacuumShmem->av_launcherpid = 0;
		dlist_init(&AutoVacuumShmem->av_freeWorkers);
		dlist_init(&AutoVacuumShmem->av_runningWorkers);
		AutoVacuumShmem->av_startingWorker = NULL;
		memset(AutoVacuumShmem->av_workItems, 0,
			   sizeof(AutoVacuumWorkItem) * NUM_WORKITEMS);

		worker = (WorkerInfo) ((char *) AutoVacuumShmem +
							   MAXALIGN(sizeof(AutoVacuumShmemStruct)));

		/* initialize the WorkerInfo free list */
		for (i = 0; i < autovacuum_max_workers; i++)
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
|
|
|
{
|
2012-10-19 01:04:20 +02:00
|
|
|
dlist_push_head(&AutoVacuumShmem->av_freeWorkers,
|
|
|
|
&worker[i].wi_links);
|
Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
|
|
|
pg_atomic_init_flag(&worker[i].wi_dobalance);
|
|
|
|
}
|
|
|
|
|
|
|
|
pg_atomic_init_u32(&AutoVacuumShmem->av_nworkersForBalance, 0);
|
|
|
|
|
2007-04-16 20:30:04 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
Assert(found);
|
2005-07-14 07:13:45 +02:00
|
|
|
}

/*
 * GUC check_hook for autovacuum_work_mem
 */
bool
check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
{
	/*
	 * -1 indicates fallback.
	 *
	 * If we haven't yet changed the boot_val default of -1, just let it be.
	 * Autovacuum will look to maintenance_work_mem instead.
	 */
	if (*newval == -1)
		return true;

	/*
	 * We clamp manually-set values to at least 1MB.  Since
	 * maintenance_work_mem is always set to at least this value, do the same
	 * here.
	 */
	if (*newval < 1024)
		*newval = 1024;

	return true;
}