postgresql

Commit Graph

Author	SHA1	Message	Date
Andres Freund	5891c7a8ed	pgstat: store statistics in shared memory. Previously the statistics collector received statistics updates via UDP and shared statistics data by writing them out to temporary files regularly. These files can reach tens of megabytes and are written out up to twice a second. This has repeatedly prevented us from adding additional useful statistics. Now statistics are stored in shared memory. Statistics for variable-numbered objects are stored in a dshash hashtable (backed by dynamic shared memory). Fixed-numbered stats are stored in plain shared memory. The header for pgstat.c contains an overview of the architecture. The stats collector is not needed anymore, remove it. By utilizing the transactional statistics drop infrastructure introduced in a prior commit statistics entries cannot "leak" anymore. Previously leaked statistics were dropped by pgstat_vacuum_stat(), called from [auto-]vacuum. On systems with many small relations pgstat_vacuum_stat() could be quite expensive. Now that replicas drop statistics entries for dropped objects, it is not necessary anymore to reset stats when starting from a cleanly shut down replica. Subsequent commits will perform some further code cleanup, adapt docs and add tests. Bumps PGSTAT_FILE_FORMAT_ID. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: Justin Pryzby <pryzby@telsasoft.com> Reviewed-By: "David G. Johnston" <david.g.johnston@gmail.com> Reviewed-By: Tomas Vondra <tomas.vondra@2ndquadrant.com> (in a much earlier version) Reviewed-By: Arthur Zakirov <a.zakirov@postgrespro.ru> (in a much earlier version) Reviewed-By: Antonin Houska <ah@cybertec.at> (in a much earlier version) Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de Discussion: https://postgr.es/m/20210319235115.y3wz7hpnnrshdyv6@alap3.anarazel.de	2022-04-06 21:29:46 -07:00
Andres Freund	be902e2651	pgstat: normalize function naming. Most of pgstat uses pgstat_<verb>_<subject>() or just <verb>_<subject>(). But not all (some introduced fairly recently by me). Rename ones that aren't intentionally following a different scheme (e.g. AtEOXact_*).	2022-04-06 21:29:46 -07:00
Andres Freund	8b1dccd37c	pgstat: scaffolding for transactional stats creation / drop. One problematic part of the current statistics collector design is that there is no reliable way of getting rid of statistics entries. Because of that pgstat_vacuum_stat() (called by [auto-]vacuum) matches all stats for the current database with the catalog contents and tries to drop now-superfluous entries. That's quite expensive. What's worse, it doesn't work on physical replicas, despite physical replicas collection statistics entries. This commit introduces infrastructure to create / drop statistics entries transactionally, together with the underlying catalog objects (functions, relations, subscriptions). pgstat_xact.c maintains a list of stats entries created / dropped transactionally in the current transaction. To ensure the removal of statistics entries is durable dropped statistics entries are included in commit / abort (and prepare) records, which also ensures that stats entries are dropped on standbys. Statistics entries created separately from creating the underlying catalog object (e.g. when stats were previously lost due to an immediate restart) are not WAL logged. However that can only happen outside of the transaction creating the catalog object, so it does not lead to "leaked" statistics entries. For this to work, functions creating / dropping functions / relations / subscriptions need to call into pgstat. For subscriptions this was already done when dropping subscriptions, via pgstat_report_subscription_drop() (now renamed to pgstat_drop_subscription()). This commit does not actually drop stats yet, it just provides the infrastructure. It is however a largely independent piece of infrastructure, so committing it separately makes sense. Bumps XLOG_PAGE_MAGIC. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de	2022-04-06 18:27:52 -07:00
Tom Lane	dbafe127bb	Suppress "variable 'pagesaving' set but not used" warning. With asserts disabled, late-model clang notices that this variable is incremented but never otherwise read. Discussion: https://postgr.es/m/3171401.1649275153@sss.pgh.pa.us	2022-04-06 17:03:50 -04:00
Andres Freund	bdbd3d9064	pgstat: stats collector references in comments. Soon the stats collector will be no more, with statistics instead getting stored in shared memory. There are a lot of references to the stats collector in comments. This commit replaces most of these references with "cumulative statistics system", with the remaining ones getting replaced as part of subsequent commits. This is done separately from the - quite large - shared memory statistics patch to make review easier. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Justin Pryzby <pryzby@telsasoft.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de	2022-04-06 13:56:06 -07:00
Stephen Frost	39969e2a1e	Remove exclusive backup mode Exclusive-mode backups have been deprecated since 9.6 (when non-exclusive backups were introduced) due to the issues they can cause should the system crash while one is running and generally because non-exclusive provides a much better interface. Further, exclusive backup mode wasn't really being tested (nor was most of the related code- like being able to log in just to stop an exclusive backup and the bits of the state machine related to that) and having to possibly deal with an exclusive backup and the backup_label file existing during pg_basebackup, pg_rewind, etc, added other complexities that we are better off without. This patch removes the exclusive backup mode, the various special cases for dealing with it, and greatly simplifies the online backup code and documentation. Authors: David Steele, Nathan Bossart Reviewed-by: Chapman Flack Discussion: https://postgr.es/m/ac7339ca-3718-3c93-929f-99e725d1172c@pgmasters.net https://postgr.es/m/CAHg+QDfiM+WU61tF6=nPZocMZvHDzCK47Kneyb0ZRULYzV5sKQ@mail.gmail.com	2022-04-06 14:41:03 -04:00
Peter Eisentraut	01effb1304	Fix unsigned output format in SLRU error reporting Avoid printing signed values as unsigned. (No impact in practice expected.) Author: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CALT9ZEHN7hWJo6MgJKqoDMGj%3DGOzQU50wTvOYZXDj7x%3DsUK-kw%40mail.gmail.com	2022-04-06 09:15:05 +02:00
Peter Geoghegan	c42a6fc41d	vacuumlazy.c: Further consolidate resource allocation. Move remaining VACUUM resource allocation and deallocation code from lazy_scan_heap() to its caller, heap_vacuum_rel(). This finishes off work started by commit `73f6ec3d`. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wzk3fNBa_S3Ngi+16GQiyJ=AmUu3oUY99syMDTMRxitfyQ@mail.gmail.com	2022-04-04 11:53:33 -07:00
David Rowley	77bae396df	Adjust tuplesort API to have bitwise option flags This replaces the bool flag for randomAccess. An upcoming patch requires adding another option, so instead of breaking the API for that, then breaking it again one day if we add more options, let's just break it once. Any boolean options we add in the future will just make use of an unused bit in the flags. Any extensions making use of tuplesorts will need to update their code to pass TUPLESORT_RANDOMACCESS instead of true for randomAccess. TUPLESORT_NONE can be used for a set of empty options. Author: David Rowley Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com	2022-04-04 22:24:59 +12:00
David Rowley	1b0d9aa4f7	Improve the generation memory allocator Here we make a series of improvements to the generation memory allocator, namely: 1. Allow generation contexts to have a minimum, initial and maximum block sizes. The standard allocator allows this already but when the generation context was added, it only allowed fixed-sized blocks. The problem with fixed-sized blocks is that it's difficult to choose how large to make the blocks. If the chosen size is too small then we'd end up with a large number of blocks and a large number of malloc calls. If the block size is made too large, then memory is wasted. 2. Add support for "keeper" blocks. This is a special block that is allocated along with the context itself but is never freed. Instead, when the last chunk in the keeper block is freed, we simply mark the block as empty to allow new allocations to make use of it. 3. Add facility to "recycle" newly empty blocks instead of freeing them and having to later malloc an entire new block again. We do this by recording a single GenerationBlock which has become empty of any chunks. When we run out of space in the current block, we check to see if there is a "freeblock" and use that if it contains enough space for the allocation. Author: David Rowley, Tomas Vondra Reviewed-by: Andy Fan Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0ab8@enterprisedb.com	2022-04-04 20:53:13 +12:00
Peter Geoghegan	f3c15cbe50	Generalize how VACUUM skips all-frozen pages. Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to aggressive VACUUMs) around advancing relfrozenxid and relminmxid before now. The issue only came up when concurrent activity unset some heap page's visibility map bit right as VACUUM was considering if the page should get counted in frozenskipped_pages. The non-aggressive case would recheck the all-frozen bit at this point. The aggressive case reasoned that the page (a skippable page) must have at least been all-frozen in the recent past, so skipping it won't make relfrozenxid advancement unsafe (which is never okay for aggressive VACUUMs). The recheck created a window for some other backend to confuse matters for VACUUM. If the page's VM bit turned out to be unset, VACUUM would conclude that the page was _never_ all-frozen. frozenskipped_pages was not incremented, and yet VACUUM couldn't back out of skipping at this late stage (it couldn't choose to scan the page instead). This made it unsafe to advance relfrozenxid later on. Consistently avoid the issue by generalizing how we skip frozen pages during aggressive VACUUMs: take the same approach when skipping any skippable page range during aggressive and non-aggressive VACUUMs alike. The new approach makes ranges (not individual pages) the fundamental unit of skipping using the visibility map. frozenskipped_pages is replaced with a boolean flag that represents whether some skippable range with one or more all-visible pages was actually skipped. It is safe for VACUUM to treat a page as all-frozen provided it at least had its all-frozen bit set after the OldestXmin cutoff was established. VACUUM is only required to scan pages that might have XIDs < OldestXmin (unfrozen XIDs) to be able to safely advance relfrozenxid. Tuples concurrently inserted on "skipped" pages can be thought of as equivalent to tuples concurrently inserted on a block >= rel_pages. It's possible that the issue this commit fixes hardly ever came up in practice. But we only had to be unlucky once to lose out on advancing relfrozenxid -- a single affected heap page was enough to throw VACUUM off. That seems like something to avoid on general principle. This is similar to an issue fixed by commit `44fa8488`, which taught vacuumlazy.c to not give up on non-aggressive relfrozenxid advancement just because a cleanup lock wasn't immediately available on some heap page. Skipping an all-visible range is now explicitly structured as a choice made by non-aggressive VACUUMs, by weighing known costs (scanning extra skippable pages to freeze their tuples early) against known benefits (advancing relfrozenxid early). This works in essentially the same way as it always has (don't skip ranges < SKIP_PAGES_THRESHOLD). We could do much better here in the future by considering other relevant factors. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com Discussion: https://postgr.es/m/CA%2BTgmoZiSOY6H7aadw5ZZGm7zYmfDzL6nwmL5V7GL4HgJgLF_w%40mail.gmail.com	2022-04-03 13:35:43 -07:00
Peter Geoghegan	0b018fabaa	Set relfrozenxid to oldest extant XID seen by VACUUM. When VACUUM set relfrozenxid before now, it set it to whatever value was used to determine which tuples to freeze -- the FreezeLimit cutoff. This approach was very naive. The relfrozenxid invariant only requires that new relfrozenxid values be <= the oldest extant XID remaining in the table (at the point that the VACUUM operation ends), which in general might be much more recent than FreezeLimit. VACUUM now carefully tracks the oldest remaining XID/MultiXactId as it goes (the oldest remaining values _after_ lazy_scan_prune processing). The final values are set as the table's new relfrozenxid and new relminmxid in pg_class at the end of each VACUUM. The oldest XID might come from a tuple's xmin, xmax, or xvac fields. It might even come from one of the table's remaining MultiXacts. Final relfrozenxid values must still be >= FreezeLimit in an aggressive VACUUM (FreezeLimit still acts as a lower bound on the final value that aggressive VACUUM can set relfrozenxid to). Since standard VACUUMs still make no guarantees about advancing relfrozenxid, they might as well set relfrozenxid to a value from well before FreezeLimit when the opportunity presents itself. In general standard VACUUMs may now set relfrozenxid to any value > the original relfrozenxid and <= OldestXmin. Credit for the general idea of using the oldest extant XID to set pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com	2022-04-03 09:57:21 -07:00
Peter Geoghegan	14bf1e8313	vacuumlazy.c: Clean up variable declarations. Move some of the heap_vacuum_rel() instrumentation related variables to the scope where they're actually needed. Also reorder some of the variable declarations at the start of heap_vacuum_rel() so that related variables appear together.	2022-04-02 10:33:21 -07:00
John Naylor	6974924347	Specialize tuplesort routines for different kinds of abbreviated keys Previously, the specialized tuplesort routine inlined handling for reverse-sort and NULLs-ordering but called the datum comparator via a pointer in the SortSupport struct parameter. Testing has showed that we can get a useful performance gain by specializing datum comparison for the different representations of abbreviated keys -- signed and unsigned 64-bit integers and signed 32-bit integers. Almost all abbreviatable data types will benefit -- the only exception for now is numeric, since the datum comparison is more complex. The performance gain depends on data type and input distribution, but often falls in the range of 10-20% faster. Thomas Munro Reviewed by Peter Geoghegan, review and performance testing by me Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKKYttZZk-JMRQSVak%3DCXSJ5fiwtirFf%3Dn%3DPAbumvn1Ww%40mail.gmail.com	2022-04-02 15:22:25 +07:00
Michael Paquier	d16773cdc8	Add macros in hash and btree AMs to get the special area of their pages This makes the code more consistent with SpGiST, GiST and GIN, that already use this style, and the idea is to make easier the introduction of more sanity checks for each of these AM-specific macros. BRIN uses a different set of macros to get a page's type and flags, so it has no need for something similar. Author: Matthias van de Meent Discussion: https://postgr.es/m/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com	2022-04-01 13:24:50 +09:00
Robert Haas	9c08aea6a3	Add new block-by-block strategy for CREATE DATABASE. Because this strategy logs changes on a block-by-block basis, it avoids the need to checkpoint before and after the operation. However, because it logs each changed block individually, it might generate a lot of extra write-ahead logging if the template database is large. Therefore, the older strategy remains available via a new STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy option to createdb. Somewhat controversially, this patch assembles the list of relations to be copied to the new database by reading the pg_class relation of the template database. Cross-database access like this isn't normally possible, but it can be made to work here because there can't be any connections to the database being copied, nor can it contain any in-doubt transactions. Even so, we have to use lower-level interfaces than normal, since the table scan and relcache interfaces will not work for a database to which we're not connected. The advantage of this approach is that we do not need to rely on the filesystem to determine what ought to be copied, but instead on PostgreSQL's own knowledge of the database structure. This avoids, for example, copying stray files that happen to be located in the source database directory. Dilip Kumar, with a fairly large number of cosmetic changes by me. Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor, Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian, Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others. Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com	2022-03-29 11:48:36 -04:00
Alvaro Herrera	bf902c1393	Revert "Fix replay of create database records on standby" This reverts commit `49d9cfc68b`. The approach taken by this patch has problems, so we'll come up with a radically different fix. Discussion: https://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com	2022-03-29 15:36:21 +02:00
Alvaro Herrera	49d9cfc68b	Fix replay of create database records on standby Crash recovery on standby may encounter missing directories when replaying create database WAL records. Prior to this patch, the standby would fail to recover in such a case. However, the directories could be legitimately missing. Consider a sequence of WAL records as follows: CREATE DATABASE DROP DATABASE DROP TABLESPACE If, after replaying the last WAL record and removing the tablespace directory, the standby crashes and has to replay the create database record again, the crash recovery must be able to move on. This patch adds a mechanism similar to invalid-page tracking, to keep a tally of missing directories during crash recovery. If all the missing directory references are matched with corresponding drop records at the end of crash recovery, the standby can safely continue following the primary. Backpatch to 13, at least for now. The bug is older, but fixing it in older branches requires more careful study of the interactions with commit `e6d8069522`, which appeared in 13. A new TAP test file is added to verify the condition. However, because it depends on commit `d6d317dbf6`, it can only be added to branch master. I (Álvaro) manually verified that the code behaves as expected in branch 14. It's a bit nervous-making to leave the code uncovered by tests in older branches, but leaving the bug unfixed is even worse. Also, the main reason this fix took so long is precisely that we couldn't agree on a good strategy to approach testing for the bug, so perhaps this is the best we can do. Diagnosed-by: Paul Guo <paulguo@gmail.com> Author: Paul Guo <paulguo@gmail.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Asim R Praveen <apraveen@pivotal.io> Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com	2022-03-25 13:16:21 +01:00
Robert Haas	412ad7a556	Fix possible recovery trouble if TRUNCATE overlaps a checkpoint. If TRUNCATE causes some buffers to be invalidated and thus the checkpoint does not flush them, TRUNCATE must also ensure that the corresponding files are truncated on disk. Otherwise, a replay from the checkpoint might find that the buffers exist but have the wrong contents, which may cause replay to fail. Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design suggestion from Heikki Linnakangas, with some changes to the comments by me. Review of this and a prior patch that approached the issue differently by Heikki Linnakangas, Andres Freund, Álvaro Herrera, Masahiko Sawada, and Tom Lane. Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com	2022-03-24 14:52:28 -04:00
Alvaro Herrera	e27f4ee0a7	Change fastgetattr and heap_getattr to inline functions They were macros previously, but recent callsite additions made Coverity complain about one of the assertions being always true. This change could have been made a long time ago, but the Coverity complain broke the inertia. Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Discussion: https://postgr.es/m/202203241021.uts52sczx3al@alvherre.pgsql	2022-03-24 18:02:27 +01:00
Alvaro Herrera	9d92582abf	Fix "missing continuation record" after standby promotion Invalidate abortedRecPtr and missingContrecPtr after a missing continuation record is successfully skipped on a standby. This fixes a PANIC caused when a recently promoted standby attempts to write an OVERWRITE_RECORD with an LSN of the previously read aborted record. Backpatch to 10 (all stable versions). Author: Sami Imseih <simseih@amazon.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/44D259DE-7542-49C4-8A52-2AB01534DCA9@amazon.com	2022-03-23 18:22:10 +01:00
Dean Rasheed	7faa5fc84b	Add support for security invoker views. A security invoker view checks permissions for accessing its underlying base relations using the privileges of the user of the view, rather than the privileges of the view owner. Additionally, if any of the base relations are tables with RLS enabled, the policies of the user of the view are applied, rather than those of the view owner. This allows views to be defined without giving away additional privileges on the underlying base relations, and matches a similar feature available in other database systems. It also allows views to operate more naturally with RLS, without affecting the assignments of policies to users. Christoph Heiss, with some additional hacking by me. Reviewed by Laurenz Albe and Wolfgang Walther. Discussion: https://postgr.es/m/b66dd6d6-ad3e-c6f2-8b90-47be773da240%40cybertec.at	2022-03-22 10:28:10 +00:00
Tom Lane	1f8bc44868	Remove workarounds for avoiding [U]INT64_FORMAT in translatable strings. Further code simplification along the same lines as `d914eb347` and earlier patches. Aleksander Alekseev, Japin Li Discussion: https://postgr.es/m/CAJ7c6TMSKi3Xs8h5MP38XOnQQpBLazJvVxVfPn++roitDJcR7g@mail.gmail.com	2022-03-21 11:11:55 -04:00
Andres Freund	bff258a273	pgstat: rename pgstat_initstats() to pgstat_relation_init(). The old name was overly generic. An upcoming commit moves relation stats handling into its own file, making pgstat_initstats() look even more out of place. Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de	2022-03-20 19:12:09 -07:00
Thomas Munro	3f1ce97346	Add circular WAL decoding buffer, take II. Teach xlogreader.c to decode the WAL into a circular buffer. This will support optimizations based on looking ahead, to follow in a later commit. * XLogReadRecord() works as before, decoding records one by one, and allowing them to be examined via the traditional XLogRecGetXXX() macros and certain traditional members like xlogreader->ReadRecPtr. * An alternative new interface XLogReadAhead()/XLogNextRecord() is added that returns pointers to DecodedXLogRecord objects so that it's now possible to look ahead in the WAL stream while replaying. * In order to be able to use the new interface effectively while streaming data, support is added for the page_read() callback to respond to a new nonblocking mode with XLREAD_WOULDBLOCK instead of waiting for more data to arrive. No direct user of the new interface is included in this commit, though XLogReadRecord() uses it internally. Existing code doesn't need to change, except in a few places where it was accessing reader internals directly and now needs to go through accessor macros. Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com	2022-03-18 18:45:47 +13:00
Thomas Munro	46d9bfb0a6	Fix race between DROP TABLESPACE and checkpointing. Commands like ALTER TABLE SET TABLESPACE may leave files for the next checkpoint to clean up. If such files are not removed by the time DROP TABLESPACE is called, we request a checkpoint so that they are deleted. However, there is presently a window before checkpoint start where new unlink requests won't be scheduled until the following checkpoint. This means that the checkpoint forced by DROP TABLESPACE might not remove the files we expect it to remove, and the following ERROR will be emitted: ERROR: tablespace "mytblspc" is not empty To fix, add a call to AbsorbSyncRequests() just before advancing the unlink cycle counter. This ensures that any unlink requests forwarded prior to checkpoint start (i.e., when ckpt_started is incremented) will be processed by the current checkpoint. Since AbsorbSyncRequests() performs memory allocations, it cannot be called within a critical section, so we also need to move SyncPreCheckpoint() to before CreateCheckPoint()'s critical section. This is an old bug, so back-patch to all supported versions. Author: Nathan Bossart <nathandbossart@gmail.com> Reported-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20220215235845.GA2665318%40nathanxps13	2022-03-16 17:20:24 +13:00
Michael Paquier	6bdf1a1400	Fix collection of typos in the code and the documentation Some words were duplicated while other places were grammatically incorrect, including one variable name in the code. Author: Otto Kekalainen, Justin Pryzby Discussion: https://postgr.es/m/7DDBEFC5-09B6-4325-B942-B563D1A24BDC@amazon.com	2022-03-15 11:29:35 +09:00
Thomas Munro	c6f2f01611	Fix pg_basebackup with in-place tablespaces. Previously, pg_basebackup from a cluster that contained an 'in-place' tablespace, as introduced by commit `7170f215`, would produce a harmless warning on Unix and fail completely on Windows. Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/20220304.165449.1200020258723305904.horikyota.ntt%40gmail.com	2022-03-15 14:01:23 +13:00
Peter Geoghegan	6e20f4600a	VACUUM VERBOSE: tweak scanned_pages logic. Commit `872770fd6c` taught VACUUM VERBOSE and autovacuum logging to display the total number of pages scanned by VACUUM. This information was also displayed as a percentage of rel_pages in parenthesis, which makes it easy to spot trends over time and across tables. The instrumentation displayed "0 scanned (0.00% of total)" for totally empty tables. Tweak the instrumentation: have it show "0 scanned (100.00% of total)" for empty tables instead. This approach is clearer and more consistent.	2022-03-13 13:07:49 -07:00
Peter Geoghegan	e370f100f0	vacuumlazy.c: Standardize rel_pages terminology. VACUUM's rel_pages field indicates the size of the target heap rel just after the table_relation_vacuum() operation began. There are specific expectations around how rel_pages can be related to other nearby state. In particular, the range of rel_pages must contain every tuple in the relation whose tuple headers might contain an XID < OldestXmin. Consistently refer to the field as rel_pages to make this clearer and more discoverable. This is follow-up work to commit `73f6ec3d` from earlier today. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20220311031351.sbge5m2bpvy2ttxg@alap3.anarazel.de	2022-03-12 13:20:45 -08:00
Peter Geoghegan	73f6ec3d3c	vacuumlazy.c: document vistest and OldestXmin. Explain the relationship between vacuumlazy.c's vistest and OldestXmin cutoffs. These closely related cutoffs are different in subtle but important ways. Also document a closely related rule: we must establish rel_pages _after_ OldestXmin to ensure that no XID < OldestXmin can be missed by lazy_scan_heap(). It's easier to explain these issues by initializing everything together, so consolidate initialization of vacrel state. Now almost every vacrel field is initialized by heap_vacuum_rel(). The only remaining exception is the dead_items array, which is still managed by lazy_scan_heap() due to interactions with how we initialize parallel VACUUM. Also move the process that updates pg_class entries for each index into heap_vacuum_rel(), and adjust related assertions. All pg_class updates now take place after lazy_scan_heap() returns, which seems clearer. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20211211045710.ljtuu4gfloh754rs@alap3.anarazel.de Discussion: https://postgr.es/m/CAH2-WznYsUxVT156rCQ+q=YD4S4=1M37hWvvHLz-H1pwSM8-Ew@mail.gmail.com	2022-03-12 12:52:38 -08:00
Peter Geoghegan	5b68f75e12	Normalize heap_prepare_freeze_tuple argument name. We called the argument totally_frozen in its function prototype as well as in code comments, even though totally_frozen_p was used in the function definition. Standardize on totally_frozen.	2022-03-11 19:30:21 -08:00
Michael Paquier	e9537321a7	Add support for zstd with compression of full-page writes in WAL. wal_compression gains a new value, "zstd", to allow the compression of full-page images using the compression method of the same name. Compression is done using the default level recommended by the library, as of ZSTD_CLEVEL_DEFAULT = 3. Some benchmarking has shown that it could make sense to use a lower level for the FPI compression, like 1 or 2, as the compression rate did not change much with a bit less CPU consumed, but any tests done would only cover a few scenarios so it is hard to come to a clear conclusion. Anyway, there is no reason to not use the default level instead, which is the level recommended by the library so it should be fine for most cases. zstd easily outclasses pglz, and is better than LZ4 where one wants to have more compression at the cost of extra CPU but both are good enough in their own scenarios, so the choice between one or the other of these comes to a study of the workload patterns and the schema involved, mainly. This commit relies heavily on `4035cd5`, that reshaped the code creating and restoring full-page writes to be aware of the compression type, making this integration straight-forward. This patch borrows some early work from Andrey Borodin, though the patch got a complete rewrite. Author: Justin Pryzby Discussion: https://postgr.es/m/20220222231948.GJ9008@telsasoft.com	2022-03-11 12:18:53 +09:00
Michael Paquier	0071fc7127	Fix header inclusion order in xloginsert.c with lz4.h. Per project policy, all system and library headers need to be declared in the backend code after "postgres.h" and before the internal headers, but `4035cd5` broke this policy when adding support for LZ4 in wal_compression. Noticed while reviewing the patch to add support for zstd in this area. This only impacts HEAD, so there is no need for a back-patch.	2022-03-11 10:59:47 +09:00
Tom Lane	46ab07ffda	Clean up assorted failures under clang's -fsanitize=undefined checks. Most of these are cases where we could call memcpy() or other libc functions with a NULL pointer and a zero count, which is forbidden by POSIX even though every production version of libc allows it. We've fixed such things before in a piecemeal way, but apparently never made an effort to try to get them all. I don't claim that this patch does so either, but it gets every failure I observe in check-world, using clang 12.0.1 on current RHEL8. numeric.c has a different issue that the sanitizer doesn't like: "ln(-1.0)" will compute log10(0) and then try to assign the resulting -Inf to an integer variable. We don't actually use the result in such a case, so there's no live bug. Back-patch to all supported branches, with the idea that we might start running a buildfarm member that tests this case. This includes back-patching `c1132aae3` (Check the size in COPY_POINTER_FIELD), which previously silenced some of these issues in copyfuncs.c. Discussion: https://postgr.es/m/CALNJ-vT9r0DSsAOw9OXVJFxLENoVS_68kJ5x0p44atoYH+H4dg@mail.gmail.com	2022-03-03 18:13:24 -05:00
Michael Paquier	62ce0c758d	Fix catalog data of pg_stop_backup(), labelled v2. This function has been incorrectly marked as a set-returning function with prorows (estimated number of rows) set to 1 since its creation in `7117685`, that introduced non-exclusive backups. There is no need for that as the function is designed to return only one tuple. This commit fixes the catalog definition of pg_stop_backup_v2() so that it is not marked as proretset anymore, with prorows set to 0. This simplifies its internals by removing one tuplestore (used for one single record anyway) and by removing all the checks related to a set-returning function. Issue found during my quest to simplify some of the logic used in in-core system functions. Bump catalog version. Reviewed-by: Aleksander Alekseev, Kyotaro Horiguchi Discussion: https://postgr.es/m/Yh8guT78f1Ercfzw@paquier.xyz	2022-03-03 10:51:57 +09:00
Tom Lane	12d768e704	Don't use static storage for SaveTransactionCharacteristics(). This is pretty queasy-making on general principles, and the more so once you notice that CommitTransactionCommand() is actually stomping on the values saved by _SPI_commit(). It's okay as long as the active values didn't change during HoldPinnedPortals(); but that's a larger assumption than I think we want to make, especially since the fix is so simple. Discussion: https://postgr.es/m/1533956.1645731245@sss.pgh.pa.us	2022-02-28 12:54:12 -05:00
Peter Geoghegan	73c61a50a1	vacuumlazy.c: Remove obsolete num_tuples field. Commit `49c9d9fc` unified VACUUM VERBOSE and autovacuum logging. It neglected to remove an old vacrel field that was only used by the old VACUUM VERBOSE, so remove it now. The previous num_tuples approach doesn't seem to have any real advantage over the approach VACUUM VERBOSE takes now (also the approach used by the autovacuum logging code), which is to show new_rel_tuples. new_rel_tuples is the possibly-estimated total number of tuples left in the table, whereas num_tuples meant the number of tuples encountered during the VACUUM operation, after pruning, without regard for tuples from pages skipped via the visibility map. In passing, reorder a related vacrel field for consistency.	2022-02-24 19:01:54 -08:00
Peter Geoghegan	cf879d3069	Remove unnecessary heap_tuple_needs_freeze argument. The buffer argument hasn't been used since the function was first added by commit `bbb6e559c4`. The sibling heap_prepare_freeze_tuple function doesn't have such an argument either. Remove it.	2022-02-24 18:31:07 -08:00
Heikki Linnakangas	6c46e8a5df	Fix data loss on crash after sorted GiST index build. If a checkpoint happens during sorted GiST index build, and the system crashes after the checkpoint and after the index build has finished, the data written to the index before the checkpoint started could be lost. The checkpoint won't fsync it, and it won't be replayed at crash recovery either. Fix by calling smgrimmedsync() after the index build, just like in B-tree index build. Backpatch to v14 where the sorted GiST index build was introduced. Reported-by: Melanie Plageman Discussion: https://www.postgresql.org/message-id/CAAKRu_ZJJynimxKj5xYBSziL62-iEtPE+fx-B=JzR=jUtP92mw@mail.gmail.com	2022-02-24 16:15:12 +02:00
Andres Freund	2776922201	Assert in init_toast_snapshot() that some snapshot registered or active. Commit <FIXME> fixed the bug that RemoveTempRelationsCallback() did not push/register a snapshot. That only went unnoticed because often a valid catalog snapshot exists and is returned by GetOldestSnapshot(). But due to invalidation processing that is not reliable. Thus assert in init_toast_snapshot() that there is a registered or active snapshot, using the new HaveRegisteredOrActiveSnapshot(). Author: Andres Freund Discussion: https://postgr.es/m/20220219180002.6tubjq7iw7m52bgd@alap3.anarazel.de	2022-02-21 08:58:29 -08:00
Heikki Linnakangas	69639e2b5c	Fix uninitialized variable. I'm very surprised the compiler didn't warn about it. But Coverity and Valgrind did.	2022-02-20 18:33:50 +02:00
Michael Paquier	d61a361d1a	Remove all traces of tuplestore_donestoring() in the C code. This routine is a no-op since `dd04e95` from 2003, with a macro kept around for compatibility purposes. This has led to the same code patterns being copy-pasted around for no effect, sometimes in confusing ways like in pg_logical_slot_get_changes_guts() from logical.c where the code was actually incorrect. This issue has been discussed on two different threads recently, so rather than living with this legacy, remove any uses of this routine in the C code to simplify things. The compatibility macro is kept to avoid breaking any out-of-core modules that depend on it. Reported-by: Tatsuhito Kasahara, Justin Pryzby Author: Tatsuhito Kasahara Discussion: https://postgr.es/m/20211217200419.GQ17618@telsasoft.com Discussion: https://postgr.es/m/CAP0=ZVJeeYfAeRfmzqAF2Lumdiv4S4FewyBnZd4DPTrsSQKJKw@mail.gmail.com	2022-02-17 09:52:02 +09:00
Heikki Linnakangas	4620892344	Fix bogus log message when starting from a cleanly shut down state. In commit `70e81861fa` to split xlog.c, I moved the startup code that updates the state in the control file and prints out the "database system was not properly shut down" message to the log, but I accidentally removed the "if (InRecovery)" check around it. As a result, that message was printed even if the system was cleanly shut down, also during 'initdb'. Discussion: https://www.postgresql.org/message-id/3357075.1645031062@sss.pgh.pa.us	2022-02-16 23:15:08 +02:00
Heikki Linnakangas	9ed87a78e0	Fix read beyond buffer bug introduced by the split xlog.c patch. FinishWalRecovery() copied the valid part of the last WAL block into a palloc'd buffer, and the code in StartupXLOG() copied it to the WAL buffer. But the memcpy in StartupXLOG() copied a full 8kB block, not just the valid part, i.e. it copied from beyond the end of the buffer. The invalid part was cleared immediately afterwards, so as long as the memory was allocated and didn't segfault, it didn't do any harm, but it can definitely segfault. Discussion: https://www.postgresql.org/message-id/efc12e32-5af2-3485-5b1d-5af9f707491a@iki.fi	2022-02-16 12:01:32 +02:00
Heikki Linnakangas	70e81861fa	Split xlog.c into xlog.c and xlogrecovery.c. This moves the functions related to performing WAL recovery into the new xlogrecovery.c source file, leaving xlog.c responsible for maintaining the WAL buffers, coordinating the startup and switch from recovery to normal operations, and other miscellaneous stuff that have always been in xlog.c. Reviewed-by: Andres Freund, Kyotaro Horiguchi, Robert Haas Discussion: https://www.postgresql.org/message-id/a31f27b4-a31d-f976-6217-2b03be646ffa%40iki.fi	2022-02-16 09:30:38 +02:00
Heikki Linnakangas	be1c00ab13	Move code around in StartupXLOG(). This is in preparation for the next commit, which will split off recovery-related code from xlog.c into a new source file. This is the order that things will happen with the next commit, and the point of this commit is to make these ordering changes more explicit, while the next commit mechanically moves the source code to the new file. To aid review, I added "BEGIN/END function" comments to mark which blocks of code are moved to which functions in the next commit. They will be gone in the next commit. Reviewed-by: Andres Freund, Kyotaro Horiguchi, Robert Haas Discussion: https://www.postgresql.org/message-id/a31f27b4-a31d-f976-6217-2b03be646ffa%40iki.fi	2022-02-16 09:22:44 +02:00
Heikki Linnakangas	b3a5d01c05	Refactor setting XLP_FIRST_IS_OVERWRITE_CONTRECORD. Set it directly in CreateOverwriteContrecordRecord(). That way, AdvanceXLInsertBuffer() doesn't need the missingContrecPtr global variable. This is in preparation for splitting xlog.c into multiple files. Reviewed-by: Robert Haas Discussion: https://www.postgresql.org/message-id/a462d79c-cb5a-47cc-e9ac-616b5003965f%40iki.fi	2022-02-16 09:22:41 +02:00
Heikki Linnakangas	d231be00cb	Run pgindent on xlog.c. To tidy up after some recent refactorings in xlog.c. These would be fixed by the pgindent run we do at the end of the development cycle, but I want to clean these up now as I'm about to do some more big refactorings on xlog.c.	2022-02-16 09:22:34 +02:00
Peter Geoghegan	988ffc3063	Update "don't truncate with failsafe" rationale. There is a very good (though non-obvious) reason to avoid relation truncation during a VACUUM that has triggered the failsafe mechanism, which was missed before now. Update related comments, so this isn't forgotten. Reported-By: John Naylor <john.naylor@enterprisedb.com> Discussion: https://postgr.es/m/CAFBsxsFiMPxQ-dHZ8tOgktn=+ffeJT3+GinZ4zdOGbmAnCYadA@mail.gmail.com	2022-02-15 15:16:19 -08:00
Amit Kapila	5e01001ffb	WAL log unchanged toasted replica identity key attributes. Currently, during UPDATE, the unchanged replica identity key attributes are not logged separately because they are getting logged as part of the new tuple. But if they are stored externally then the untoasted values are not getting logged as part of the new tuple and logical replication won't be able to replicate such UPDATEs. So we need to log such attributes as part of the old_key_tuple during UPDATE. Reported-by: Haiying Tang Author: Dilip Kumar and Amit Kapila Reviewed-by: Alvaro Herrera, Haiying Tang, Andres Freund Backpatch-through: 10 Discussion: https://postgr.es/m/OS0PR01MB611342D0A92D4F4BF26C0F47FB229@OS0PR01MB6113.jpnprd01.prod.outlook.com	2022-02-14 08:55:58 +05:30
Michael Paquier	c963e84fb8	Make origin data initialization consistent with other fields in 2PC header. As of `1eb6d65`, the origin data is optionally stored in a 2PC file header, with the data filled in EndPrepare() even in the default case where there is no origin data to add. This was inconsistent with all the other fields of TwoPhaseFileHeader which are initialized in StartPrepare(), so move the initialization of origin_lsn and origin_timestamp there instead. The effect of missing the initialization at this early stage is only cosmetic based on the current logic of the code, but could have led to issues in the long-term, and it is more consistent done this way. Reported-by: Ranier Vilela Discussion: https://postgr.es/m/CAEudQAooECJ+gU_RZB-yhioPOV94R4ucoHAf68PiJhLpgpVpBw@mail.gmail.com	2022-02-14 09:30:35 +09:00
Tom Lane	302612a6c7	Silence minor compiler warnings. Depending on compiler version and optimization level, we might get a complaint that lazy_scan_heap's "freespace" is used uninitialized. Compilers not aware that ereport(ERROR) doesn't return complained about bbsink_lz4_new(). Assigning "-1" to a uint64 value has unportable results; fortunately, the value of xlogreadsegno is unimportant when xlogreadfd is -1. (It looks to me like there is no need for xlogreadsegno to be static in the first place, but I didn't venture to change that.)	2022-02-13 13:06:55 -05:00
Peter Geoghegan	efa4a9462a	Consolidate VACUUM xid cutoff logic. Push the logic for determining whether or not a VACUUM operation will be aggressive down into vacuum_set_xid_limits(). This makes the function's signature significantly simpler, and seems clearer overall. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com	2022-02-11 18:26:15 -08:00
Peter Geoghegan	872770fd6c	Add VACUUM instrumentation for scanned pages, relfrozenxid. Report on scanned pages within VACUUM VERBOSE and autovacuum logging. These are pages that were physically examined during the VACUUM operation. Note that this can include a small number of pages that were marked all-visible in the visibility map by some earlier VACUUM operation. VACUUM won't skip all-visible pages that aren't part of a range of all-visible pages that's at least 32 blocks in length (partly to avoid missing out on opportunities to advance relfrozenxid during non-aggressive VACUUMs). Commit `44fa8488` simplified the definition of scanned pages. It became the complement of the pages (of those pages from rel_pages) that were skipped using the visibility map. And so scanned pages precisely indicates how effective the visibility map was at saving work. (Before now we displayed the number of pages skipped via the visibility map when they happened to be frozen pages, but not when they were merely all-visible, which was less useful to users.) Rename the user-visible OldestXmin output field to "removal cutoff", and show some supplementary information: how far behind the cutoff is (number of XIDs behind) by the time the VACUUM operation finished. This will help users to figure out what's _not_ working in extreme cases where VACUUM is fundamentally unable to remove dead tuples or freeze older tuples (e.g., due to a leaked replication slot). Also report when relfrozenxid is advanced by VACUUM in output that immediately follows "removal cutoff". This structure is intended to highlight the relationship between the new relfrozenxid value for the table, and the VACUUM operation's removal cutoff. Finally, add instrumentation of "missed dead tuples", and the number of pages that had at least one such tuple. These are fully DEAD (not just RECENTLY_DEAD) tuples with storage that could not be pruned due to failure to acquire a cleanup lock on a heap page. 
This is a replacement for the "skipped due to pin" instrumentation removed by commit `44fa8488`. It shows more details than before for pages where failing to get a cleanup lock actually resulted in VACUUM missing out on useful work, but usually shows nothing at all instead (the mere fact that we couldn't get a cleanup lock is usually of no consequence whatsoever now). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wznp=c=Opj8Z7RMR3G=ec3_JfGYMN_YvmCEjoPCHzWbx0g@mail.gmail.com	2022-02-11 16:48:40 -08:00
Peter Geoghegan	44fa84881f	Simplify lazy_scan_heap's handling of scanned pages. Redefine a scanned page as any heap page that actually gets pinned by VACUUM's first pass over the heap, regardless of whether or not the page was cleanup locked. Although it's fundamentally impossible to prune a heap page without a cleanup lock (since we cannot safely defragment the page), we can do just about everything else. The only notable further exception is freezing tuples, though even that is arguably a consequence of not being able to prune (not a separate issue). VACUUM now does as much of the same processing as possible for pages that could not be cleanup locked. Any failure to do specific required processing is treated as a special case exception, which will be rare in practice. We now collect any preexisting LP_DEAD items (left behind by earlier opportunistic pruning) in the dead_items array for these heap pages, and count their tuples in the usual way. Steps used to decide if we'll attempt relation truncation are performed in the usual way for no-cleanup-lock scanned pages, too. Although eliminating these special cases is intrinsically useful, it's even more useful as an enabler of further simplifications. The only essential difference between aggressive and non-aggressive is that only aggressive is _guaranteed_ to be able to advance relfrozenxid up to FreezeLimit. Advancing relfrozenxid is always useful, but before now non-aggressive VACUUMs threw away the opportunity to do so whenever a cleanup lock could not be acquired on any page, no matter what the details were. This was very pessimistic. It isn't actually necessary to "behave aggressively" to maintain the ability to advance relfrozenxid when a cleanup lock isn't immediately available (most of the time). The non-aggressive case will now make sure that it isn't safe to advance relfrozenxid (without waiting) using only a share lock. 
It will usually notice that there are no tuples that need to be frozen anyway, just like in the aggressive case -- and so it no longer wastes an opportunity to advance relfrozenxid over nothing. (The non-aggressive case still won't wait for a cleanup lock when there really are tuples on the page that need to be frozen, since that really would amount to "behaving aggressively".) VACUUM currently has a tendency to set heap pages to all-visible in the visibility map before it freezes all of the tuples on the page. Only a subsequent aggressive VACUUM will visit these pages to freeze their tuples, usually only when the tuple XIDs are much older than the vacuum_freeze_min_age GUC (FreezeLimit cutoff) is supposed to allow. And so non-aggressive VACUUMs are still far less likely to be able to advance relfrozenxid in practice, even with the enhancements from this commit. This remaining issue will be addressed by future work that overhauls the criteria for freezing tuples. Once that's in place, almost every VACUUM operation will be able to advance relfrozenxid in practice. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-Wznp=c=Opj8Z7RMR3G=ec3_JfGYMN_YvmCEjoPCHzWbx0g@mail.gmail.com	2022-02-11 14:32:17 -08:00
Michael Paquier	0147fc7c8c	Fix typo in multixact.c. Introduced in `aa64f23`. Author: Nathan Bossart Discussion: https://postgr.es/m/20220209175338.GB1627503@nathanxps13	2022-02-10 10:45:14 +09:00
Robert Haas	aa64f23b02	Remove MaxBackends variable in favor of GetMaxBackends() function. Previously, it was really easy to write code that accessed MaxBackends before we'd actually initialized it, especially when coding up an extension. To make this less error-prone, introduce a new function GetMaxBackends() which should be used to obtain the correct value. This will ERROR if called too early. Demote the global variable to a file-level static, so that nobody can peek at it directly. Nathan Bossart. Idea by Andres Freund. Review by Greg Sabino Mullane, by Michael Paquier (who had doubts about the approach), and by me. Discussion: http://postgr.es/m/20210802224204.bckcikl45uezv5e4@alap3.anarazel.de	2022-02-08 15:53:19 -05:00
Alexander Korotkov	f1ea98a797	Reduce non-leaf keys overlap in GiST indexes produced by a sorted build. The GiST sorted build currently chooses split points according only to page space utilization. That may lead to higher non-leaf keys overlap and, in turn, slower search query answers. This commit makes the sorted build use the opclass's picksplit method. Once four pages at the level are accumulated, the picksplit method is applied until each split partition fits the page. Some of our split algorithms could show significant performance degradation while processing four times more data at once. But those opclasses haven't received the sorted build support and shouldn't receive it before their split algorithms are improved. Discussion: https://postgr.es/m/CAHqSB9jqtS94e9%3D0vxqQX5dxQA89N95UKyz-%3DA7Y%2B_YJt%2BVW5A%40mail.gmail.com Author: Aliaksandr Kalenik, Sergei Shoulbakov, Andrey Borodin Reviewed-by: Björn Harrtell, Darafei Praliaskouski, Andres Freund Reviewed-by: Alexander Korotkov	2022-02-07 23:20:42 +03:00
Robert Haas	5ef1eefd76	Allow archiving via loadable modules. Running a shell command for each file to be archived has a lot of overhead and may not offer as much error checking as you want, or the exact semantics that you want. So, offer the option to call a loadable module for each file to be archived, rather than running a shell command. Also, add a 'basic_archive' contrib module as an example implementation that archives to a local directory. Nathan Bossart, with a little bit of kibitzing by me. Discussion: http://postgr.es/m/20220202224433.GA1036711@nathanxps13	2022-02-03 14:05:02 -05:00
Peter Eisentraut	94aa7cc5f7	Add UNIQUE null treatment option. The SQL standard has been ambiguous about whether null values in unique constraints should be considered equal or not. Different implementations have different behaviors. In the SQL:202x draft, this has been formalized by making this implementation-defined and adding an option on unique constraint definitions UNIQUE [ NULLS [NOT] DISTINCT ] to choose a behavior explicitly. This patch adds this option to PostgreSQL. The default behavior remains UNIQUE NULLS DISTINCT. Making this happen in the btree code is pretty easy; most of the patch is just to carry the flag around to all the places that need it. The CREATE UNIQUE INDEX syntax extension is not from the standard, it's my own invention. I named all the internal flags, catalog columns, etc. in the negative ("nulls not distinct") so that the default PostgreSQL behavior is the default if the flag is false. Reviewed-by: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/84e5ee1b-387e-9a54-c326-9082674bde78@enterprisedb.com	2022-02-03 11:48:21 +01:00
Alvaro Herrera	b3d7d6e462	Remove xloginsert.h from xlog.h. xlog.h is directly and indirectly #included in a lot of places. With this change, xloginsert.h is no longer unnecessarily included in the large number of them that don't need it. Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CALj2ACVe-W+WM5P44N7eG9C2_FmaeM8Dq5aCnD3fHt0Ba=WR6w@mail.gmail.com	2022-01-30 12:25:24 -03:00
Peter Geoghegan	bf42fcace5	vacuumlazy.c: Rename state field for consistency. Rename pages_removed to removed_pages, for consistency with nearby vacrel fields.	2022-01-28 17:41:09 -08:00
Michael Paquier	741bd32933	Improve errors related to incorrect TLI on checkpoint record replay. WAL replay would cause a hard crash if the timeline expected by a XLOG_END_OF_RECOVERY, a XLOG_CHECKPOINT_ONLINE, or a XLOG_CHECKPOINT_SHUTDOWN record is not the same as the timeline being replayed, using the same error message for all three of them. This commit changes those error messages to use different wordings, adapted to each record type, which is useful when it comes to the debugging of an issue in this area. Author: Amul Sul Reviewed-by: Nathan Bossart, Robert Haas Discussion: https://postgr.es/m/CAAJ_b97i1ZerYC_xW6o_AiDSW5n+sGi8k91Yc8KS8bKWKxjqwQ@mail.gmail.com	2022-01-25 13:37:19 +09:00
Michael Paquier	410aa248e5	Fix various typos, grammar and code style in comments and docs. This fixes a set of issues that have accumulated over the past months (or years) in various code areas. Most fixes are related to some recent additions, as of the development of v15. Author: Justin Pryzby Discussion: https://postgr.es/m/20220124030001.GQ23027@telsasoft.com	2022-01-25 09:40:04 +09:00
Andres Freund	1fabec7d7c	fsync pg_logical/mappings in CheckPointLogicalRewriteHeap(). While individual logical rewrite files were synced to disk, the directory was not. On some filesystems that could lead to losing directory entries after a crash. Reported-By: Tom Lane <tgl@sss.pgh.pa.us> Author: Nathan Bossart <bossartn@amazon.com> Discussion: https://postgr.es/m/867F2E29-2782-4869-970E-B984C6D35A8F@amazon.com Backpatch: 10-	2022-01-21 11:22:55 -08:00
Michael Paquier	237d1f3172	Fix one-off bug causing missing commit timestamps for subtransactions. The logic in charge of writing commit timestamps (enabled with track_commit_timestamp) for subtransactions had a one-off bug, where it would be possible that commit timestamps go missing for the last subtransaction committed. While on it, simplify a bit the iteration logic in the loop writing the commit timestamps, as per suggestions from Kyotaro Horiguchi and Tom Lane, so that some variable initializations are not part of the loop itself. Issue introduced in `73c986a`. Analyzed-by: Alex Kingsborough Author: Alex Kingsborough, Kyotaro Horiguchi Discussion: https://postgr.es/m/73A66172-4050-4F2A-B7F1-13508EDA2144@amazon.com Backpatch-through: 10	2022-01-21 14:54:04 +09:00
Peter Eisentraut	b99ccd2cb2	Call pg_newlocale_from_collation() also with default collation. Previously, callers of pg_newlocale_from_collation() did not call it if the collation was DEFAULT_COLLATION_OID and instead proceeded with a pg_locale_t of 0. Instead, now we call it anyway and have it return 0 if the default collation was passed. It already did this, so we just have to adjust the callers. This simplifies all the call sites and also makes future enhancements easier. After discussion and testing, the previous comment in pg_locale.c about avoiding this for performance reasons may have been mistaken since it was testing a very different patch version way back when. Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://www.postgresql.org/message-id/ed3baa81-7fac-7788-cc12-41e3f7917e34@enterprisedb.com	2022-01-20 09:50:18 +01:00
Jeff Davis	7a5f6b4748	Make logical decoding a part of the rmgr. Add a new rmgr method, rm_decode, and use that rather than a switch statement. In preparation for rmgr extensibility. Reviewed-by: Julien Rouhaud Discussion: https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel%40j-davis.com Discussion: https://postgr.es/m/20220118095332.6xtlcjoyxobv6cbk@jrouhaud	2022-01-19 14:58:49 -08:00
Andres Freund	c702d656a2	heap pruning: Only call BufferGetBlockNumber() once. BufferGetBlockNumber() is not that cheap and obviously cannot change during one heap_prune_page(), so only call it once. We might be able to do better and pass the block number from the caller, but that'd be a larger change... Discussion: https://postgr.es/m/20211211045710.ljtuu4gfloh754rs@alap3.anarazel.de	2022-01-17 15:35:11 -08:00
Peter Geoghegan	49c9d9fcfa	Unify VACUUM VERBOSE and autovacuum logging. The log_autovacuum_min_duration instrumentation used its own dedicated code for logging, which was not reused by VACUUM VERBOSE. This was highly duplicative, and sometimes led to each code path using slightly different accounting for essentially the same information. Clean things up by making VACUUM VERBOSE reuse the same instrumentation code. This code restructuring changes the structure of the VACUUM VERBOSE output itself, but that seems like an overall improvement. The most noticeable change in VACUUM VERBOSE output is that it no longer outputs a distinct message per index per round of index vacuuming. Most of the same information (about each index) is now shown in its new per-operation summary message. This is far more legible. A few details are no longer displayed by VACUUM VERBOSE, but that's no real loss in practice, especially in the common case where we don't need multiple index scans/rounds of vacuuming. This super fine-grained information is still available via DEBUG2 messages, which might still be useful in debugging scenarios. VACUUM VERBOSE now shows new instrumentation, which is typically very useful: all of the log_autovacuum_min_duration instrumentation that it missed out on before now. This includes information about WAL overhead, buffers hit/missed/dirtied information, and I/O timing information. VACUUM VERBOSE still retains a few INFO messages of its own. This is limited to output concerning the progress of heap rel truncation, as well as some basic information about parallel workers. These details are still potentially quite useful. They aren't a good fit for the log output, which must summarize the whole operation. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-WzmW4Me7_qR4X4ka7pxP-jGmn7=Npma_-Z-9Y1eD0MQRLw@mail.gmail.com	2022-01-14 16:50:34 -08:00
Andres Freund	bb42bfb5cc	Assert redirect pointers are sensible after heap_page_prune(). Corruption of redirect item pointers often only becomes visible well after being corrupted, as e.g. bug #17255 shows: In the original reproducer, gigabytes of WAL were between the source of the corruption and the corruption becoming visible. To make it easier to find / prevent such bugs, verify whether redirect pointers are sensible at the end of heap_page_prune_execute(). `5cd7eb1f1c` introduced related assertions while modifying the page, but they can't easily detect marking the target of an existing redirect as unused. Sometimes the corruption will be detected later, but that's harder to diagnose. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de	2022-01-13 18:14:05 -08:00
Andres Freund	18b87b201f	Fix possible HOT corruption when RECENTLY_DEAD changes to DEAD while pruning. Since `dc7420c2c9` the horizon used for pruning is determined "lazily". A more accurate horizon is built on-demand, rather than in GetSnapshotData(). If a horizon computation is triggered between two HeapTupleSatisfiesVacuum() calls for the same tuple, the result can change from RECENTLY_DEAD to DEAD. heap_page_prune() can process the same tid multiple times (once following an update chain, once "directly"). When the result of HeapTupleSatisfiesVacuum() of a tuple changes from RECENTLY_DEAD during the first access, to DEAD in the second, the "tuple is DEAD and doesn't chain to anything else" path in heap_prune_chain() can end up marking the target of a LP_REDIRECT ItemId unused. This is initially not easily visible. Once the target of a LP_REDIRECT ItemId is marked unused, a new tuple version can reuse it. At that point the corruption may become visible, as index entries pointing to the "original" redirect item now point to an unrelated tuple. To fix, compute HTSV for all tuples on a page only once. This fixes the entire class of problems of HTSV changing inside heap_page_prune(). However, visibility changes can obviously still occur between HTSV checks inside heap_page_prune() and outside (e.g. in lazy_scan_prune()). The computation of HTSV is now done in bulk, in heap_page_prune(), rather than on-demand in heap_prune_chain(). Besides being a bit simpler, it also is faster: Memory accesses can happen sequentially, rather than in the order of HOT chains. There are other causes of HeapTupleSatisfiesVacuum() results changing between two visibility checks for the same tuple, even before `dc7420c2c9`. E.g. HEAPTUPLE_INSERT_IN_PROGRESS can change to HEAPTUPLE_DEAD when a transaction aborts between the two checks. None of these other visibility status changes are known to cause corruption, but heap_page_prune()'s approach makes it hard to be confident. 
A patch implementing a more fundamental redesign of heap_page_prune(), which fixes this bug and simplifies pruning substantially, has been proposed by Peter Geoghegan in https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com However, that redesign is a larger change than desirable for backpatching. As the new design still benefits from the batched visibility determination introduced in this commit, it makes sense to commit this narrower fix to 14 and master, and then commit Peter's improvement in master. The precise sequence required to trigger the bug is complicated and hard to exercise in an isolation test (until we have wait points). Due to that the isolation test initially posted at https://postgr.es/m/20211119003623.d3jusiytzjqwb62p%40alap3.anarazel.de and updated in https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb%40alap3.anarazel.de isn't committable. A followup commit will introduce additional assertions, to detect problems like this more easily. Bug: #17255 Reported-By: Alexander Lakhin <exclusion@gmail.com> Debugged-By: Andres Freund <andres@anarazel.de> Debugged-By: Peter Geoghegan <pg@bowt.ie> Author: Andres Freund <andres@anarazel.de> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de Backpatch: 14-, the oldest branch containing `dc7420c2c9`	2022-01-13 18:13:41 -08:00
Peter Geoghegan	e9b873f667	vacuumlazy.c: fix "garbage tuples" reference. Another minor oversight in commit `4f8d9d12`.	2022-01-12 14:13:35 -08:00
Amit Kapila	dbfa1022e4	Fix typo in rewriteheap.c. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACW7SvfFW8r2uKH6oQm1kNpt8aQMG61kSBPK0S2PHhFbMw@mail.gmail.com	2022-01-11 10:50:18 +05:30
Peter Eisentraut	ee41960738	Rename functions to avoid future conflicts Rename range_serialize/range_deserialize to brin_range_serialize/brin_range_deserialize, since there are already public range_serialize/range_deserialize in rangetypes.h. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Discussion: https://www.postgresql.org/message-id/CA+renyX0ipvY6A_jUOHeB1q9mL4bEYfAZ5FBB7G7jUo5bykjrA@mail.gmail.com	2022-01-10 09:37:43 +01:00
Bruce Momjian	27b77ecf9f	Update copyright for 2022 Backpatch-through: 10	2022-01-07 19:04:57 -05:00
Tom Lane	913a03ec29	Remove redundant initialization of BrinMemTuple. brin_new_memtuple already did this, so there's no need for initialize_brin_buildstate to do it again. Richard Guo, reviewed by Bharath Rupireddy Discussion: https://postgr.es/m/CAMbWs4-kYYpKNOdiWtsCZ3jbkFFj4nhOVH22JH7dsrMYX=aGjg@mail.gmail.com	2022-01-04 16:52:51 -05:00
Alvaro Herrera	67a8cb5cbf	Fix silly mistake in Assert	2022-01-04 13:21:23 -03:00
Alvaro Herrera	f66885bec0	Allow special SKIP LOCKED condition in Assert() Under concurrency, it is possible for two sessions to be merrily locking and releasing a tuple and marking it again as HEAP_XMAX_INVALID all the while a third session attempts to lock it, miserably fails at it, and then contemplates life, the universe and everything only to eventually fail an assertion that said bit is not set. Before SKIP LOCKED that was indeed a reasonable expectation, but alas! commit `df630b0dd5` falsified it. This bug is as old as time itself, and even older, if you think time begins with the oldest supported branch. Therefore, backpatch to all supported branches. Author: Simon Riggs <simon.riggs@enterprisedb.com> Discussion: https://postgr.es/m/CANbhV-FeEwMnN8yuMyss7if1ZKjOKfjcgqB26n8pqu1e=q0ebg@mail.gmail.com	2022-01-04 13:01:05 -03:00
Peter Eisentraut	113fa3945f	Fix incorrect format placeholders	2021-12-29 10:08:41 +01:00
Amit Kapila	8e1fae1938	Move parallel vacuum code to vacuumparallel.c. This commit moves parallel vacuum related code to a new file commands/vacuumparallel.c so that any table AM supporting indexes can utilize parallel vacuum in order to call index AM callbacks (ambulkdelete and amvacuumcleanup) with parallel workers. Another reason for this refactoring is that the parallel vacuum isn't specific to heap so it doesn't make sense to keep this code in heap/vacuumlazy.c. Author: Masahiko Sawada, based on suggestion from Andres Freund Reviewed-by: Hou Zhijie, Amit Kapila, Haiying Tang Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de	2021-12-23 11:42:52 +05:30
Amit Kapila	cc8b25712b	Move index vacuum routines to vacuum.c. An upcoming patch moves parallel vacuum code out of vacuumlazy.c. This code restructuring will allow both lazy vacuum and parallel vacuum to use index vacuum functions. Author: Masahiko Sawada Reviewed-by: Hou Zhijie, Amit Kapila Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de	2021-12-22 07:55:14 +05:30
Thomas Munro	a13db0e164	Change ProcSendSignal() to take pgprocno. Instead of referring to target backends by pid, use pgprocno. This means that we don't have to scan the ProcArray and we can drop some special case code for dealing with the startup process. Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com Reviewed-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com> Reviewed-by: Ashwin Agrawal <ashwinstar@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de>	2021-12-16 15:56:03 +13:00
Amit Kapila	22bd3cbe0c	Improve parallel vacuum implementation. Previously, in parallel vacuum, we allocated shmem area of IndexBulkDeleteResult only for indexes where parallel index vacuuming is safe and had null-bitmap in shmem area to access them. This logic was too complicated with a small benefit of saving only a few bits per index. In this commit, we allocate a dedicated shmem area for the array of LVParallelIndStats that includes a parallel-safety flag, the index vacuum status, and IndexBulkDeleteResult. There is one array element for every index, even those indexes where parallel index vacuuming is unsafe or not worthwhile. This commit makes the code clearer by removing all bitmap-related code. Also, add a check of each index's vacuum status after parallel index vacuum to make sure that all indexes have been processed. Finally, rename parallel vacuum functions to parallel_vacuum_* for consistency. Author: Masahiko Sawada, based on suggestions by Andres Freund Reviewed-by: Hou Zhijie, Amit Kapila Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de	2021-12-15 07:58:19 +05:30
Michael Paquier	ece8c76192	Remove assertion for replication origins in PREPARE TRANSACTION When using replication origins, pg_replication_origin_xact_setup() is an optional choice to be able to set a LSN and a timestamp to mark the origin, which would be additionally added to WAL for transaction commits or aborts (including 2PC transactions). An assertion in the code path of PREPARE TRANSACTION assumed that this data should always be set, so it would trigger when using replication origins without setting up an origin LSN. Some tests are added to cover more of this kind of scenario. Oversight in commit `1eb6d65`. Per discussion with Amit Kapila and Masahiko Sawada. Discussion: https://postgr.es/m/YbbBfNSvMm5nIINV@paquier.xyz Backpatch-through: 11	2021-12-14 10:58:15 +09:00
Robert Haas	fa0e03c15a	Remove InitXLOGAccess(). It's not great that RecoveryInProgress() calls InitXLOGAccess(), because a status inquiry function typically shouldn't have the side effect of performing initializations. We could fix that by calling InitXLOGAccess() from some other place, but instead, let's remove it altogether. One thing InitXLOGAccess() did is initialize wal_segment_size, but it doesn't need to do that. In the postmaster, PostmasterMain() calls LocalProcessControlFile(), and all child processes will inherit that value -- except in EXEC_BACKEND builds, but then each backend runs SubPostmasterMain() which also calls LocalProcessControlFile(). The other thing InitXLOGAccess() did is update RedoRecPtr and doPageWrites, but that's not critical, because all code that uses them will just retry if it turns out that they've changed. The only difference is that most code will now see an initial value that is definitely invalid instead of one that might have just been way out of date, but that will only happen once per backend lifetime, so it shouldn't be a big deal. Patch by me, reviewed by Nathan Bossart, Michael Paquier, Andres Freund, Heikki Linnakangas, and Álvaro Herrera. Discussion: http://postgr.es/m/CA+TgmoY7b65qRjzHN_tWUk8B4sJqk1vj1d31uepVzmgPnZKeLg@mail.gmail.com	2021-12-13 09:58:36 -05:00
Robert Haas	64da07c41a	Default to log_checkpoints=on, log_autovacuum_min_duration=10m The idea here is that when a performance problem is known to have occurred at a certain point in time, it's a good thing if there is some information available from the logs to help figure out what might have happened around that time. This change attracted an above-average amount of dissent, because it means that a server with default settings will produce some amount of log output even if nothing has gone wrong. However, by my count, the mailing list discussion had about twice as many people in favor of the change as opposed. The reasons for believing that the extra log output is not an issue in practice are: (1) the rate at which messages can be generated by this setting is bounded to one every few minutes on a properly-configured system and (2) production systems tend to have a lot more junk in the log than that due to failed connection attempts, ERROR messages generated by application activity, and the like. Bharath Rupireddy, reviewed by Fujii Masao and by me. Many other people commented on the thread, but as far as I can see that was discussion of the merits of the change rather than review of the patch. Discussion: https://postgr.es/m/CALj2ACX-rW_OeDcp4gqrFUAkf1f50Fnh138dmkd0JkvCNQRKGA@mail.gmail.com	2021-12-13 09:48:48 -05:00
Michael Paquier	c8b733c4c4	Improve description of some WAL records with transaction commands This commit improves the description of some WAL records for the Transaction RMGR: - Track remote_apply for a transaction commit. This GUC is user-settable, so this information can be useful for debugging. - Add replication origin information for PREPARE TRANSACTION, with the origin ID, LSN and timestamp - Same as above, for ROLLBACK PREPARED. This impacts the format of pg_waldump or anything using these description routines, so no backpatch is done. Author: Masahiko Sawada, Michael Paquier Discussion: https://postgr.es/m/CAD21AoD2dJfgsdxk4_KciAZMZQoUiCvmV9sDpp8ZuKLtKCNXaA@mail.gmail.com	2021-12-13 11:02:47 +09:00
Michael Paquier	5d08137076	Fix some typos with {a,an} One of the changes impacts the documentation, so backpatch. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pu6+c+r3mY24VT7u+H+E_s6vMr5OdRiZ8NT3EOa-E5Lmw@mail.gmail.com Backpatch-through: 14	2021-12-09 15:20:36 +09:00
Peter Geoghegan	bcf60585e6	Standardize cleanup lock terminology. The term "super-exclusive lock" is a synonym for "buffer cleanup lock" that first appeared in nbtree many years ago. Standardize things by consistently using the term cleanup lock. This finishes work started by commit `276db875`. There is no good reason to have two terms. But there is a good reason to only have one: to avoid confusion around why VACUUM acquires a full cleanup lock (not just an ordinary exclusive lock) in index AMs, during ambulkdelete calls. This has nothing to do with protecting the physical index data structure itself. It is needed to implement a locking protocol that ensures that TIDs pointing to the heap/table structure cannot get marked for recycling by VACUUM before it is safe (which is somewhat similar to how VACUUM uses cleanup locks during its first heap pass). Note that it isn't strictly necessary for index AMs to implement this locking protocol -- several index AMs use an MVCC snapshot as their sole interlock to prevent unsafe TID recycling. In passing, update the nbtree README. Cleanly separate discussion of the aforementioned index vacuuming locking protocol from discussion of the "drop leaf page pin" optimization added by commit `2ed5b87f`. We now structure discussion of the latter by describing how individual index scans may safely opt out of applying the standard locking protocol (and so can avoid blocking progress by VACUUM). Also document why the optimization is not safe to apply during nbtree index-only scans. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzngHgQa92tz6NQihf4nxJwRzCV36yMJO_i8dS+2mgEVKw@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkHPgsBBvGWjz=8PjNhDefy7XRkDKiT5NxMs-n5ZCf2dA@mail.gmail.com	2021-12-08 17:24:45 -08:00
Michael Paquier	f99870dd86	Fix corruption of toast indexes with REINDEX CONCURRENTLY REINDEX CONCURRENTLY run on a toast index or a toast relation could corrupt the target indexes rebuilt, as a backend running in parallel that manipulates toast values would directly release the lock on the toast relation when its local operation is done, rather than releasing the lock once the transaction that manipulated the toast values committed. The fix done here is simple: we now hold a ROW EXCLUSIVE lock on the toast relation when saving or deleting a toast value until the transaction working on them is committed, so as a concurrent reindex happening in parallel would be able to wait for any activity and see any new rows inserted (or deleted). An isolation test is added to check after the case fixed here, which is a bit fancy by design as it relies on allow_system_table_mods to rename the toast table and its index to fixed names. This way, it is possible to reindex them directly without any dependency on the OID of the underlying relation. Note that this could not use a DO block either, as REINDEX CONCURRENTLY cannot be run in a transaction block. The test is backpatched down to 13, where it is possible, thanks to `c4a7a39`, to use allow_system_table_mods in a test suite. Reported-by: Alexey Ermakov Analyzed-by: Andres Freund, Noah Misch Author: Michael Paquier Reviewed-by: Nathan Bossart Discussion: https://postgr.es/m/17268-d2fb426e0895abd4@postgresql.org Backpatch-through: 12	2021-12-08 11:01:08 +09:00
Daniel Gustafsson	018b800245	Remove mention of TimeLineID update from comments Commit `4a92a1c3d` removed the TimeLineID update from RecoveryInProgress, update comments accordingly. Author: Amul Sul <sulamul@gmail.com> Discussion: https://postgr.es/m/CAAJ_b96wyzs8N45jc-kYd-bTE02hRWQieLZRpsUtNbhap7_PuQ@mail.gmail.com	2021-12-01 14:17:24 +01:00
Peter Geoghegan	4bdfe68559	vacuumlazy.c: fix remaining "dead tuple" references. Oversight in commit `4f8d9d12`. Reported-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoDm38Em0bvRqeQKr4HPvOj65Y8cUgCP4idMk39iaLrxyw@mail.gmail.com	2021-11-30 11:40:33 -08:00
Tomas Vondra	5753d4ee32	Ignore BRIN indexes when checking for HOT updates When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed only by BRIN indexes. There are no index pointers to individual tuples in BRIN, and the page range summary will be updated anyway as it relies on visibility info. This also removes rd_indexattr list, and replaces it with rd_attrsvalid flag. The list was not used anywhere, and a simple flag is sufficient. Patch by Josef Simanek, various fixes and improvements by me. Author: Josef Simanek Reviewed-by: Tomas Vondra, Alvaro Herrera Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com	2021-11-30 20:04:38 +01:00
Alvaro Herrera	4c83e59e01	Increase size of shared memory for pg_commit_ts Like `5364b357fb` did for pg_commit, change the formula used to determine number of pg_commit_ts buffers, which helps performance with larger servers. Discussion: https://postgr.es/m/20210115220744.GA24457@alvherre.pgsql Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>	2021-11-30 14:29:31 -03:00
Peter Geoghegan	4f8d9d1217	vacuumlazy.c: Rename dead_tuples to dead_items. Commit `8523492d` simplified what it meant for an item to be considered "dead" to VACUUM: TIDs collected in memory (in preparation for index vacuuming) must always come from LP_DEAD stub line pointers in heap pages, found following pruning. This formalized the idea that index vacuuming (and heap vacuuming) are optional processes. Unlike pruning, they can be delayed indefinitely, without any risk of that violating fundamental invariants. For example, leaving LP_DEAD items behind clearly won't add to the risk of transaction ID wraparound. You can't have transaction ID wraparound without transaction IDs. Renaming anything that references DEAD tuples (tuples with storage) reinforces all this. Code outside vacuumlazy.c continues to fudge the distinction between dead/deleted tuples, and LP_DEAD items. This is necessary because autovacuum scheduling is still mostly driven by "dead items/tuples" statistics. In the future we may find it useful to replace this model with something more sophisticated, as a step towards teaching autovacuum to perform more frequent vacuuming that targets individual indexes that happen to be more prone to becoming bloated through version churn. In passing, simplify some function signatures that deal with VACUUM's dead_items array. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-WzktGBg4si6DEdmq3q6SoXSDqNi6MtmB8CmmTmvhsxDTLA@mail.gmail.com	2021-11-29 09:58:01 -08:00
Michael Paquier	6fb7c5d67c	Centralize timestamp computation of control file on updates This commit moves the timestamp computation of the control file within the routine of src/common/ in charge of updating the backend's control file, which is shared by multiple frontend tools (pg_rewind, pg_checksums and pg_resetwal) and the backend itself. This change has as direct effect to update the control file's timestamp when writing the control file in pg_rewind and pg_checksums, something that is helpful to keep track of control file updates for those operations, something also tracked by the backend at startup within its logs. This part is arguably a bug, as ControlFileData->time should be updated each time a new version of the control file is written, but this is a behavior change so no backpatch is done. Author: Amul Sul Reviewed-by: Nathan Bossart, Michael Paquier, Bharath Rupireddy Discussion: https://postgr.es/m/CAAJ_b97nd_ghRpyFV9Djf9RLXkoTbOUqnocq11WGq9TisX09Fw@mail.gmail.com	2021-11-29 13:36:13 +09:00
Tom Lane	3804539e48	Replace random(), pg_erand48(), etc with a better PRNG API and algorithm. Standardize on xoroshiro128 as our basic PRNG algorithm, eliminating a bunch of platform dependencies as well as fundamentally-obsolete PRNG code. In addition, this API replacement will ease replacing the algorithm again in future, should that become necessary. xoroshiro128 is a few percent slower than the drand48 family, but it can produce full-width 64-bit random values not only 48-bit, and it should be much more trustworthy. It's likely to be noticeably faster than the platform's random(), depending on which platform you are thinking about; and we can have non-global state vectors easily, unlike with random(). It is not cryptographically strong, but neither are the functions it replaces. Fabien Coelho, reviewed by Dean Rasheed, Aleksander Alekseev, and myself Discussion: https://postgr.es/m/alpine.DEB.2.22.394.2105241211230.165418@pseudo	2021-11-28 21:33:07 -05:00
Peter Geoghegan	276db875d4	vacuumlazy.c: prefer the term "cleanup lock". The term "super-exclusive lock" is an acceptable synonym of "cleanup lock". Even still, switching from one term to the other in the same file is confusing. Standardize on "cleanup lock" within vacuumlazy.c. Per a complaint from Andres Freund.	2021-11-27 16:05:01 -08:00
Peter Geoghegan	12b5ade902	Update high level vacuumlazy.c comments. Update vacuumlazy.c file header comments (as well as comments above the lazy_scan_heap function) that were largely written before the introduction of the HOT optimization, when lazy_scan_heap did far less, and didn't actually prune during its initial heap pass. Since lazy_scan_heap now outsources far more work to lower level functions, it makes sense to introduce the function by talking about the high level invariant that dictates the order in which each phase takes place. Also deemphasize the case where we run out of memory for TIDs, since delaying that discussion makes it easier to talk about issues of central importance. Finally, remove discussion of parallel VACUUM from header comments. These don't add much, and are in the wrong place.	2021-11-27 14:29:43 -08:00
Peter Geoghegan	1a6f5a0e87	Go back to considering HOT on pages marked full. Commit `2fd8685e7f` simplified the checking of modified attributes that takes place within heap_update(). This included a micro-optimization affecting pages marked PD_PAGE_FULL: don't even try to use HOT to save a few cycles on determining HOT safety. The assumption was that it won't work out this time around, since it can't have worked out last time around. Remove the micro-optimization. It could only ever save cycles that are consumed by the vast majority of heap_update() calls, which hardly seems worth the added complexity. It also seems quite possible that there are workloads that will do worse over time by repeated application of the micro-optimization, despite saving some cycles on average, in the short term. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CAH2-WznU1L3+DMPr1F7o2eJBT7=3bAJoY6ZkWABAxNt+-afyTA@mail.gmail.com	2021-11-26 10:58:38 -08:00
Alvaro Herrera	44bd3ed332	Fix determination of broken LSN in OVERWRITTEN_CONTRECORD In commit `ff9f111bce` I mixed up inconsistent definitions of the LSN of the first record in a page, when the previous record ends exactly at the page boundary. The correct LSN is adjusted to skip the WAL page header; I failed to use that when setting XLogReaderState->overwrittenRecPtr, so at WAL replay time VerifyOverwriteContrecord would refuse to let replay continue past that record. Backpatch to 10. 9.6 also contains this bug, but it's no longer being maintained. Discussion: https://postgr.es/m/45597.1637694259@sss.pgh.pa.us	2021-11-26 11:14:27 -03:00
Peter Eisentraut	36cb5e7c51	Update comments Various places wanted to point out that tuple descriptors don't contain the variable-length fields of pg_attribute. This started when attacl was added, but more fields have been added since, and these comments haven't been kept up to date consistently. Reword so that the purpose is clearer and we don't have to keep updating them.	2021-11-26 09:57:23 +01:00
Andres Freund	3030903dfe	Replace straggling uses of ReadRecPtr/EndRecPtr. `d2ddfa681d` removed ReadRecPtr/EndRecPtr, but two uses within an #ifdef WAL_DEBUG escaped. Discussion: https://postgr.es/m/20211124231206.gbadj5bblcljb6d5@alap3.anarazel.de	2021-11-24 16:56:14 -08:00
Robert Haas	d2ddfa681d	xlog.c: Remove global variables ReadRecPtr and EndRecPtr. In most places, the variables necessarily store the same value as the eponymous members of the XLogReaderState that we use during WAL replay, because ReadRecord() assigns the values from the structure members to the global variables just after XLogReadRecord() returns. However, XLogBeginRead() adjusts the structure members but not the global variables, so after XLogBeginRead() and before the completion of XLogReadRecord() the values can differ. Otherwise, they must be identical. According to my analysis, the only place where either variable is referenced at a point where it might not have the same value as the structure member is the reference to EndRecPtr within XLogPageRead. Therefore, at every other place where we are using the global variable, we can just switch to using the structure member instead, and remove the global variable. However, we can, and in fact should, do this in XLogPageRead() as well, because at that point in the code, the global variable will actually store the start of the record we want to read - either because it's where the last WAL record ended, or because the read position has been changed using XLogBeginRead since the last record was read. The structure member, on the other hand, will already have been updated to point to the end of the record we just read. Elsewhere, the latter is what we use as an argument to emode_for_corrupt_record(), so we should do the same here. This part of the patch is perhaps a bug fix, but I don't think it has any important consequences, so no back-patch. The point here is just to continue to whittle down the entirely excessive use of global variables in xlog.c. Discussion: http://postgr.es/m/CA+Tgmoao96EuNeSPd+hspRKcsCddu=b1h-QNRuKfY8VmfNQdfg@mail.gmail.com	2021-11-24 11:27:39 -05:00
Robert Haas	e7ea2fa342	Fix corner-case failure to detect improper timeline switch. rescanLatestTimeLine() contains a guard against switching to a timeline that forked off from the current one prior to the current recovery point, but that guard does not work if the timeline switch occurs before the first WAL record (which must be the checkpoint record) is read. Without this patch, an improper timeline switch is therefore possible in such cases. This happens because rescanLatestTimeLine() relies on the global variable EndRecPtr to understand the current position of WAL replay. However, EndRecPtr at this point in the code contains the endpoint of the last-replayed record, not the startpoint or endpoint of the record being replayed now. Thus, before any records have been replayed, it's zero, which causes the sanity check to always pass. To fix, pass down the correct timeline explicitly. The EndRecPtr value we want is the one from the xlogreader, which will be the starting position of the record we're about to try to read, rather than the global variable, which is the ending position of the last record we successfully read. They're usually the same, but not in the corner case described here. No back-patch, because in v14 and earlier branches, we were using the wrong TLI here as well as the wrong LSN. In master, that was fixed by commit `4a92a1c3d1`, but that and its prerequisite patches are too invasive to back-patch for such a minor issue. Patch by me, reviewed by Amul Sul. Discussion: http://postgr.es/m/CA+Tgmoao96EuNeSPd+hspRKcsCddu=b1h-QNRuKfY8VmfNQdfg@mail.gmail.com	2021-11-24 08:13:10 -05:00
Alvaro Herrera	2fed48f48f	Be more specific about OOM in XLogReaderAllocate A couple of spots can benefit from an added errdetail(), which matches what we were already doing in other places; and those that cannot withstand errdetail() can get a more descriptive primary message. Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://postgr.es/m/CALj2ACV+cX1eM03GfcA=ZMLXh5fSn1X1auJLz3yuS1duPSb9QA@mail.gmail.com	2021-11-22 13:43:43 -03:00
Fujii Masao	1b06d7bac9	Report wait events for local shell commands like archive_command. This commit introduces new wait events for archive_command, archive_cleanup_command, restore_command and recovery_end_command. Author: Fujii Masao Reviewed-by: Bharath Rupireddy, Michael Paquier Discussion: https://postgr.es/m/4ca4f920-6b48-638d-08b2-93598356f5d3@oss.nttdata.com	2021-11-22 10:28:21 +09:00
Peter Geoghegan	97f5aef609	Remove lazy_scan_heap parallel VACUUM comment block. This doesn't belong next to very high level discussion of the tasks that lazy_scan_heap performs. There is already a similar, longer comment block at the top of vacuumlazy.c that mentions lazy_scan_heap directly.	2021-11-21 16:22:57 -08:00
Tom Lane	f4e7ae2b8a	Fix SP-GiST scan initialization logic for binary-compatible cases. Commit `ac9099fc1` rearranged the logic in spgGetCache() that determines the index's attType (nominal input data type) and leafType (actual type stored in leaf index tuples). Turns out this broke things for the case where (a) the actual input data type is different from the nominal type, (b) the opclass's config function leaves leafType defaulted, and (c) the opclass has no "compress" function. (b) caused us to assign the actual input data type as leafType, and then since that's not attType, we complained that a "compress" function is required. For non-polymorphic opclasses, condition (a) arises in binary-compatible cases, such as using SP-GiST text_ops for a varchar column, or using any opclass on a domain over its nominal input type. To fix, use attType for leafType when the index's declared column type is different from but binary-compatible with attType. Do this only in the defaulted-leafType case, to avoid overriding any explicit selection made by the opclass. Per bug #17294 from Ilya Anfimov. Back-patch to v14. Discussion: https://postgr.es/m/17294-8f6c7962ce877edc@postgresql.org	2021-11-20 14:29:56 -05:00
Amit Kapila	0f0cfb4940	Fix parallel operations that prevent oldest xmin from advancing. While determining xid horizons, we skip over backends that are running Vacuum. We also ignore Create Index Concurrently, or Reindex Concurrently for the purposes of computing Xmin for Vacuum. But we were not setting the flags corresponding to these operations when they are performed in parallel which was preventing Xid horizon from advancing. The optimization related to skipping Create Index Concurrently, or Reindex Concurrently operations was implemented in PG-14 but the fix is the same for the Parallel Vacuum as well so back-patched till PG-13. Author: Masahiko Sawada Reviewed-by: Amit Kapila Backpatch-through: 13 Discussion: https://postgr.es/m/CAD21AoCLQqgM1sXh9BrDFq0uzd3RBFKi=Vfo6cjjKODm0Onr5w@mail.gmail.com	2021-11-19 09:04:40 +05:30
Michael Paquier	f975fc3a35	Remove global variable "LastRec" in xlog.c This variable is used only by StartupXLOG() now, so let's make it local to simplify the code. Author: Amul Sul Reviewed-by: Tom Lane, Michael Paquier Discussion: https://postgr.es/m/CAAJ_b96Qd023itERBRN9Z7P2saNDT3CYvGuMO8RXwndVNN6z7g@mail.gmail.com	2021-11-17 11:04:18 +09:00
Robert Haas	e51c46991f	Move InitXLogInsert() call from InitXLOGAccess() to BaseInit(). At present, there is an undocumented coding rule that you must call RecoveryInProgress(), or do something else that results in a call to InitXLogInsert(), before trying to write WAL. Otherwise, the WAL construction buffers won't be initialized, resulting in failures. Since it's not good to rely on a status inquiry function like RecoveryInProgress() having the side effect of initializing critical data structures, instead do the initialization earlier, when the backend first starts up. Patch by me. Reviewed by Nathan Bossart and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoY7b65qRjzHN_tWUk8B4sJqk1vj1d31uepVzmgPnZKeLg@mail.gmail.com	2021-11-16 09:43:17 -05:00
Peter Geoghegan	b0f7425ec2	Explain pruning pgstats accounting subtleties. Add a comment explaining why the pgstats accounting used during opportunistic heap pruning operations (to maintain the current number of dead tuples in the relation) needs to compensate by subtracting away the number of new LP_DEAD items. This is needed so it can avoid completely forgetting about tuples that become LP_DEAD items during pruning -- they should still count. It seems more natural to discuss this issue at the only relevant call site (opportunistic pruning), since the same issue does not apply to the only other caller (the VACUUM call site). Move everything there too. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wzm7f+A6ej650gi_ifTgbhsadVW5cujAL3punpupHff5Yg@mail.gmail.com	2021-11-12 19:45:58 -08:00
Noah Misch	3354746910	Report any XLogReadRecord() error in XlogReadTwoPhaseData(). Buildfarm members kittiwake and tadarida have witnessed errors at this site. The site discarded key facts. Back-patch to v10 (all supported versions). Reviewed by Michael Paquier and Tom Lane. Discussion: https://postgr.es/m/20211107013157.GB790288@rfd.leadboat.com	2021-11-11 17:10:18 -08:00
Peter Geoghegan	42f9427aa9	Update heap_page_prune() free space map comments. It is up to the heap_page_prune() caller to decide what to do about updating the FSM for a page following pruning. Update old comments that address what we might want to do as if it was the responsibility of heap_page_prune() itself. heap_page_prune() doesn't have enough high-level context to make a sensible choice.	2021-11-11 13:42:17 -08:00
Peter Geoghegan	eb9baef8e9	Update another obsolete reference in vacuumlazy.c. Addresses an oversight in commit `7ab96cf6`.	2021-11-11 13:13:08 -08:00
Robert Haas	beb4e9ba16	Improve performance of pgarch_readyXlog() with many status files. Presently, the archive_status directory was scanned for each file to archive. When there are many status files, say because archive_command has been failing for a long time, these directory scans can get very slow. With this change, the archiver remembers several files to archive during each directory scan, speeding things up. To ensure timeline history files are archived as quickly as possible, XLogArchiveNotify() forces the archiver to do a new directory scan as soon as the .ready file for one is created. Nathan Bossart, per a long discussion involving many people. It is not clear to me exactly who out of all those people reviewed this particular patch. Discussion: http://postgr.es/m/CA+TgmobhAbs2yabTuTRkJTq_kkC80-+jw=pfpypdOJ7+gAbQbw@mail.gmail.com Discussion: http://postgr.es/m/620F3CE1-0255-4D66-9D87-0EADE866985A@amazon.com	2021-11-11 15:20:26 -05:00
Robert Haas	a27048cbcb	More cleanup of 'ThisTimeLineID'. In XLogCtlData, rename the structure member ThisTimeLineID to InsertTimeLineID and update the comments to make clear that it's only expected to be set after recovery is complete. In StartupXLOG, replace the local variables ThisTimeLineID and PrevTimeLineID with new local variables replayTLI and newTLI. In the old scheme, ThisTimeLineID was the replay TLI until we created a new timeline, and after that the replay TLI was in PrevTimeLineID. Now, replayTLI is the TLI from which we last replayed WAL throughout the entire function, and newTLI is either that, or the new timeline created upon promotion. Remove some misleading comments from the comment block just above where recoveryTargetTimeLineGoal and friends are declared. It's become incorrect, not only because ThisTimeLineID as a variable is now gone, but also because the rmgr code does not care about ThisTimeLineID and has not since what used to be the TLI field in the page header was repurposed to store the page checksum. Add a comment to GetFlushRecPtr that it's only supposed to be used in normal running, and an assertion to verify that this is so. Per some ideas from Michael Paquier and some of my own. Review by Michael Paquier also. Discussion: http://postgr.es/m/CA+TgmoY1a2d1AnVR3tJcKmGGkhj7GGrwiNwjtKr21dxOuLBzCQ@mail.gmail.com	2021-11-10 09:45:24 -05:00
Tom Lane	c3ec4f8fe8	Silence uninitialized-variable warning. Quite a few buildfarm animals are warning about this, and lapwing is actually failing (because -Werror). It's a false positive AFAICS, so no need to do more than zero the variable to start with. Discussion: https://postgr.es/m/YYXJnUxgw9dZKxlX@paquier.xyz	2021-11-07 12:18:18 -05:00
Peter Geoghegan	02f9fd1294	Update obsolete reference in vacuumlazy.c. Oversight in commit `7ab96cf6`.	2021-11-05 23:38:07 -07:00
Tomas Vondra	d91353f4b2	Fix handling of NaN values in BRIN minmax multi. When calculating distance between float4/float8 values, we need to be a bit more careful about NaN values in order not to trigger an assert. We consider NaN values to be equal (distance 0.0) and at infinite distance from all other values. On builds without asserts, this issue is mostly harmless - the ranges may be merged in a less efficient order, but the index is still correct. Per report from Andreas Seltenreich. Backpatch to 14, where this new BRIN opclass was introduced. Reported-by: Andreas Seltenreich Discussion: https://postgr.es/m/87r1bw9ukm.fsf@credativ.de	2021-11-06 01:50:44 +01:00
Peter Geoghegan	f214960add	Update obsolete heap pruning comments. Add new comments that spell out what VACUUM expects from heap pruning: pruning must never leave behind DEAD tuples that still have tuple storage. This has at least been the case since commit `8523492d`, which established the principle that vacuumlazy.c doesn't have to deal with DEAD tuples that still have tuple storage directly, except perhaps by simply retrying pruning (to handle a rare corner case involving concurrent transaction abort). In passing, update some references to old symbol names that were missed by the snapshot scalability work (specifically commit `dc7420c2c9`).	2021-11-05 14:08:47 -07:00
Robert Haas	4a92a1c3d1	Change ThisTimeLineID from a global variable to a local variable. StartupXLOG() still has ThisTimeLineID as a local variable, but the remaining code in xlog.c now needs to obtain the relevant TimeLineID by some other means. Mostly, this means that we now pass it as a function parameter to a bunch of functions where we didn't previously. However, a few cases require special handling: - In functions that might be called by outside callers who wouldn't necessarily know what timeline to specify, we get the timeline ID from shared memory. XLogCtl->ThisTimeLineID can be used in most cases since recovery is known to have completed by the time those functions are called. In xlog_redo(), we can use XLogCtl->replayEndTLI. - XLogFileClose() needs to know the TLI of the open logfile. Do that with a new global variable openLogTLI. While someone could argue that this is just trading one global variable for another, the new one has a far narrower purpose and is referenced in just a few places. - read_backup_label() now returns the TLI that it obtains by parsing the backup_label file. Previously, ReadRecord() could be called to parse the checkpoint record without ThisTimeLineID having been initialized. Now, the timeline is passed down, and I didn't want to pass an uninitialized variable; this change lets us avoid that. The old coding didn't seem to have any practical consequences that we need to worry about, but this is cleaner. - In BootstrapXLOG(), it's just a constant. Patch by me, reviewed and tested by Michael Paquier, Amul Sul, and Álvaro Herrera. Discussion: https://postgr.es/m/CA+TgmobfAAqhfWa1kaFBBFvX+5CjM=7TE=n4r4Q1o2bjbGYBpA@mail.gmail.com	2021-11-05 12:53:15 -04:00
Robert Haas	e997a0c642	Remove all use of ThisTimeLineID global variable outside of xlog.c. All such code deals with this global variable in one of three ways. Sometimes the same functions use it in more than one of these ways at the same time. First, sometimes it's an implicit argument to one or more functions being called in xlog.c or elsewhere, and must be set to the appropriate value before calling those functions lest they misbehave. In those cases, it is now passed as an explicit argument instead. Second, sometimes it's used to obtain the current timeline after the end of recovery, i.e. the timeline to which WAL is being written and flushed. Such code now calls GetWALInsertionTimeLine() or relies on the new out parameter added to GetFlushRecPtr(). Third, sometimes it's used during recovery to store the current replay timeline. That can change, so such code must generally update the value before each use. It can still do that, but must now use a local variable instead. The net effect of these changes is to reduce by a fair amount the amount of code that is directly accessing this global variable. That's good, because history has shown that we don't always think clearly about which timeline ID it's supposed to contain at any given point in time, or indeed, whether it has been or needs to be initialized at any given point in the code. Patch by me, reviewed and tested by Michael Paquier, Amul Sul, and Álvaro Herrera. Discussion: https://postgr.es/m/CA+TgmobfAAqhfWa1kaFBBFvX+5CjM=7TE=n4r4Q1o2bjbGYBpA@mail.gmail.com	2021-11-05 12:50:01 -04:00
Peter Geoghegan	e7428a99a1	Add hardening to catch invalid TIDs in indexes. Add hardening to the heapam index tuple deletion path to catch TIDs in index pages that point to a heap item that index tuples should never point to. The corruption we're trying to catch here is particularly tricky to detect, since it typically involves "extra" (corrupt) index tuples, as opposed to the absence of required index tuples in the index. For example, a heap TID from an index page that turns out to point to an LP_UNUSED item in the heap page has a good chance of being caught by one of the new checks. There is a decent chance that the recently fixed parallel VACUUM bug (see commit `9bacec15`) would have been caught had that particular check been in place for Postgres 14. No backpatch of this extra hardening for now, though. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wzk-4_raTzawWGaiqNvkpwDXxv3y1AQhQyUeHfkU=tFCeA@mail.gmail.com	2021-11-04 19:54:05 -07:00
Peter Geoghegan	5cd7eb1f1c	Add various assertions to heap pruning code. These assertions document (and verify) our high level assumptions about how pruning can and cannot affect existing items from target heap pages. For example, one of the new assertions verifies that pruning does not set a heap-only tuple to LP_DEAD. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wz=vhvBx1GjF+oueHh8YQcHoQYrMi0F0zFMHEr8yc4sCoA@mail.gmail.com	2021-11-04 19:07:54 -07:00
Peter Geoghegan	c59278a1aa	Fix parallel amvacuumcleanup safety bug. Commit `b4af70cb` inverted the return value of the function parallel_processing_is_safe(), but missed the amvacuumcleanup test. Index AMs that don't support parallel cleanup at all were affected. The practical consequences of this bug were not very serious. Hash indexes are affected, but since they just return the number of blocks during hashvacuumcleanup anyway, it can't have had much impact. Author: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoA-Em+aeVPmBbL_s1V-ghsJQSxYL-i3JP8nTfPiD1wjKw@mail.gmail.com Backpatch: 14-, where commit `b4af70cb` appears.	2021-11-02 19:52:11 -07:00
Peter Geoghegan	9bacec15b6	Don't overlook indexes during parallel VACUUM. Commit `b4af70cb`, which simplified state managed by VACUUM, performed refactoring of parallel VACUUM in passing. Confusion about the exact details of the tasks that the leader process is responsible for led to code that made it possible for parallel VACUUM to miss a subset of the table's indexes entirely. Specifically, indexes that fell under the min_parallel_index_scan_size size cutoff were missed. These indexes are supposed to be vacuumed by the leader (alongside any parallel unsafe indexes), but weren't vacuumed at all. Affected indexes could easily end up with duplicate heap TIDs, once heap TIDs were recycled for new heap tuples. This had generic symptoms that might be seen with almost any index corruption involving structural inconsistencies between an index and its table. To fix, make sure that the parallel VACUUM leader process performs any required index vacuuming for indexes that happen to be below the size cutoff. Also document the design of parallel VACUUM with these below-size-cutoff indexes. It's unclear how many users might be affected by this bug. There had to be at least three indexes on the table to hit the bug: a smaller index, plus at least two additional indexes that themselves exceed the size cutoff. Cases with just one additional index would not run into trouble, since the parallel VACUUM cost model requires two larger-than-cutoff indexes on the table to apply any parallel processing. Note also that autovacuum was not affected, since it never uses parallel processing. Test case based on tests from a larger patch to test parallel VACUUM by Masahiko Sawada. Many thanks to Kamigishi Rei for her invaluable help with tracking this problem down. 
Author: Peter Geoghegan <pg@bowt.ie> Author: Masahiko Sawada <sawada.mshk@gmail.com> Reported-By: Kamigishi Rei <iijima.yun@koumakan.jp> Reported-By: Andrew Gierth <andrew@tao11.riddles.org.uk> Diagnosed-By: Andres Freund <andres@anarazel.de> Bug: #17245 Discussion: https://postgr.es/m/17245-ddf06aaf85735f36@postgresql.org Discussion: https://postgr.es/m/20211030023740.qbnsl2xaoh2grq3d@alap3.anarazel.de Backpatch: 14-, where the refactoring commit appears.	2021-11-02 12:06:17 -07:00
Amit Kapila	335397456b	Move MarkCurrentTransactionIdLoggedIfAny() out of the critical section. We don't modify any shared state in this function which could cause problems for any concurrent session. This will make it look similar to the other updates for the same structure (TransactionState) which avoids confusion for future readers of code. Author: Dilip Kumar Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/E1mSoYz-0007Fh-D9@gemulon.postgresql.org	2021-11-02 09:11:05 +05:30
Amit Kapila	71db6459e6	Replace XLOG_INCLUDE_XID flag with a more localized flag. Commit `0bead9af48` introduced XLOG_INCLUDE_XID flag to indicate that the WAL record contains subXID-to-topXID association. It uses that flag later to mark in CurrentTransactionState that top-xid is logged so that we should not try to log it again with the next WAL record in the current subtransaction. However, we can use a localized variable to pass that information. In passing, change the related function and variable names to make them consistent with what the code is actually doing. Author: Dilip Kumar Reviewed-by: Alvaro Herrera, Amit Kapila Discussion: https://postgr.es/m/E1mSoYz-0007Fh-D9@gemulon.postgresql.org	2021-11-02 08:35:29 +05:30
Daniel Gustafsson	43a134f28b	Replace unicode characters in comments with ascii. The unicode characters, while in comments and not code, caused MSVC to emit compiler warning C4819: The file contains a character that cannot be represented in the current code page (number). Save the file in Unicode format to prevent data loss. Fix by replacing the characters in print.c with descriptive comments containing the codepoints and symbol names, and remove the character in brin_bloom.c which was a footnote reference copied from the paper citation. Per report from hamerkop in the buildfarm. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/340E4118-0D0C-4E85-8141-8C40EB22DA3A@yesql.se	2021-11-01 22:42:49 +01:00
Peter Geoghegan	5f55fc5a34	Demote pg_unreachable() in heapam to an assertion. Commit `d168b66682`, which overhauled index deletion, added a pg_unreachable() to the end of a sort comparator used when sorting heap TIDs from an index page. This allows the compiler to apply optimizations that assume that the heap TIDs from the index AM must always be unique. That doesn't seem like a good idea now, given recent reports of corruption involving duplicate TIDs in indexes on Postgres 14. Demote to an assertion, just in case. Backpatch: 14-, where index deletion was overhauled.	2021-10-29 10:53:48 -07:00
Peter Geoghegan	4c6afd805b	Remove obsolete nbtree LP_DEAD item comments. Comments above _bt_findinsertloc() that talk about LP_DEAD items are now out of place. We already discuss index tuple deletion at an earlier point in the same comment block. Oversight in commit `d168b666`.	2021-10-27 14:35:21 -07:00
Daniel Gustafsson	8af57ad815	Fix typos in comments. Author: Peter Smith <smithpb2250@gmail.com> Discussion: https://postgr.es/m/CAHut+PsN_gmKu-KfeEb9NDARoTPbs4AN4PPu=6LZXFZRJ13SEw@mail.gmail.com	2021-10-27 22:38:38 +02:00
Peter Geoghegan	c2381b5104	Fix ordering of items in nbtree error message. Oversight in commit `a5213adf`. Backpatch: 13-, just like commit `a5213adf`.	2021-10-27 13:09:24 -07:00
Peter Geoghegan	a5213adf3d	Further harden nbtree posting split code. Add more defensive checks around posting list split code. These should detect corruption involving duplicate table TIDs earlier and more reliably than any existing check. Follow up to commit `8f72bbac`. Discussion: https://postgr.es/m/CAH2-WzkrSY_kjyd1_M5xJK1uM0govJXMxPn8JUSvwcUOiHuWVw@mail.gmail.com Backpatch: 13-, where nbtree deduplication was introduced.	2021-10-27 12:10:47 -07:00
Robert Haas	a030a0c5cc	Initialize variable to placate compiler. Per Nathan Bossart. Discussion: http://postgr.es/m/FECEE7FC-CB74-45A9-BB24-89FEE52A9585@amazon.com	2021-10-25 16:31:00 -04:00
Robert Haas	9ce346eabf	Report progress of startup operations that take a long time. Users sometimes get concerned when they start the server and it emits a few messages and then doesn't emit any more messages for a long time. Generally, what's happening is either that the system is taking a long time to apply WAL, or it's taking a long time to reset unlogged relations, or it's taking a long time to fsync the data directory, but it's not easy to tell which is the case. To fix that, add a new 'log_startup_progress_interval' setting, by default 10s. When an operation that is known to be potentially long-running takes more than this amount of time, we'll log a status update each time this interval elapses. To avoid undesirable log chatter, don't log anything about WAL replay when in standby mode. Nitin Jadhav and Robert Haas, reviewed by Amul Sul, Bharath Rupireddy, Justin Pryzby, Michael Paquier, and Álvaro Herrera. Discussion: https://postgr.es/m/CA+TgmoaHQrgDFOBwgY16XCoMtXxsrVGFB2jNCvb7-ubuEe1MGg@mail.gmail.com Discussion: https://postgr.es/m/CAMm1aWaHF7VE69572_OLQ+MgpT5RUiUDgF1x5RrtkJBLdpRj3Q@mail.gmail.com	2021-10-25 11:51:57 -04:00
Robert Haas	18e0913a42	StartupXLOG: Don't repeatedly disable/enable local xlog insertion. All the code that runs in the startup process to write WAL records before that's allowed generally is now consecutive, so there's no reason to shut the facility to write WAL locally off and then turn it on again three times in a row. Unfortunately, this requires a slight kludge in the checkpointer, which needs to separately enable writing WAL in order to write the checkpoint record. Because that code might run in the same process as StartupXLOG() if we are in single-user mode, we must save/restore the state of the LocalXLogInsertAllowed flag. Hopefully, we'll be able to eliminate this wart in further refactoring, but it's not too bad anyway. Amul Sul, with modifications by me. Discussion: http://postgr.es/m/CAAJ_b97fysj6sRSQEfOHj-y8Jfd5uPqOgO74qast89B4WfD+TA@mail.gmail.com	2021-10-25 10:16:28 -04:00
Robert Haas	a75dbf7f9e	StartupXLOG: Call CleanupAfterArchiveRecovery after XLogReportParameters. This does a better job grouping related operations together, since all of the WAL records that we need to write prior to allowing WAL writes generally are written by a single uninterrupted stretch of code. Since CleanupAfterArchiveRecovery() just (1) runs recovery_end_command, (2) removes non-parent xlog files, and (3) archives any final partial segment, this should be safe, because all of those things are pretty much unrelated to the WAL record written by XLogReportParameters(). Amul Sul, per a suggestion from me. Discussion: http://postgr.es/m/CAAJ_b97fysj6sRSQEfOHj-y8Jfd5uPqOgO74qast89B4WfD+TA@mail.gmail.com	2021-10-25 10:02:36 -04:00
Noah Misch	3cd9c3b921	Fix CREATE INDEX CONCURRENTLY for the newest prepared transactions. The purpose of commit `8a54e12a38` was to fix this, and it sufficed when the PREPARE TRANSACTION completed before the CIC looked for lock conflicts. Otherwise, things still broke. As before, in a cluster having used CIC while having enabled prepared transactions, queries that use the resulting index can silently fail to find rows. It may be necessary to reindex to recover from past occurrences; REINDEX CONCURRENTLY suffices. Fix this for future index builds by making CIC wait for arbitrarily-recent prepared transactions and for ordinary transactions that may yet PREPARE TRANSACTION. As part of that, have PREPARE TRANSACTION transfer locks to its dummy PGPROC before it calls ProcArrayClearTransaction(). Back-patch to 9.6 (all supported versions). Andrey Borodin, reviewed (in earlier versions) by Andres Freund. Discussion: https://postgr.es/m/01824242-AA92-4FE9-9BA7-AEBAFFEA3D0C@yandex-team.ru	2021-10-23 18:36:38 -07:00
Michael Paquier	409f9ca447	Reset properly snapshot export state during transaction abort. During a replication slot creation, an ERROR generated in the same transaction as the one creating a to-be-exported snapshot would have left the backend in an inconsistent state, as the associated static export snapshot state was not being reset on transaction abort, but only on the follow-up command received by the WAL sender that created this snapshot on replication slot creation. This would trigger inconsistency failures if this session tried to export again a snapshot, like during the creation of a replication slot. Note that a snapshot export cannot happen in a transaction block, so there is no need to worry about resetting this state for subtransaction aborts. Also, this inconsistent state would be very unlikely to show up for users. For example, one case where this could happen is an out-of-memory error when building the initial snapshot to-be-exported. Dilip found this problem while poking at a different patch, that caused an error in this code path for reasons unrelated to HEAD. Author: Dilip Kumar Reviewed-by: Michael Paquier, Zhihong Yu Discussion: https://postgr.es/m/CAFiTN-s0zA1Kj0ozGHwkYkHwa5U0zUE94RSc_g81WrpcETB5=w@mail.gmail.com Backpatch-through: 9.6	2021-10-18 11:55:42 +09:00
Peter Geoghegan	b76c1d6e84	Remove obsolete nbtree deduplication comments. Follow up to commit `2903f140`.	2021-10-15 15:25:20 -07:00
Robert Haas	811051c2e7	Postpone some end-of-recovery operations related to allowing WAL. CreateOverwriteContrecordRecord(), UpdateFullPageWrites(), PerformRecoveryXLogAction(), and CleanupAfterArchiveRecovery() are moved somewhat later in StartupXLOG(). This is preparatory work for a future patch that wants to allow recovery to end at one time and only later start to allow WAL writes. To do that, it's necessary to separate code that has to do with allowing WAL writes from other things that need to happen simply because recovery is ending, such as initializing shared memory data structures that depend on information that might not be accurate before redo is complete. This commit does not achieve that goal, but it is a step in that direction. For example, there are a few different bits of code that write things into WAL once we have finished recovery, and with this change, those bits of code are closer to each other than previously, with fewer unrelated bits of code interspersed. Robert Haas and Amul Sul Discussion: http://postgr.es/m/CAAJ_b97abMuq=470Wahun=aS1PHTSbStHtrjjPaD-C0YQ1AqVw@mail.gmail.com	2021-10-14 11:55:50 -04:00
Robert Haas	6df1543abf	Refactor some end-of-recovery code out of StartupXLOG(). Create a new function PerformRecoveryXLogAction() and move the code which either writes an end-of-recovery record or requests a checkpoint there. Also create a new function CleanupAfterArchiveRecovery() to perform a few tasks that we want to do after we've actually exited archive recovery but before we start accepting new WAL writes. More refactoring of this file is planned, but this commit is just straightforward code movement to make StartupXLOG() a little bit shorter and a little bit easier to understand. Robert Haas and Amul Sul Discussion: http://postgr.es/m/CAAJ_b97abMuq=470Wahun=aS1PHTSbStHtrjjPaD-C0YQ1AqVw@mail.gmail.com	2021-10-13 12:23:32 -04:00
Michael Paquier	68f7c4b57a	Clean up more code using "(expr) ? true : false". This is similar to `fd0625c`, taking care of any remaining code paths that are worth the cleanup. This also changes some cases using opposite expression patterns. Author: Justin Pryzby, Masahiko Sawada Discussion: https://postgr.es/m/CAD21AoCdF8dnUvr-BUWWGvA_XhKSoANacBMZb6jKyCk4TYfQ2Q@mail.gmail.com	2021-10-11 09:36:42 +09:00
Fujii Masao	68601985e6	Make recovery report error message when invalid page header is found. Commit `0668719801` changed XLogPageRead() so that it validated the page header, if invalid page header was found reset the error message and retried reading the page, to fix the scenario where streaming standby got stuck at a continuation record. This change hid the error message about invalid page header, which would make it harder for users to investigate what the actual issue was found in WAL. To fix the issue, this commit makes XLogPageRead() report the error message when invalid page header is found. When not in standby mode, an invalid page header should cause recovery to end, not retry reading the page, so XLogPageRead() doesn't need to validate the page header for the retry. Instead, ReadPageInternal() should be responsible for the validation in that case. Therefore this commit changes XLogPageRead() so that if not in standby mode it doesn't validate the page header for the retry. Reported-by: Yugo Nagata Author: Yugo Nagata, Kyotaro Horiguchi Reviewed-by: Ranier Vilela, Fujii Masao Discussion: https://postgr.es/m/20210718045505.32f463ed6c227111038d8ae4@sraoss.co.jp	2021-10-06 00:16:03 +09:00
Daniel Gustafsson	7111e332c5	Fix duplicate words in comments. Remove accidentally duplicated words in code comments. Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://postgr.es/m/87bl45t0co.fsf@wibble.ilmari.org	2021-10-04 15:12:57 +02:00
Daniel Gustafsson	941921b875	Replace occurrences of InvalidXid with InvalidTransactionId. While Xid is a known shortening of TransactionId, InvalidXid is not defined in the code. Fix comments which mistakenly were using the shorter version. Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CALj2ACUQzdigML868nV4cojfELPkEzNLNOk7b91Pho4JB90fng@mail.gmail.com	2021-10-04 10:31:01 +02:00
Michael Paquier	8a4237908c	Fix snapshot builds during promotion of hot standby node with 2PC. Some specific logic is done at the end of recovery when involving 2PC transactions: 1) Call RecoverPreparedTransactions(), to recover the state of 2PC transactions into memory (re-acquire locks, etc.). 2) ShutdownRecoveryTransactionEnvironment(), to move back to normal operations, mainly cleaning up recovery locks and KnownAssignedXids (including any 2PC transaction tracked previously). 3) Switch XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE, which is the tipping point for any process calling RecoveryInProgress() to check if the cluster is still in recovery or not. Any snapshot taken between steps 2) and 3) would be empty, causing any transaction relying on a snapshot at this point to potentially corrupt data as there could still be some 2PC transactions to track, with RecentXmin moving backwards on successive calls to GetSnapshotData() in the same transaction. As SharedRecoveryState is the point to take into account to know if it is safe to discard KnownAssignedXids, this commit moves step 2) after step 3), so that we can never end up with empty snapshots. This exists since the introduction of hot standby, so backpatch all the way down. The window with incorrect snapshots is extremely small, but I have seen it when running 023_pitr_prepared_xact.pl, as did buildfarm member fairywren. Thomas Munro also found it independently. Special thanks to Andres Freund for taking the time to analyze this issue. Reported-by: Thomas Munro, Michael Paquier Analyzed-by: Andres Freund Discussion: https://postgr.es/m/20210422203603.fdnh3fu2mmfp2iov@alap3.anarazel.de Backpatch-through: 9.6	2021-10-04 14:05:20 +09:00
Peter Geoghegan	2903f1404d	Enable deduplication in system catalog indexes. The "equality implies image equality" opclass infrastructure disallowed deduplication in system catalog indexes and TOAST indexes before now. That seemed like the right approach back when the infrastructure was added by commit `612a1ab7`, since ALTER INDEX cannot set deduplicate_items to 'off' (due to an old implementation restriction). But that decision now seems arbitrary at best. Remove special case handling implementing this policy. No catversion bump, since existing catalog indexes will still work. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=rYQHFaJ3WYBdK=xgwxKzaiGMSSrh-ZCREa-pS-7Zjew@mail.gmail.com	2021-10-02 17:12:59 -07:00
Alvaro Herrera	d186d233df	Remove unstable, unnecessary test; fix typo. Commit `ff9f111bce` added some test code that's unportable and doesn't add meaningful coverage. Remove it rather than try and get it to work everywhere. While at it, fix a typo in a log message added by the aforementioned commit. Backpatch to 14. Discussion: https://postgr.es/m/3000074.1632947632@sss.pgh.pa.us	2021-10-01 18:03:11 -03:00
Tom Lane	7b5d4c29ed	Fix Portal snapshot tracking to handle subtransactions properly. Commit `84f5c2908` forgot to consider the possibility that EnsurePortalSnapshotExists could run inside a subtransaction with lifespan shorter than the Portal's. In that case, the new active snapshot would be popped at the end of the subtransaction, leaving a dangling pointer in the Portal, with mayhem ensuing. To fix, make sure the ActiveSnapshot stack entry is marked with the same subtransaction nesting level as the associated Portal. It's certainly safe to do so since we won't be here at all unless the stack is empty; hence we can't create an out-of-order stack. Let's also apply this logic in the case where PortalRunUtility sets portalSnapshot, just to be sure that path can't cause similar problems. It's slightly less clear that that path can't create an out-of-order stack, so add an assertion guarding it. Report and patch by Bertrand Drouvot (with kibitzing by me). Back-patch to v11, like the previous commit. Discussion: https://postgr.es/m/ff82b8c5-77f4-3fe7-6028-fcf3303e82dd@amazon.com	2021-10-01 11:10:12 -04:00
Alvaro Herrera	ff9f111bce	Fix WAL replay in presence of an incomplete record Physical replication always ships WAL segment files to replicas once they are complete. This is a problem if one WAL record is split across a segment boundary and the primary server crashes before writing down the segment with the next portion of the WAL record: WAL writing after crash recovery would happily resume at the point where the broken record started, overwriting that record ... but any standby or backup may have already received a copy of that segment, and they are not rewinding. This causes standbys to stop following the primary after the latter crashes: LOG: invalid contrecord length 7262 at A8/D9FFFBC8 because the standby is still trying to read the continuation record (contrecord) for the original long WAL record, but it is not there and it will never be. A workaround is to stop the replica, delete the WAL file, and restart it -- at which point a fresh copy is brought over from the primary. But that's pretty labor intensive, and I bet many users would just give up and re-clone the standby instead. A fix for this problem was already attempted in commit `515e3d84a0`, but it only addressed the case for the scenario of WAL archiving, so streaming replication would still be a problem (as well as other things such as taking a filesystem-level backup while the server is down after having crashed), and it had performance scalability problems too; so it had to be reverted. This commit fixes the problem using an approach suggested by Andres Freund, whereby the initial portion(s) of the split-up WAL record are kept, and a special type of WAL record is written where the contrecord was lost, so that WAL replay in the replica knows to skip the broken parts. With this approach, we can continue to stream/archive segment files as soon as they are complete, and replay of the broken records will proceed across the crash point without a hitch. 
Because a new type of WAL record is added, users should be careful to upgrade standbys first, primaries later. Otherwise they risk the standby being unable to start if the primary happens to write such a record. A new TAP test that exercises this is added, but the portability of it is yet to be seen. This has been wrong since the introduction of physical replication, so backpatch all the way back. In stable branches, keep the new XLogReaderState members at the end of the struct, to avoid an ABI break. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Nathan Bossart <bossartn@amazon.com> Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql	2021-09-29 11:21:51 -03:00
Peter Geoghegan	895267a326	Remove unneeded nbtree latestRemovedXid comments. Discussing the low level issue of nbtree VACUUM and recovery conflicts in btvacuumpage() now seems inappropriate. The same issue is discussed in nbtxlog.h, as well as in a comment block above _bt_delitems_vacuum(). The comment block made more sense when it was part of a broader discussion of nbtree VACUUM "pin scans". These were removed by commit `9f83468b`.	2021-09-26 20:25:14 -07:00
Peter Geoghegan	ce2a860533	Update obsolete nbtree deletion comments. _bt_delitems_delete() is no longer the high-level entry point used by index tuple deletion driven by index tuples whose LP_DEAD bits are set (now called "simple index tuple deletion"). It became a lower level routine that's only called by _bt_delitems_delete_check() following commit `d168b66682`.	2021-09-25 15:05:56 -07:00
Peter Geoghegan	c1a47dfe2e	vacuumlazy.c: Remove obsolete 'onecall' comment. Remove obsolete reference to lazy_vacuum()'s onecall argument. The function argument was removed by commit `3499df0dee`. Also remove adjoining comment block that introduces the wraparound failsafe concept. Talking about the failsafe here no longer makes sense, since lazy_vacuum() (and related functions) are no longer the only place where the failsafe might be triggered. This has been the case since commit `c242baa4a8` taught VACUUM to consider triggering the failsafe mechanism during its initial heap scan.	2021-09-25 10:22:53 -07:00
Peter Geoghegan	48064a8d33	nbtree README: Add note about latestRemovedXid. Point out that index tuple deletion generally needs a latestRemovedXid value for the deletion operation's WAL record. This is bound to be the most expensive part of the whole deletion operation now that it takes place up front, during original execution. This was arguably an oversight in commit `558a9165e0`, which moved the work required to generate these values from index deletion REDO routines to original execution of index deletion operations.	2021-09-24 13:53:48 -07:00
Peter Geoghegan	c7aeb775df	Document issue with heapam line pointer truncation. Checking that an offset number isn't past the end of a heap page's line pointer array was just a defensive sanity check for HOT-chain traversal code before commit `3c3b8a4b`. It's strictly necessary now, though. Add comments that reference the issue to code in heapam that needs to get it right. Per suggestion from Alexander Lakhin. Discussion: https://postgr.es/m/f76a292c-9170-1aef-91a0-59d9443b99a3@gmail.com	2021-09-22 19:21:36 -07:00
Peter Geoghegan	dd94c2852e	Fix "single value strategy" index deletion issue. It is not appropriate for deduplication to apply single value strategy when triggered by a bottom-up index deletion pass. This wastes cycles because later bottom-up deletion passes will overinterpret older duplicate tuples that deduplication actually just skipped over "by design". It also makes bottom-up deletion much less effective for low cardinality indexes that happen to cross a meaningless "index has single key value per leaf page" threshold. To fix, slightly narrow the conditions under which deduplication's single value strategy is considered. We already avoided the strategy for a unique index, since our high level goal must just be to buy time for VACUUM to run (not to buy space). We'll now also avoid it when we just had a bottom-up pass that reported failure. The two cases share the same high level goal, and already overlapped significantly, so this approach is quite natural. Oversight in commit `d168b666`, which added bottom-up index deletion. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WznaOvM+Gyj-JQ0X=JxoMDxctDTYjiEuETdAGbF5EUc3MA@mail.gmail.com Backpatch: 14-, where bottom-up deletion was introduced.	2021-09-21 18:57:32 -07:00
Alvaro Herrera	ade24dab97	Document XLOG_INCLUDE_XID a little better I noticed that commit `0bead9af48` left this flag undocumented in XLogSetRecordFlags, which led me to discover that the flag doesn't actually do what the one comment on it said it does. Improve the situation by adding some more comments. Backpatch to 14, where the aforementioned commit appears. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/202109212119.c3nhfp64t2ql@alvherre.pgsql	2021-09-21 19:47:53 -03:00
Peter Geoghegan	5e6716cde5	Remove overzealous index deletion assertion. A broken HOT chain is not an unexpected condition, even when the offset number points past the end of the page's line pointer array. heap_prune_chain() does not (and never has) treated this condition as unexpected, so derivative code in heap_index_delete_tuples() shouldn't do so either. Oversight in commit `4228817449`. The assertion can probably only fail on Postgres 14 and master. Earlier releases don't have commit `3c3b8a4b`, which taught VACUUM to truncate the line pointer array of heap pages. Backpatch all the same, just to be consistent. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/17197-9438f31f46705182@postgresql.org Backpatch: 12-, just like commit `4228817449`.	2021-09-20 14:26:25 -07:00
Tom Lane	2e4eae87d0	Send NOTIFY signals during CommitTransaction. Formerly, we sent signals for outgoing NOTIFY messages within ProcessCompletedNotifies, which was also responsible for sending relevant ones of those messages to our connected client. It therefore had to run during the main-loop processing that occurs just before going idle. This arrangement had two big disadvantages: * Now that procedures allow intra-command COMMITs, it would be useful to send NOTIFYs to other sessions immediately at COMMIT (though, for reasons of wire-protocol stability, we still shouldn't forward them to our client until end of command). * Background processes such as replication workers would not send NOTIFYs at all, since they never execute the client communication loop. We've had requests to allow triggers running in replication workers to send NOTIFYs, so that's a problem. To fix these things, move transmission of outgoing NOTIFY signals into AtCommit_Notify, where it will happen during CommitTransaction. Also move the possible call of asyncQueueAdvanceTail there, to ensure we don't bloat the async SLRU if a background worker sends many NOTIFYs with no one listening. We can also drop the call of asyncQueueReadAllNotifications, allowing ProcessCompletedNotifies to go away entirely. That's because commit `790026972` added a call of ProcessNotifyInterrupt adjacent to PostgresMain's call of ProcessCompletedNotifies, and that does its own call of asyncQueueReadAllNotifications, meaning that we were uselessly doing two such calls (inside two separate transactions) whenever inbound notify signals coincided with an outbound notify. We need only set notifyInterruptPending to ensure that ProcessNotifyInterrupt runs, and we're done. The existing documentation suggests that custom background workers should call ProcessCompletedNotifies if they want to send NOTIFY messages. To avoid an ABI break in the back branches, reduce it to an empty routine rather than removing it entirely. Removal will occur in v15. Although the problems mentioned above have existed for a while, I don't feel comfortable back-patching this any further than v13. There was quite a bit of churn in adjacent code between 12 and 13. At minimum we'd have to also backpatch `51004c717`, and a good deal of other adjustment would also be needed, so the benefit-to-risk ratio doesn't look attractive. Per bug #15293 from Michael Powers (and similar gripes from others). Artur Zakirov and Tom Lane Discussion: https://postgr.es/m/153243441449.1404.2274116228506175596@wrigleys.postgresql.org	2021-09-14 17:18:25 -04:00
Michael Paquier	fd0625c7a9	Clean up some code using "(expr) ? true : false" All the code paths simplified here were already using a boolean or used an expression that led to zero or one, making the extra bits unnecessary. Author: Justin Pryzby Reviewed-by: Tom Lane, Michael Paquier, Peter Smith Discussion: https://postgr.es/m/20210428182936.GE27406@telsasoft.com	2021-09-08 09:44:04 +09:00
Tom Lane	b30cc0fd6d	Further portability tweaks for float4/float8 hash functions. Attempting to make hashfloat4() look as much as possible like hashfloat8(), I'd figured I could replace NaNs with get_float4_nan() before widening to float8. However, results from protosciurus and topminnow show that on some platforms that produces a different bit-pattern from get_float8_nan(), breaking the intent of `ce773f230`. Rearrange so that we use the result of get_float8_nan() for all NaN cases. As before, back-patch.	2021-09-04 16:29:08 -04:00
Alvaro Herrera	96b665083e	Revert "Avoid creating archive status ".ready" files too early" This reverts commit `515e3d84a0` and equivalent commits in back branches. This solution to the problem has a number of problems, so we'll try again with a different approach. Per note from Andres Freund Discussion: https://postgr.es/m/20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de	2021-09-04 12:14:30 -04:00
Tom Lane	ce773f230d	Fix float4/float8 hash functions to produce uniform results for NaNs. The IEEE 754 standard allows a wide variety of bit patterns for NaNs, of which at least two ("NaN" and "-NaN") are pretty easy to produce from SQL on most machines. This is problematic because our btree comparison functions deem all NaNs to be equal, but our float hash functions know nothing about NaNs and will happily produce varying hash codes for them. That causes unexpected results from queries that hash a column containing different NaN values. It could also produce unexpected lookup failures when using a hash index on a float column, i.e. "WHERE x = 'NaN'" will not find all the rows it should. To fix, special-case NaN in the float hash functions, not too much unlike the existing special case that forces zero and minus zero to hash the same. I arranged for the most vanilla sort of NaN (that coming from the C99 NAN constant) to still have the same hash code as before, to reduce the risk to existing hash indexes. I dithered about whether to back-patch this into stable branches, but ultimately decided to do so. It's a clear improvement for queries that hash internally. If there is anybody who has -NaN in a hash index, they'd be well advised to re-index after applying this patch ... but the misbehavior if they don't will not be much worse than the misbehavior they had before. Per bug #17172 from Ma Liangzhu. Discussion: https://postgr.es/m/17172-7505bea9e04e230f@postgresql.org	2021-09-02 17:24:41 -04:00
Peter Eisentraut	590ecd9823	Fix incorrect format placeholders	2021-09-01 10:49:13 +02:00
Peter Geoghegan	b175b9cde7	VACUUM VERBOSE: Don't report "pages removed". It doesn't make any sense to report this information, since VACUUM VERBOSE reports on heap relation truncation directly. This was an oversight in commit `7ab96cf6`, which made VACUUM VERBOSE output a little more consistent with nearby autovacuum-specific log output. Adjust comments that describe how this is supposed to work in passing. Also bring truncation-related VACUUM VERBOSE output in line with the convention established for VACUUM VERBOSE output by commit `f4f4a649`. Author: Peter Geoghegan <pg@bowt.ie> Backpatch: 14-, where VACUUM VERBOSE's output changed.	2021-08-31 20:37:18 -07:00
Peter Geoghegan	0f6aa893cb	Remove obsolete nbtree relation extension comment. Commit `0d1fe9f7` improved the approach that vacuumlazy.c takes when it encounters an empty heap page. It no longer acquires the relation extension lock.	2021-08-31 16:55:39 -07:00
Peter Geoghegan	6320806ac3	vacuumlazy.c: Correct prune state comment. Oversight in commit `7ab96cf6b3`.	2021-08-31 16:35:01 -07:00
Peter Geoghegan	47029f775a	Remove unneeded old_rel_pages VACUUM state field. The field hasn't been used since commit `3d351d91`, which redefined pg_class.reltuples to be -1 before the first VACUUM or ANALYZE. Also rename a local variable of the same name ("old_rel_pages"). This is used by relation truncation to represent the original relation size at the start of the ongoing VACUUM operation. Rename it to orig_rel_pages, since that's a lot clearer. (This name matches similar nearby code.)	2021-08-31 14:59:52 -07:00
Alvaro Herrera	961dd75657	Report tuple address in data-corruption error message Most data-corruption reports mention the location of the problem, but this one failed to. Add it. Backpatch all the way back. In 12 and older, also assign the ERRCODE_DATA_CORRUPTED error code as was done in commit `fd6ec93bf8` for 13 and later. Discussion: https://postgr.es/m/202108191637.oqyzrdtnheir@alvherre.pgsql	2021-08-30 16:29:12 -04:00
Tom Lane	3778bcb39a	Count SP-GiST index scans in pg_stat statistics. Somehow, spgist overlooked the need to call pgstat_count_index_scan(). Hence, pg_stat_all_indexes.idx_scan and equivalent columns never became nonzero for an SP-GiST index, although the related per-tuple counters worked fine. This fix works a bit differently from other index AMs, in that the counter increment occurs in spgrescan not spggettuple/spggetbitmap. It looks like this won't make the user-visible semantics noticeably different, so I won't go to the trouble of introducing an is-this-the-first-call flag just to make the counter bumps happen in the same places. Per bug #17163 from Christian Quest. Back-patch to all supported versions. Discussion: https://postgr.es/m/17163-b8c5cc88322a5e92@postgresql.org	2021-08-27 19:53:05 -04:00
Peter Geoghegan	bda822554b	track_io_timing logging: Don't special case 0 ms. Adjust track_io_timing related logging code added by commit `94d13d474d`. Make it consistent with other nearby autovacuum and autoanalyze logging code by removing logic that suppressed zero millisecond outputs. log_autovacuum_min_duration log output now reliably shows "read:" and "write:" millisecond-based values in its report (when track_io_timing is enabled). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Stephen Frost <sfrost@snowman.net> Discussion: https://postgr.es/m/CAH2-WznW0FNxSVQMSRazAMYNfZ6DR_gr5WE78hc6E1CBkkJpzw@mail.gmail.com Backpatch: 14-, where the track_io_timing logging was introduced.	2021-08-27 13:34:00 -07:00
Peter Geoghegan	fdfbfa24fa	Reorder log_autovacuum_min_duration log output. This order seems more natural. It starts with details that are particular to heap and index data structures, and ends with system-level costs incurred during the autovacuum worker's VACUUM/ANALYZE operation. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkzxK6ahA9xxsOftRtBX_R0swuHZsvo4QUbak1Bz7hb7Q@mail.gmail.com Backpatch: 14-, which enhanced the log output in various ways.	2021-08-27 13:08:41 -07:00
Peter Geoghegan	de5dcb0796	vacuumlazy.c: Remove unnecessary parentheses. This was arguably a minor oversight in commit `b4af70cb`, which cleaned up the function signatures of functions that modify IndexBulkDeleteResult variables.	2021-08-27 09:47:16 -07:00
Robert Haas	a780b2fcce	Fix broken snapshot handling in parallel workers. Pengchengliu reported an assertion failure in a parallel worker while performing a parallel scan using an overflowed snapshot. The proximate cause is that TransactionXmin was set to an incorrect value. The underlying cause is incorrect snapshot handling in parallel.c. In particular, InitializeParallelDSM() was unconditionally calling GetTransactionSnapshot(), because I (rhaas) mistakenly thought that was always retrieving an existing snapshot whereas, at isolation levels less than REPEATABLE READ, it's actually taking a new one. So instead do this only at higher isolation levels where there actually is a single snapshot for the whole transaction. By itself, this is not a sufficient fix, because we still need to guarantee that TransactionXmin gets set properly in the workers. The easiest way to do that seems to be to install the leader's active snapshot as the transaction snapshot if the leader did not serialize a transaction snapshot. This doesn't affect the results of future GetTransactionSnapshot() calls since those have to take a new snapshot anyway; what we care about is the side effect of setting TransactionXmin. Report by Pengchengliu. Patch by Greg Nancarrow, except for some comment text which I supplied. Discussion: https://postgr.es/m/002f01d748ac$eaa781a0$bff684e0$@tju.edu.cn	2021-08-25 08:32:04 -04:00
Alvaro Herrera	515e3d84a0	Avoid creating archive status ".ready" files too early WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. This has always been wrong, so backpatch all the way back. Author: Nathan Bossart <bossartn@amazon.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com	2021-08-23 15:50:35 -04:00
Alvaro Herrera	6f8127b739	Revert analyze support for partitioned tables This reverts the following commits: `1b5617eb84` Describe (auto-)analyze behavior for partitioned tables `0e69f705cc` Set pg_class.reltuples for partitioned tables `41badeaba8` Document ANALYZE storage parameters for partitioned tables `0827e8af70` autovacuum: handle analyze for partitioned tables There are efficiency issues in this code when handling databases with large numbers of partitions, and it doesn't look like there is any trivial way to handle those. There are some other issues as well. It's now too late in the cycle for nontrivial fixes, so we'll have to let Postgres 14 users continue to manually ANALYZE their partitioned tables, and hopefully we can fix the issues for Postgres 15. I kept [most of] `be280cdad2` ("Don't reset relhasindex for partitioned tables on ANALYZE") because while we added it due to `0827e8af70`, it is a good bugfix in its own right, since it affects manual analyze as well as autovacuum-induced analyze, and there's no reason to revert it. I retained the addition of relkind 'p' to tables included by pg_stat_user_tables, because reverting that would require a catversion bump. Also, in pg14 only, I keep a struct member that was added to PgStat_TabStatEntry to avoid breaking compatibility with existing stat files. Backpatch to 14. Discussion: https://postgr.es/m/20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de	2021-08-16 17:27:52 -04:00
Daniel Gustafsson	069d33d0c5	Emit namespace in the post-copy errmsg During a VACUUM or CLUSTER command, the initial output emits a fully qualified relation path with namespace. The post-action errmsg only emitted the relation name however, which may lead to hard to parse output when using multiple jobs with vacuumdb as the output from different jobs may be interleaved. Include the full path in the post-action errmsg to be consistent with the initial errmsg. Author: Mike Fiedler <miketheman@gmail.com> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/CAMerE0oz+8G-aORZL_BJcPxnBqewZAvND4bSUysjz+r-oT1BxQ@mail.gmail.com	2021-08-16 20:06:54 +02:00
Michael Paquier	e4ba1005c0	Refresh apply delay on reload of recovery_min_apply_delay at recovery This commit ensures that the wait interval in the replay delay loop waiting for an amount of time defined by recovery_min_apply_delay is correctly handled on reload, recalculating the delay if this GUC value is updated, based on the timestamp of the commit record being replayed. The previous behavior would be problematic for example with replay still waiting even if the delay got reduced or just cancelled. If the apply delay was increased to a larger value, the wait would have just respected the old value set, finishing earlier. Author: Soumyadeep Chakraborty, Ashwin Agrawal Reviewed-by: Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/CAE-ML+93zfr-HLN8OuxF0BjpWJ17O5dv1eMvSE5jsj9jpnAXZA@mail.gmail.com Backpatch-through: 9.6	2021-08-16 12:10:22 +09:00
John Naylor	b05f7ecec4	Fix grammar mistake in hash index README Dilip Kumar Discussion: https://www.postgresql.org/message-id/CAFiTN-tjZbuY6vy7kZZ6xO%2BD4mVcO5wOPB5KiwJ3AHhpytd8fg%40mail.gmail.com	2021-08-12 08:53:41 -04:00
Michael Paquier	710796f054	Avoid unnecessary shared invalidations in ROLLBACK PREPARED The performance gain is minimal, but this makes the logic more consistent with AtEOXact_Inval(). No other invalidation is needed in this case as PREPARE already takes care of sending any local ones. Author: Liu Huailing Reviewed-by: Tom Lane, Michael Paquier Discussion: https://postgr.es/m/OSZPR01MB6215AA84D71EF2B3D354CF86BE139@OSZPR01MB6215.jpnprd01.prod.outlook.com	2021-08-12 20:12:47 +09:00
Peter Eisentraut	ae03a7c739	Remove some unnecessary casts in format arguments We can use %zd or %zu directly, no need to cast to int. Conversely, some code was casting away from int when it could be using %d directly.	2021-08-08 22:08:07 +02:00
Peter Eisentraut	f4f4a649d8	Message style improvements	2021-08-07 12:09:37 +02:00
Andres Freund	fa91d4c91f	Make parallel worker shutdown complete entirely via before_shmem_exit(). This is a step toward storing stats in dynamic shared memory. As dynamic shared memory segments are detached from just after before_shmem_exit() callbacks are processed, but before on_shmem_exit() callbacks are, no stats can be collected after before_shmem_exit() callbacks have been processed. Parallel worker shutdown can cause stats to be emitted during DSM detach callbacks, e.g. for SharedFileSet (which closes its files, which can cause fd.c to emit stats about temporary files). Therefore parallel worker shutdown needs to complete during the processing of before_shmem_exit callbacks. One might think this problem could instead be solved by carefully ordering the attaching to DSM segments, so that the pgstats segments get detached from later than the parallel query ones. That turns out to not work because the stats hash might need to grow which can cause new segments to be allocated, which then will be detached from earlier. There are two code changes: First, call ParallelWorkerShutdown() via before_shmem_exit. That's a good idea on its own, because other shutdown callbacks like ShutdownPostgres and ShutdownAuxiliaryProcess are called via before_*. Second, explicitly detach from the parallel query DSM segment, thereby ensuring all stats are emitted during ParallelWorkerShutdown(). There are nicer solutions to these problems, but it's not obvious which of those solutions is the correct one. As the shared memory stats work already is a huge amount of work... Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20210405092914.mmxqe7j56lsjfsej@alap3.anarazel.de Discussion: https://postgr.es/m/20210803023612.iziacxk5syn2r4ut@alap3.anarazel.de	2021-08-06 19:08:56 -07:00
Andres Freund	1bc8e7b099	pgstat: split reporting/fetching of bgwriter and checkpointer stats. These have been unrelated since bgwriter and checkpointer were split into two processes in `806a2aee37`. As there are several pending patches (shared memory stats, extending the set of tracked IO / buffer statistics) that are made a bit more awkward by the grouping, split them. Done separately to make reviewing easier. This does not change the contents of pg_stat_bgwriter or move fields out of bgwriter/checkpointer stats that arguably do not belong in either. However pgstat_fetch_global() was renamed and split into pgstat_fetch_stat_checkpointer() and pgstat_fetch_stat_bgwriter(). Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20210405092914.mmxqe7j56lsjfsej@alap3.anarazel.de	2021-08-04 19:16:04 -07:00
Peter Geoghegan	cc8033e1da	Make vacuum_index_cleanup reloption RELOPT_TYPE_ENUM. Oversight in commit `3499df0d`, which generalized the reloption as a way of giving users a way to consistently avoid VACUUM's index bypass optimization. Per off-list report from Nikolay Shaplov. Backpatch: 14-, where index cleanup reloption was extended.	2021-08-03 21:53:41 -07:00
Thomas Munro	8f7c8e2bef	Further simplify a bit of logic in StartupXLOG(). Commit `7ff23c6d27` left us with two identical cases. Collapse them. Author: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com	2021-08-03 14:16:58 +12:00
Thomas Munro	7ff23c6d27	Run checkpointer and bgwriter in crash recovery. Start up the checkpointer and bgwriter during crash recovery (except in --single mode), as we do for replication. This wasn't done back in commit `cdd46c76` out of caution. Now it seems like a better idea to make the environment as similar as possible in both cases. There may also be some performance advantages. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Discussion: https://postgr.es/m/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com	2021-08-02 17:32:44 +12:00
Heikki Linnakangas	317632f307	Move InRecovery and standbyState global vars to xlogutils.c. They are used in code that runs both during normal operation and during WAL replay, and needs to behave differently during replay. Move them to xlogutils.c, because that's where we have other helper functions used by redo routines. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi	2021-07-31 09:50:26 +03:00
Heikki Linnakangas	4fe8dcdff3	Extract code to describe recovery stop reason to a function. StartupXLOG() is very long, this makes it a little bit more readable. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi	2021-07-31 09:49:30 +03:00
Heikki Linnakangas	6b16532811	Remove unnecessary 'restoredFromArchive' global variable. It might've been useful for debugging purposes, but meh. There's 'readSource' which does almost the same thing. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi	2021-07-31 09:38:32 +03:00
Heikki Linnakangas	e9f5a0681c	Don't use O_SYNC or similar when opening signal file to fsync it. No need to use get_sync_bit() when we're calling pg_fsync() on the file. We're not writing to the files, so it doesn't make any difference in practice, but seems less surprising this way. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi	2021-07-31 09:36:11 +03:00
Robert Haas	1d919de5eb	Remove unnecessary call to ReadCheckpointRecord(). It should always be the case that the last checkpoint record is still readable, because otherwise, a crash would leave us in a situation from which we can't recover. Therefore the test removed by this patch should always succeed. For it to fail, either there has to be a serious bug in the code someplace, or the user has to be manually modifying pg_wal while crash recovery is running. If it's the first one, we should fix the bug. If it's the second one, they should stop, or anyway they're doing so at their own risk. In neither case does a full checkpoint instead of an end-of-recovery record seem like a clear winner. Furthermore, rarely-taken code paths are particularly vulnerable to bugs, so let's simplify by getting rid of this one. Discussion: http://postgr.es/m/CA+TgmoYmw==TOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA@mail.gmail.com	2021-07-30 08:35:13 -04:00
Heikki Linnakangas	df9f0c716c	Update obsolete comment that still referred to CheckpointLock CheckpointLock was removed in commit `d18e75664a`, and commit `ce197e91d0` updated a leftover comment in CreateCheckPoint, but there was another copy of it in CreateRestartPoint still.	2021-07-30 12:52:44 +03:00
Alvaro Herrera	ce197e91d0	Close yet another race condition in replication slot test code Buildfarm shows that this test has a further failure mode when a checkpoint starts earlier than expected, so we detect a "checkpoint completed" line that's not the one we want. Change the config to try and prevent this. Per buildfarm While at it, update one comment that was forgotten in commit `d18e75664a`. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20210729.162038.534808353849568395.horikyota.ntt@gmail.com	2021-07-29 17:09:06 -04:00
Fujii Masao	a00c138b78	Update minimum recovery point on truncation during WAL replay of abort record. If a file is truncated, we must update minRecoveryPoint. Once a file is truncated, there's no going back; it would not be safe to stop recovery at a point earlier than that anymore. Commit `7bffc9b7bf` changed xact_redo_commit() so that it updates minRecoveryPoint on truncation, but forgot to change xact_redo_abort(). Back-patch to all supported versions. Reported-by: mengjuan.cmj@alibaba-inc.com Author: Fujii Masao Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/b029fce3-4fac-4265-968e-16f36ff4d075.mengjuan.cmj@alibaba-inc.com	2021-07-29 01:31:41 +09:00
Fujii Masao	7fcf2faf9c	Make XLOG_FPI_FOR_HINT records honor full_page_writes setting. Commit `2c03216d83` changed XLOG_FPI_FOR_HINT records so that they always included full-page images even when full_page_writes was disabled. However, in this setting, they don't need to do that because hint bit updates don't need to be protected from torn writes. Therefore, this commit makes XLOG_FPI_FOR_HINT records honor full_page_writes setting. That is, XLOG_FPI_FOR_HINT records may include no full-page images if full_page_writes is disabled, and WAL replay of them does nothing. Reported-by: Zhang Wenjie Author: Kyotaro Horiguchi Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/tencent_60F11973A111EED97A8596FFECC4A91ED405@qq.com	2021-07-21 11:19:00 +09:00
Alvaro Herrera	ead9e51e82	Advance old-segment horizon properly after slot invalidation When some slots are invalidated due to the max_slot_wal_keep_size limit, the old segment horizon should move forward to stay within the limit. However, in commit `c655077639` we forgot to call KeepLogSeg again to recompute the horizon after invalidating replication slots. In cases where other slots remained, the limits would be recomputed eventually for other reasons, but if all slots were invalidated, the limits would not move at all afterwards. Repair. Backpatch to 13 where the feature was introduced. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Marcin Krupowicz <mk@071.ovh> Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org	2021-07-16 12:07:30 -04:00
Tom Lane	a49d081235	Replace explicit PIN entries in pg_depend with an OID range test. As of v14, pg_depend contains almost 7000 "pin" entries recording the OIDs of built-in objects. This is a fair amount of bloat for every database, and it adds time to pg_depend lookups as well as initdb. We can get rid of all of those entries in favor of an OID range check, i.e. "OIDs below FirstUnpinnedObjectId are pinned". (template1 and the public schema are exceptions. Those exceptions are now wired into IsPinnedObject() instead of initdb's code for filling pg_depend, but it's the same amount of cruft either way.) The contents of pg_shdepend are modified likewise. Discussion: https://postgr.es/m/3737988.1618451008@sss.pgh.pa.us	2021-07-15 11:41:47 -04:00
Amit Kapila	a8fd13cab0	Add support for prepared transactions to built-in logical replication. To add support for streaming transactions at prepare time into the built-in logical replication, we need to do the following things: * Modify the output plugin (pgoutput) to implement the new two-phase API callbacks, by leveraging the extended replication protocol. * Modify the replication apply worker, to properly handle two-phase transactions by replaying them on prepare. * Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase transactions. We enable the two_phase once the initial data sync is over. We however must explicitly disable replication of two-phase transactions during replication slot creation, even if the plugin supports it. We don't need to replicate the changes accumulated during this phase, and moreover, we don't have a replication connection open so we don't know where to send the data anyway. The streaming option is not allowed with this new two_phase option. This can be done as a separate patch. We don't allow toggling the two_phase option of a subscription because it can lead to an inconsistent replica. For the same reason, we don't allow refreshing the publication once two_phase is enabled for a subscription unless the copy_data option is false. Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi, Greg Nancarrow Tested-By: Haiying Tang Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com	2021-07-14 07:33:50 +05:30
Tom Lane	f10f0ae420	Replace RelationOpenSmgr() with RelationGetSmgr(). The idea behind this patch is to design out bugs like the one fixed by commit `9d523119f`. Previously, once one did RelationOpenSmgr(rel), it was considered okay to access rel->rd_smgr directly for some not-very-clear interval. But since that pointer will be cleared by relcache flushes, we had bugs arising from overreliance on a previous RelationOpenSmgr call still being effective. Now, very little code except that in rel.h and relcache.c should ever touch the rd_smgr field directly. The normal coding rule is to use RelationGetSmgr(rel) and not expect the result to be valid for longer than one smgr function call. There are a couple of places where using the function every single time seemed like overkill, but they are now annotated with large warning comments. Amul Sul, after an idea of mine. Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com	2021-07-12 17:01:36 -04:00
Heikki Linnakangas	4c64b51dc5	Remove dead assignment to local variable. This should have been removed in commit `7e30c186da`, which split the loop into two. Only the first loop uses the 'from' variable; updating it in the second loop is bogus. It was never read after the first loop, so this was harmless and surely optimized away by the compiler, but let's be tidy. Backpatch to all supported versions. Author: Ranier Vilela Discussion: https://www.postgresql.org/message-id/CAEudQAoWq%2BAL3BnELHu7gms2GN07k-np6yLbukGaxJ1vY-zeiQ%40mail.gmail.com	2021-07-12 11:13:33 +03:00
Michael Paquier	0f80b47d24	Add forgotten LSN_FORMAT_ARGS() in xlogreader.c These should have been part of `4035cd5`, that introduced LZ4 support for wal_compression.	2021-07-09 15:27:36 +09:00
Michael Paquier	2aca19f298	Use WaitLatch() instead of pg_usleep() at the end of backups This concerns pg_stop_backup() and BASE_BACKUP, when waiting for the WAL segments required for a backup to be archived. This simplifies a bit the handling of the wait event used in this code path. Author: Bharath Rupireddy Reviewed-by: Michael Paquier, Stephen Frost Discussion: https://postgr.es/m/CALj2ACU4AdPCq6NLfcA-ZGwX7pPCK5FgEj-CAU0xCKzkASSy_A@mail.gmail.com	2021-07-06 08:10:59 +09:00
Peter Eisentraut	6bd3ec62d9	Use InvalidBucket instead of -1 where appropriate Reported-by: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAEudQAp%3DZwKjrP4L%2BCzqV7SmWiaQidPPRqj4tqdjDG4KBx5yrg%40mail.gmail.com	2021-07-02 11:59:55 +02:00
Michael Paquier	70685385d7	Use WaitLatch() instead of pg_usleep() at end-of-vacuum truncation This has the advantage of making a process more responsive when the postmaster dies, even if the wait time was rather limited as there was only a 50ms timeout here. Another advantage of this change is for monitoring, as we gain a new wait event for the end-of-vacuum truncation. Author: Bharath Rupireddy Reviewed-by: Aleksander Alekseev, Thomas Munro, Michael Paquier Discussion: https://postgr.es/m/CALj2ACU4AdPCq6NLfcA-ZGwX7pPCK5FgEj-CAU0xCKzkASSy_A@mail.gmail.com	2021-07-02 12:58:34 +09:00
Michael Paquier	17707c059c	Fix incorrect PITR message for transaction ROLLBACK PREPARED Reaching PITR on such a transaction would cause the generation of a LOG message mentioning a transaction committed, not aborted. Oversight in `4f1b890`. Author: Simon Riggs Discussion: https://postgr.es/m/CANbhV-GJ6KijeCgdOrxqMCQ+C8QiK657EMhCy4csjrPcEUFv_Q@mail.gmail.com Backpatch-through: 9.6	2021-06-30 11:48:53 +09:00
Michael Paquier	47f514dd9a	Fix compilation warning in xloginsert.c This is reproducible with gcc using at least -O0. The last checks validating the compression of a block could not be reached with this variable not set, but let's be clean. Oversight in `4035cd5`, per buildfarm member lapwing.	2021-06-29 11:57:18 +09:00
Michael Paquier	4035cd5d4e	Add support for LZ4 with compression of full-page writes in WAL The logic is implemented so that there can be a choice in the compression used when building a WAL record, and an extra per-record bit is used to track whether a block is compressed with PGLZ, LZ4 or nothing. wal_compression, the existing parameter, is changed to an enum with support for the following backward-compatible values: - "off", the default, to not use compression. - "pglz" or "on", to compress FPWs with PGLZ. - "lz4", the new mode, to compress FPWs with LZ4. Benchmarking has shown that LZ4 easily outclasses PGLZ. ZSTD would also be an interesting choice, but going just with LZ4 for now makes the patch minimalistic as toast compression is already able to use LZ4, so there is no need to worry about any build-related needs for this implementation. Author: Andrey Borodin, Justin Pryzby Reviewed-by: Dilip Kumar, Michael Paquier Discussion: https://postgr.es/m/3037310D-ECB7-4BF1-AF20-01C10BB33A33@yandex-team.ru	2021-06-29 11:17:55 +09:00
Noah Misch	cc2c7d65fc	Skip WAL recycling and preallocation during archive recovery. The previous commit addressed the chief consequences of a race condition between InstallXLogFileSegment() and KeepFileRestoredFromArchive(). Fix three lesser consequences. A spurious durable_rename_excl() LOG message remained possible. KeepFileRestoredFromArchive() wasted the proceeds of WAL recycling and preallocation. Finally, XLogFileInitInternal() could return a descriptor for a file that KeepFileRestoredFromArchive() had already unlinked. That felt like a recipe for future bugs. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com	2021-06-28 18:34:56 -07:00
Noah Misch	2b3e4672f7	Don't ERROR on PreallocXlogFiles() race condition. Before a restartpoint finishes PreallocXlogFiles(), a startup process KeepFileRestoredFromArchive() call can unlink the preallocated segment. If a CHECKPOINT sql command had elicited the restartpoint experiencing the race condition, that sql command failed. Moreover, the restartpoint omitted its log_checkpoints message and some inessential resource reclamation. Prevent the ERROR by skipping open() of the segment. Since these consequences are so minor, no back-patch. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com	2021-06-28 18:34:56 -07:00
Noah Misch	421484f79c	Remove XLogFileInit() ability to unlink a pre-existing file. Only initdb used it. initdb refuses to operate on a non-empty directory and generally does not cope with pre-existing files of other kinds. Hence, use the opportunity to simplify. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com	2021-06-28 18:34:56 -07:00
Noah Misch	85656bc305	In XLogFileInit(), fix *use_existent postcondition to suit callers. Infrequently, the mismatch caused log_checkpoints messages and TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high by one. Since that consequence is so minor, no back-patch. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com	2021-06-28 18:34:55 -07:00
Noah Misch	c53c6b98d3	Remove XLogFileInit() ability to skip ControlFileLock. Cold paths, initdb and end-of-recovery, used it. Don't optimize them. Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com	2021-06-28 18:34:55 -07:00
Andrew Dunstan	e1c1c30f63	Pre branch pgindent / pgperltidy run Along the way make a slight adjustment to src/include/utils/queryjumble.h to avoid an unused typedef.	2021-06-28 11:05:54 -04:00
Peter Eisentraut	c31833779d	Message style improvements	2021-06-28 08:36:44 +02:00
Amit Kapila	b786304c29	Fix race condition in TransactionGroupUpdateXidStatus(). When we cannot immediately acquire XactSLRULock in exclusive mode at commit time, we add ourselves to a list of processes that need their XIDs status update. We do this if the clog page where we need to update the current transaction status is the same as the group leader's clog page, otherwise, we allow the caller to clear it by itself. Now, when we can't add ourselves to any group, we were not clearing the current proc if it has already become a member of some group which was leading to an assertion failure when the same proc was assigned to another backend after the current backend exits. Reported-by: Alexander Lakhin Bug: 17072 Author: Amit Kapila Tested-By: Alexander Lakhin Backpatch-through: 11, where it was introduced Discussion: https://postgr.es/m/17072-2f8764857ef2c92a@postgresql.org	2021-06-28 09:29:38 +05:30
Peter Eisentraut	a60c4c5c1a	Remove redundant variable pageSize in gistinitpage In gistinitpage, pageSize variable looks redundant, instead just pass BLCKSZ. This will be consistent with its peers BloomInitPage, brin_page_init and SpGistInitPage. Author: Bharath Rupireddy <bharath.rupireddy@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CALj2ACWj=V1k5591eeZK2sOg2FYuBUp6azFO8tMkBtGfXf8PMQ@mail.gmail.com	2021-06-25 07:55:34 +02:00
Peter Geoghegan	e8f201ab82	Remove overzealous VACUUM failsafe assertions. The failsafe can trigger when index processing is already disabled. This can happen when VACUUM's INDEX_CLEANUP parameter is "off" and the failsafe happens to trigger. Remove assertions that assume that index processing is directly tied to the failsafe. Oversight in commit `c242baa4`, which made it possible for the failsafe to trigger in a two-pass strategy VACUUM that has yet to make its first call to lazy_vacuum_all_indexes().	2021-06-20 18:14:00 -07:00
Peter Geoghegan	3499df0dee	Support disabling index bypassing by VACUUM. Generalize the INDEX_CLEANUP VACUUM parameter (and the corresponding reloption): make it into a ternary style boolean parameter. It now exposes a third option, "auto". The "auto" option (which is now the default) enables the "bypass index vacuuming" optimization added by commit `1e55e7d1`. "VACUUM (INDEX_CLEANUP TRUE)" is redefined to once again make VACUUM simply do any required index vacuuming, regardless of how few dead tuples are encountered during the first scan of the target heap relation (unless there are exactly zero). This gives users a way of opting out of the "bypass index vacuuming" optimization, if for whatever reason that proves necessary. It is also expected to be used by PostgreSQL developers as a testing option from time to time. "VACUUM (INDEX_CLEANUP FALSE)" does the same thing as it always has: it forcibly disables both index vacuuming and index cleanup. It's not expected to be used much in PostgreSQL 14. The failsafe mechanism added by commit `1e55e7d1` addresses the same problem in a simpler way. INDEX_CLEANUP can now be thought of as a testing and compatibility option. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-By: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/CAH2-WznrBoCST4_Gxh_G9hA8NzGUbeBGnOUC8FcXcrhqsv6OHQ@mail.gmail.com	2021-06-18 20:04:07 -07:00
Heikki Linnakangas	d24c5658a8	Tidy up GetMultiXactIdMembers()'s behavior on error One of the error paths left members uninitialized. That's not a live bug, because most callers don't look at members when the function returns -1, but let's be tidy. One caller, in heap_lock_tuple(), does "if (members != NULL) pfree(members)", but AFAICS it never passes an invalid 'multi' value so it should not reach that error case. The callers are also a bit inconsistent in their expectations. heap_lock_tuple() pfrees the 'members' array if it's not-NULL, others pfree() it if "nmembers >= 0", and others if "nmembers > 0". That's not a live bug either, because the function should never return 0, but add an Assert for that to make it more clear. I left the callers alone for now. I also moved the line where we set *nmembers. It wasn't wrong before, but I like to do that right next to the 'return' statement, to make it clear that it's always set on return. Also remove one unreachable return statement after ereport(ERROR), for brevity and for consistency with the similar if-block right after it. Author: Greg Nancarrow with the additional changes by me Backpatch-through: 9.6, all supported versions	2021-06-17 14:50:42 +03:00
Heikki Linnakangas	d0303bc8d2	Fix outdated comment that talked about seek position of WAL file. Since commit `c24dcd0cfd`, we have been using pg_pread() to read the WAL file, which doesn't change the seek position (unless we fall back to the implementation in src/port/pread.c). Update comment accordingly. Backpatch-through: 12, where we started to use pg_pread()	2021-06-16 12:36:15 +03:00
Peter Geoghegan	958cfbcf2d	Remove unneeded field from VACUUM state. Bugfix commit `5fc89376` effectively made the lock_waiter_detected field from vacuumlazy.c's global state struct into private state owned by lazy_truncate_heap(). Finish this off by replacing the struct field with a local variable.	2021-06-15 08:59:36 -07:00
Michael Paquier	dbab0c07e5	Remove forced toast recompression in VACUUM FULL/CLUSTER The extra checks added by the recompression of toast data introduced in `bbe0a81` are proving to have a performance impact on VACUUM or CLUSTER even if no recompression is done. This is more noticeable with more toastable columns that contain non-NULL values. Improvements could be done to make those extra checks less expensive, but that's not material for 14 at this stage, and we are not sure either if the code path of VACUUM FULL/CLUSTER is adapted for this job. Per discussion with several people, including Andres Freund, Robert Haas, Álvaro Herrera, Tom Lane and myself. Discussion: https://postgr.es/m/20210527003144.xxqppojoiwurc2iz@alap3.anarazel.de	2021-06-14 09:25:50 +09:00
Alvaro Herrera	5cc1cd5028	Report sort phase progress in parallel btree build We were already reporting it, but only after the parallel workers were finished, which is visibly much later than what happens in a serial build. With this change we report it when the leader starts its own sort phase when participating in the build (the normal case). Now this might happen a little later than when the workers start their sorting phases, but a) communicating the actual phase start from workers is likely to be a hassle, and b) the sort phase start is pretty fuzzy anyway, since sorting per se is actually initiated by tuplesort.c internally earlier than tuplesort_performsort() is called. Backpatch to pg12, where the progress reporting code for CREATE INDEX went in. Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Greg Nancarrow <gregn4422@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/1128176d-1eee-55d4-37ca-e63644422adb	2021-06-11 19:07:32 -04:00
David Rowley	55ba5973d9	Fix an assortment of typos in brin_minmax_multi.c and mcv.c Discussion: https://postgr.es/m/CAApHDvrbyJNOPBws4RUhXghZ7+TBjtdO-rznTsqZECuowNorXg@mail.gmail.com	2021-06-10 20:13:44 +12:00
Robert Haas	caba8f0d43	Fix corner case failure of new standby to follow new primary. This only happens if (1) the new standby has no WAL available locally, (2) the new standby is starting from the old timeline, (3) the promotion happened in the WAL segment from which the new standby is starting, (4) the timeline history file for the new timeline is available from the archive but the WAL files for it are not (i.e. this is a race), (5) the WAL files for the new timeline are available via streaming, and (6) recovery_target_timeline='latest'. Commit `ee994272ca` introduced this logic and was an improvement over the previous code, but it mishandled this case. If recovery_target_timeline='latest' and restore_command is set, validateRecoveryParameters() can change recoveryTargetTLI to be different from receiveTLI. If streaming is then tried afterward, expectedTLEs gets initialized with the history of the wrong timeline. It's supposed to be a list of entries explaining how to get to the target timeline, but in this case it ends up with a list of entries explaining how to get to the new standby's original timeline, which isn't right. Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi. Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com	2021-06-09 16:17:00 -04:00
Tomas Vondra	d1f0aa7696	Fix pg_visibility regression failure with CLOBBER_CACHE_ALWAYS Commit `8e03eb92e9` reverted a bit too much code, reintroducing one of the issues fixed by `39b66a91bd` - a page might have been left partially empty after relcache invalidation. Reported-By: Tom Lane Author: Masahiko Sawada Discussion: https://postgr.es/m/822752.1623032114@sss.pgh.pa.us Discussion: https://postgr.es/m/CAD21AoA%3D%3Df2VSw3c-Cp_y%3DWLKHMKc1D6s7g3YWsCOvgaYPpJcg%40mail.gmail.com	2021-06-08 19:33:11 +02:00
David Rowley	8bdb36aab2	Clean up some questionable usages of DatumGet* macros This tidies up some questionable coding which made use of DatumGetPointer() for Datums being passed into functions where the parameter is expected to be a cstring. We saw no compiler warnings with the old code as the Pointer type used in DatumGetPointer() happens to be a char * rather than a void *. However, that's no excuse and we should be using the correct macro for the job. Here we also make use of OutputFunctionCall() rather than using FunctionCall1() directly to call the type's output function. OutputFunctionCall() is the standard way to do this. It casts the returned value to a cstring for us. In passing get rid of a duplicate call to strlen(). Most compilers will likely optimize away the 2nd call, but there may be some that won't. In any case, this just aligns the code to some other nearby code that already does this. Discussion: https://postgr.es/m/CAApHDvq1D=ehZ8hey8Hz67N+_Zth0GHO5wiVCfv1YcGPMXJq0A@mail.gmail.com	2021-06-04 22:42:17 +12:00
David Rowley	7fc26d11e3	Adjust locations which have an incorrect copyright year A few patches committed after `ca3b37487` mistakenly forgot to make the copyright year 2021. Fix these. Discussion: https://postgr.es/m/CAApHDvqyLmd9P2oBQYJ=DbrV8QwyPRdmXtCTFYPE08h+ip0UJw@mail.gmail.com	2021-06-04 12:19:50 +12:00
David Rowley	f736e188ce	Standardize usages of appendStringInfo and appendPQExpBuffer Fix a few places that were using appendStringInfo() when they should have been using appendStringInfoString(). Also some cases of appendPQExpBuffer() that would have been better suited to use appendPQExpBufferChar(), and finally, some places that used appendPQExpBuffer() when appendPQExpBufferStr() would have suited better. No bugs are being fixed here. The aim is just to make the code use the most optimal function for the job. All the code being changed here is new to PG14. It makes sense to fix these before we branch for PG15. There are a few other places that we could fix, but those cases are older code so fixing those seems less worthwhile as it may cause unnecessary back-patching pain in the future. Author: Hou Zhijie Discussion: https://postgr.es/m/OS0PR01MB5716732158B1C4142C6FE375943D9@OS0PR01MB5716.jpnprd01.prod.outlook.com	2021-06-03 16:38:03 +12:00
Tomas Vondra	8e03eb92e9	Revert most of `39b66a91bd` Reverts most of commit `39b66a91bd`, which was found to cause significant regression for REFRESH MATERIALIZED VIEW. This means only rows inserted by heap_multi_insert will benefit from the optimization, implemented in commit `7db0cd2145`. Reported-by: Masahiko Sawada Discussion: https://postgr.es/m/CAD21AoA%3D%3Df2VSw3c-Cp_y%3DWLKHMKc1D6s7g3YWsCOvgaYPpJcg%40mail.gmail.com	2021-06-03 00:13:59 +02:00
Peter Geoghegan	9afdea9824	Fix VACUUM VERBOSE's LP_DEAD item pages output. Oversight in commit `5100010e`.	2021-05-27 17:09:16 -07:00
Tom Lane	e6241d8e03	Rethink definition of pg_attribute.attcompression. Redefine '\0' (InvalidCompressionMethod) as meaning "if we need to compress, use the current setting of default_toast_compression". This allows '\0' to be a suitable default choice regardless of datatype, greatly simplifying code paths that initialize tupledescs and the like. It seems like a more user-friendly approach as well, because now the default compression choice doesn't migrate into table definitions, meaning that changing default_toast_compression is usually sufficient to flip an installation's behavior; one needn't tediously issue per-column ALTER SET COMPRESSION commands. Along the way, fix a few minor bugs and documentation issues with the per-column-compression feature. Adopt more robust APIs for SetIndexStorageProperties and GetAttributeCompression. Bump catversion because typical contents of attcompression will now be different. We could get away without doing that, but it seems better to ensure v14 installations all agree on this. (We already forced initdb for beta2, anyway.) Discussion: https://postgr.es/m/626613.1621787110@sss.pgh.pa.us	2021-05-27 13:24:27 -04:00
Michael Paquier	190fa5a00a	Fix typo in heapam.c Author: Hou Zhijie Discussion: https://postgr.es/m/OS0PR01MB571612191738540B27A8DE5894249@OS0PR01MB5716.jpnprd01.prod.outlook.com	2021-05-26 19:53:03 +09:00
Michael Paquier	fb0f5f0172	Fix memory leak when de-toasting compressed values in VACUUM FULL/CLUSTER VACUUM FULL and CLUSTER can be used to enforce the use of the existing compression method of a toastable column if a value currently stored is compressed with a method that does not match the column's defined method. The code in charge of decompressing and recompressing toast values at rewrite left around the detoasted values, causing an accumulation of memory allocated in TopTransactionContext. When processing large relations, this could cause the system to run out of memory. The detoasted values are not needed once their tuple is rewritten, and this commit ensures that the necessary cleanup happens. Issue introduced by `bbe0a81d`. The comments of the area are reordered a bit while on it. Reported-by: Andres Freund Analyzed-by: Andres Freund Author: Michael Paquier Reviewed-by: Dilip Kumar Discussion: https://postgr.es/m/20210521211929.pcehg6f23icwstdb@alap3.anarazel.de	2021-05-25 14:27:18 +09:00
Peter Geoghegan	c242baa4a8	Consider triggering VACUUM failsafe during scan. The wraparound failsafe mechanism added by commit `1e55e7d1` handled the one-pass strategy case (i.e. the "table has no indexes" case) by adding a dedicated failsafe check. This made up for the fact that the usual one-pass checks inside lazy_vacuum_all_indexes() cannot ever be reached during a one-pass strategy VACUUM. This approach failed to account for two-pass VACUUMs that opt out of index vacuuming up-front. The INDEX_CLEANUP off case is the only case that works like that. Fix this by performing a failsafe check every 4GB during the first scan of the heap, regardless of the details of the VACUUM. This eliminates the special case, and will make the failsafe trigger more reliably. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Andres Freund <andres@anarazel.de> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/20210424002921.pb3t7h6frupdqnkp@alap3.anarazel.de	2021-05-24 17:14:02 -07:00
Tom Lane	f5024d8d7b	Re-order pg_attribute columns to eliminate some padding space. Now that attcompression is just a char, there's a lot of wasted padding space after it. Move it into the group of char-wide columns to save a net of 4 bytes per pg_attribute entry. While we're at it, swap the order of attstorage and attalign to make for a more logical grouping of these columns. Also re-order actions in related code to match the new field ordering. This patch also fixes one outright bug: equalTupleDescs() failed to compare attcompression. That could, for example, cause relcache reload to fail to adopt a new value following a change. Michael Paquier and Tom Lane, per a gripe from Andres Freund. Discussion: https://postgr.es/m/20210517204803.iyk5wwvwgtjcmc5w@alap3.anarazel.de	2021-05-23 12:12:09 -04:00
Tom Lane	f21fadafaf	Avoid detoasting failure after COMMIT inside a plpgsql FOR loop. exec_for_query() normally tries to prefetch a few rows at a time from the query being iterated over, so as to reduce executor entry/exit overhead. Unfortunately this is unsafe if we have COMMIT or ROLLBACK within the loop, because there might be TOAST references in the data that we prefetched but haven't yet examined. Immediately after the COMMIT/ROLLBACK, we have no snapshots in the session, meaning that VACUUM is at liberty to remove recently-deleted TOAST rows. This was originally reported as a case triggering the "no known snapshots" error in init_toast_snapshot(), but even if you miss hitting that, you can get "missing toast chunk", as illustrated by the added isolation test case. To fix, just disable prefetching in non-atomic contexts. Maybe there will be performance complaints prompting us to work harder later, but it's not clear at the moment that this really costs much, and I doubt we'd want to back-patch any complicated fix. In passing, adjust that error message in init_toast_snapshot() to be a little clearer about the likely cause of the problem. Patch by me, based on earlier investigation by Konstantin Knizhnik. Per bug #15990 from Andreas Wicht. Back-patch to v11 where intra-procedure COMMIT was added. Discussion: https://postgr.es/m/15990-eee2ac466b11293d@postgresql.org	2021-05-20 18:32:37 -04:00
Fujii Masao	167bd48049	Make standby promotion reset the recovery pause state to 'not paused'. If a promotion is triggered while recovery is paused, the paused state ends and promotion continues. But previously in that case pg_get_wal_replay_pause_state() returned 'paused' wrongly while a promotion was ongoing. This commit changes a standby promotion so that it marks the recovery pause state as 'not paused' when it's triggered, to fix the issue. Author: Fujii Masao Reviewed-by: Dilip Kumar, Kyotaro Horiguchi Discussion: https://postgr.es/m/f706876c-4894-0ba5-6f4d-79803eeea21b@oss.nttdata.com	2021-05-19 13:48:19 +09:00
Peter Geoghegan	8f72bbac3e	Harden nbtree deduplication posting split code. Add a defensive "can't happen" error to code that handles nbtree posting list splits (promote an existing assertion). This avoids a segfault in the event of an insertion of a newitem that is somehow identical to an existing non-pivot tuple in the index. An nbtree index should never have two index tuples with identical TIDs. This scenario is not particularly unlikely in the event of any kind of corruption that leaves the index in an inconsistent state relative to the heap relation that is indexed. There are two known reports of preventable hard crashes. Doing nothing seems unacceptable given the general expectation that nbtree will cope reasonably well with corrupt data. Discussion: https://postgr.es/m/CAH2-Wz=Jr_d-dOYEEmwz0-ifojVNWho01eAqewfQXgKfoe114w@mail.gmail.com Backpatch: 13-, where nbtree deduplication was introduced.	2021-05-14 15:08:02 -07:00
Tom Lane	c3c35a733c	Prevent infinite insertion loops in spgdoinsert(). Formerly we just relied on operator classes that assert longValuesOK to eventually shorten the leaf value enough to fit on an index page. That fails since the introduction of INCLUDE-column support (commit `09c1c6ab4`), because the INCLUDE columns might alone take up more than a page, meaning no amount of leaf-datum compaction will get the job done. At least with spgtextproc.c, that leads to an infinite loop, since spgtextproc.c won't throw an error for not being able to shorten the leaf datum anymore. To fix without breaking cases that would otherwise work, add logic to spgdoinsert() to verify that the leaf tuple size is decreasing after each "choose" step. Some opclasses might not decrease the size on every single cycle, and in any case, alignment roundoff of the tuple size could obscure small gains. Therefore, allow up to 10 cycles without additional savings before throwing an error. (Perhaps this number will need adjustment, but it seems quite generous right now.) As long as we've developed this logic, let's back-patch it. The back branches don't have INCLUDE columns to worry about, but this seems like a good defense against possible bugs in operator classes. We already know that an infinite loop here is pretty unpleasant, so having a defense seems to outweigh the risk of breaking things. (Note that spgtextproc.c is actually the only known opclass with longValuesOK support, so that this is all moot for known non-core opclasses anyway.) Per report from Dilip Kumar. Discussion: https://postgr.es/m/CAFiTN-uxP_soPhVG840tRMQTBmtA_f_Y8N51G7DKYYqDh7XN-A@mail.gmail.com	2021-05-14 15:07:34 -04:00
Tom Lane	eb7a6b9229	Fix query-cancel handling in spgdoinsert(). Knowing that a buggy opclass could cause an infinite insertion loop, spgdoinsert() intended to allow its loop to be interrupted by query cancel. However, that never actually worked, because in iterations after the first, we'd be holding buffer lock(s) which would cause InterruptHoldoffCount to be positive, preventing servicing of the interrupt. To fix, check if an interrupt is pending, and if so fall out of the insertion loop and service the interrupt after we've released the buffers. If it was indeed a query cancel, that's the end of the matter. If it was a non-canceling interrupt reason, make use of the existing provision to retry the whole insertion. (This isn't as wasteful as it might seem, since any upper-level index tuples we already created should be usable in the next attempt.) While there's no known instance of such a bug in existing release branches, it still seems like a good idea to back-patch this to all supported branches, since the behavior is fairly nasty if a loop does happen --- not only is it uncancelable, but it will quickly consume memory to the point of an OOM failure. In any case, this code is certainly not working as intended. Per report from Dilip Kumar. Discussion: https://postgr.es/m/CAFiTN-uxP_soPhVG840tRMQTBmtA_f_Y8N51G7DKYYqDh7XN-A@mail.gmail.com	2021-05-14 13:29:39 -04:00
Peter Geoghegan	fbe9b80610	Fix autovacuum log output heap truncation issue. The percentage of blocks from the table value reported by autovacuum log output (following commit `5100010ee4`) should never exceed 100% because it describes the state of the table back when lazy_vacuum() was called. The value could nevertheless exceed 100% in the event of heap relation truncation. We failed to compensate for how truncation affects rel_pages. Fix the faulty accounting by using the original rel_pages value instead of the current/final rel_pages value. Reported-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20210423204306.5osfpkt2ggaedyvy@alap3.anarazel.de	2021-05-13 16:07:17 -07:00
Tom Lane	def5b065ff	Initial pgindent and pgperltidy run for v14. Also "make reformat-dat-files". The only change worthy of note is that pgindent messed up the formatting of launcher.c's struct LogicalRepWorkerId, which led me to notice that that struct wasn't used at all anymore, so I just took it out.	2021-05-12 13:14:10 -04:00
Peter Eisentraut	ec6e70c79f	Refactor some error messages for easier translation	2021-05-12 07:42:51 +02:00
Fujii Masao	d780d7c088	Change data type of counters in BufferUsage and WalUsage from long to int64. Previously long was used as the data type for some counters in BufferUsage and WalUsage. But long is only four bytes, e.g., on Windows, and it's entirely possible to wrap a four byte counter. For example, emitting more than four billion WAL records in one transaction isn't actually particularly rare. To avoid the overflows of those counters, this commit changes the data type of them from long to int64. Suggested-by: Andres Freund Author: Masahiro Ikeda Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/20201221211650.k7b53tcnadrciqjo@alap3.anarazel.de Discussion: https://postgr.es/m/af0964ac-7080-1984-dc23-513754987716@oss.nttdata.com	2021-05-12 09:56:34 +09:00
Tom Lane	049e1e2edb	Fix mishandling of resjunk columns in ON CONFLICT ... UPDATE tlists. It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present. If it happens, the ON CONFLICT UPDATE code path would end up storing tuples that include the values of the extra resjunk columns. That's fairly harmless in the short run, but if new columns are added to the table then the values would become accessible, possibly leading to malfunctions if they don't match the datatypes of the new columns. This had escaped notice through a confluence of missing sanity checks, including * There's no cross-check that a tuple presented to heap_insert or heap_update matches the table rowtype. While it's difficult to check that fully at reasonable cost, we can easily add assertions that there aren't too many columns. * The output-column-assignment cases in execExprInterp.c lacked any sanity checks on the output column numbers, which seems like an oversight considering there are plenty of assertion checks on input column numbers. Add assertions there too. * We failed to apply nodeModifyTable's ExecCheckPlanOutput() to the ON CONFLICT UPDATE tlist. That wouldn't have caught this specific error, since that function is chartered to ignore resjunk columns; but it sure seems like a bad omission now that we've seen this bug. In HEAD, the right way to fix this is to make the processing of ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists now do, that is don't add "SET x = x" entries, and use ExecBuildUpdateProjection to evaluate the tlist and combine it with old values of the not-set columns. This adds a little complication to ExecBuildUpdateProjection, but allows removal of a comparable amount of now-dead code from the planner. In the back branches, the most expedient solution seems to be to (a) use an output slot for the ON CONFLICT UPDATE projection that actually matches the target table, and then (b) invent a variant of ExecBuildProjectionInfo that can be told to not store values resulting from resjunk columns, so it doesn't try to store into nonexistent columns of the output slot. (We can't simply ignore the resjunk columns altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.) This works back to v10. In 9.6, projections work much differently and we can't cheaply give them such an option. The 9.6 version of this patch works by inserting a JunkFilter when it's necessary to get rid of resjunk columns. In addition, v11 and up have the reverse problem when trying to perform ON CONFLICT UPDATE on a partitioned table. Through a further oversight, adjust_partition_tlist() discarded resjunk columns when re-ordering the ON CONFLICT UPDATE tlist to match a partition. This accidentally prevented the storing-bogus-tuples problem, but at the cost that MULTIEXPR_SUBLINK cases didn't work, typically crashing if more than one row has to be updated. Fix by preserving resjunk columns in that routine. (I failed to resist the temptation to add more assertions there too, and to do some minor code beautification.) Per report from Andres Freund. Back-patch to all supported branches. Security: CVE-2021-32028	2021-05-10 11:02:29 -04:00
Thomas Munro	c2dc19342e	Revert recovery prefetching feature. This set of commits has some bugs with known fixes, but at this late stage in the release cycle it seems best to revert and resubmit next time, along with some new automated test coverage for this whole area. Commits reverted: dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery." 1d257577: Optionally prefetch referenced data in recovery. f003d9f8: Add circular WAL decoding buffer. 323cbe7c: Remove read_page callback from XLogReader. Remove the new GUC group WAL_RECOVERY recently added by `a55a9847`, as the corresponding section of config.sgml is now reverted. Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com	2021-05-10 16:06:09 +12:00
Peter Geoghegan	c9787385db	Remove overzealous VACUUM visibility map assertion. The all_visible_according_to_vm variable's value is inherently prone to becoming invalidated concurrently, since it is set before we even acquire a lock on a related heap page buffer. Oversight in commit `7136bf34`, which added the assertion in passing. Author: Masahiko Sawada <sawada.mshk@gmail.com> Reported-By: Tang <tanghy.fnst@fujitsu.com> Diagnosed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoDzgc8_MYrA5m1fyydomw_eVKtQiYh7sfDK4KEhdMsf_g@mail.gmail.com	2021-05-06 13:17:39 -07:00
Michael Paquier	4aba61b870	Add some forgotten LSN_FORMAT_ARGS() in xlogreader.c `6f6f284` has introduced a specific macro to make printf()-ing of LSNs easier. This takes care of what looks like the remaining code paths that did not get the call. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi, Tom Lane Discussion: https://postgr.es/m/YIJS9x6K8ruizN7j@paquier.xyz	2021-04-24 09:09:02 +09:00
Michael Paquier	62aa2bb293	Remove use of [U]INT64_FORMAT in some translatable strings %lld with (long long), or %llu with (unsigned long long) are more adapted. This is similar to `3286065`. Author: Kyotaro Horiguchi Discussion: https://postgr.es/m/20210421.200000.1462448394029407895.horikyota.ntt@gmail.com	2021-04-23 13:25:49 +09:00
Peter Eisentraut	f0ec598b43	Fix typo	2021-04-21 08:07:37 +02:00
Tom Lane	783be78ca9	Improve WAL record descriptions for SP-GiST records. While tracking down the bug fixed in the preceding commit, I got quite annoyed by the low quality of spg_desc's output. Add missing fields, try to make the formatting consistent.	2021-04-20 17:01:49 -04:00
Peter Geoghegan	7136bf34f2	Document LP_DEAD accounting issues in VACUUM. Document VACUUM's soft assumption that any LP_DEAD items encountered during pruning will become LP_UNUSED items before VACUUM finishes up. This is integral to the accounting used by VACUUM to generate its final report on the table to the stats collector. It also affects how VACUUM determines which heap pages are truncatable. In both cases VACUUM is concerned with the likely contents of the page in the near future, not the current contents of the page. This state of affairs created the false impression that VACUUM's dead tuple accounting had significant difference with similar accounting used during ANALYZE. There were and are no substantive differences, at least when the soft assumption completely works out. This is far clearer now. Also document cases where things don't quite work out for VACUUM's dead tuple accounting. It's possible that a significant number of LP_DEAD items will be left behind by VACUUM, and won't be recorded as remaining dead tuples in VACUUM's statistics collector report. This behavior dates back to commit `a96c41fe`, which taught VACUUM to run without index and heap vacuuming at the user's request. The failsafe mechanism added to VACUUM more recently by commit `1e55e7d1` takes the same approach to dead tuple accounting. Reported-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Jmtu18PrsYq3EvvZJGOmZqSO2u3bvKpx9xJa5uhNp=Q@mail.gmail.com	2021-04-19 18:55:31 -07:00
Michael Paquier	7ef8b52cf0	Fix typos and grammar in comments and docs Author: Justin Pryzby Discussion: https://postgr.es/m/20210416070310.GG3315@telsasoft.com	2021-04-19 11:32:30 +09:00
Peter Eisentraut	f59b58e2a1	Use correct format placeholder for block numbers Should be %u rather than %d.	2021-04-17 09:40:50 +02:00
Peter Eisentraut	07e5e66742	Improve quoting in some error messages	2021-04-14 09:11:29 +02:00
Peter Geoghegan	60f1f09ff4	Don't truncate heap when VACUUM's failsafe is in effect. It seems like a good idea to bypass heap truncation when the wraparound failsafe mechanism (which was added in commit `1e55e7d1`) is in effect. Deliberately don't bypass heap truncation in the INDEX_CLEANUP=off case, even though it is similar to the failsafe case. There is already a separate reloption (and related VACUUM parameter) for that. Reported-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoDWRh6oTN5T8wa+cpZUVpHXET8BJ8Da7WHVHpwkPP6KLg@mail.gmail.com	2021-04-13 12:58:31 -07:00
Tom Lane	34f581c39e	Avoid improbable PANIC during heap_update. heap_update needs to clear any existing "all visible" flag on the old tuple's page (and on the new page too, if different). Per coding rules, to do this it must acquire pin on the appropriate visibility-map page while not holding exclusive buffer lock; which creates a race condition since someone else could set the flag whenever we're not holding the buffer lock. The code is supposed to handle that by re-checking the flag after acquiring buffer lock and retrying if it became set. However, one code path through heap_update itself, as well as one in its subroutine RelationGetBufferForTuple, failed to do this. The end result, in the unlikely event that a concurrent VACUUM did set the flag while we're transiently not holding lock, is a non-recurring "PANIC: wrong buffer passed to visibilitymap_clear" failure. This has been seen a few times in the buildfarm since recent VACUUM changes that added code paths that could set the all-visible flag while holding only exclusive buffer lock. Previously, the flag was (usually?) set only after doing LockBufferForCleanup, which would insist on buffer pin count zero, thus preventing the flag from becoming set partway through heap_update. However, it's clear that it's heap_update not VACUUM that's at fault here. What's less clear is whether there is any hazard from these bugs in released branches. heap_update is certainly violating API expectations, but if there is no code path that can set all-visible without a cleanup lock then it's only a latent bug. That's not 100% certain though, besides which we should worry about extensions or future back-patch fixes that could introduce such code paths. I chose to back-patch to v12. Fixing RelationGetBufferForTuple before that would require also back-patching portions of older fixes (notably `0d1fe9f74`), which is more code churn than seems prudent to fix a hypothetical issue. 
Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us	2021-04-13 12:17:24 -04:00
Thomas Munro	b1df6b696b	Fix potential SSI hazard in heap_update(). Commit `6f38d4dac3` failed to heed a warning about the stability of the value pointed to by "otid". The caller is allowed to pass in a pointer to newtup->t_self, which will be updated during the execution of the function. Instead, the SSI check should use the value we copy into oldtup.t_self near the top of the function. Not a live bug, because newtup->t_self doesn't really get updated until a bit later, but it was confusing and broke the rule established by the comment. Back-patch to 13. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2689164.1618160085%40sss.pgh.pa.us	2021-04-13 13:02:56 +12:00
Fujii Masao	08aa89b326	Remove COMMIT_TS_SETTS record. Commit `438fc4a39c` prevented the WAL replay from writing COMMIT_TS_SETTS record. By this change there is no code that generates COMMIT_TS_SETTS record in PostgreSQL core. Also we can think that there are no extensions using the record because we've not received so far any complaints about the issue that commit `438fc4a39c` fixed. Therefore this commit removes COMMIT_TS_SETTS record and its related code. Even without this record, the timestamp required for commit timestamp feature can be acquired from the COMMIT record. Bump WAL page magic. Reported-by: lx zou <zoulx1982@163.com> Author: Fujii Masao Reviewed-by: Alvaro Herrera Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org	2021-04-12 00:04:30 +09:00
Thomas Munro	dc88460c24	Doc: Review for "Optionally prefetch referenced data in recovery." Typos, corrections and language improvements in the docs, and a few in code comments too. Reported-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20210409033703.GP6592%40telsasoft.com	2021-04-10 08:21:53 +12:00
Peter Geoghegan	796092fb84	Silence another _bt_check_unique compiler warning. Per complaint from Tom Lane Discussion: https://postgr.es/m/1922884.1617909599@sss.pgh.pa.us	2021-04-08 12:54:31 -07:00
Thomas Munro	1d257577e0	Optionally prefetch referenced data in recovery. Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Better mechanisms will follow in later work on the I/O subsystem. The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed. The GUC wal_decode_buffer_size is used to limit the maximum distance we are prepared to read ahead in the WAL to find uncached blocks. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts) Reviewed-by: Andres Freund <andres@anarazel.de> (parts) Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts) Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com	2021-04-08 23:20:42 +12:00
Thomas Munro	f003d9f872	Add circular WAL decoding buffer. Teach xlogreader.c to decode its output into a circular buffer, to support optimizations based on looking ahead. * XLogReadRecord() works as before, consuming records one by one, and allowing them to be examined via the traditional XLogRecGetXXX() macros. * An alternative new interface XLogNextRecord() is added that returns pointers to DecodedXLogRecord structs that can be examined directly. * XLogReadAhead() provides a second cursor that lets you see further ahead, as long as data is available and there is enough space in the decoding buffer. This returns DecodedXLogRecord pointers to the caller, but also adds them to a queue of records that will later be consumed by XLogNextRecord()/XLogReadRecord(). The buffer's size is controlled with wal_decode_buffer_size. The buffer could potentially be placed into shared memory, for future projects. Large records that don't fit in the circular buffer are called "oversized" and allocated separately with palloc(). Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com	2021-04-08 23:20:42 +12:00
Thomas Munro	323cbe7c7d	Remove read_page callback from XLogReader. Previously, the XLogReader module would fetch new input data using a callback function. Redesign the interface so that it tells the caller to insert more data with a special return value instead. This API suits later patches for prefetching, encryption and maybe other future projects that would otherwise require continually extending the callback interface. As incidental cleanup work, move global variables readOff, readLen and readSegNo inside XlogReaderState. Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version) Reviewed-by: Antonin Houska <ah@cybertec.at> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp	2021-04-08 23:20:42 +12:00
Alvaro Herrera	0827e8af70	autovacuum: handle analyze for partitioned tables Previously, autovacuum would completely ignore partitioned tables, which is not good regarding analyze -- failing to analyze those tables means poor plans may be chosen. Make autovacuum aware of those tables by propagating "changes since analyze" counts from the leaf partitions up the partitioning hierarchy. This also introduces necessary reloptions support for partitioned tables (autovacuum_enabled, autovacuum_analyze_scale_factor, autovacuum_analyze_threshold). It's unclear how best to document this aspect. Author: Yuzuko Hosoya <yuzukohosoya@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com	2021-04-08 01:19:36 -04:00
Peter Geoghegan	5100010ee4	Teach VACUUM to bypass unnecessary index vacuuming. VACUUM has never needed to call ambulkdelete() for each index in cases where there are precisely zero TIDs in its dead_tuples array by the end of its first pass over the heap (also its only pass over the heap in this scenario). Index vacuuming is simply not required when this happens. Index cleanup will still go ahead, but in practice most calls to amvacuumcleanup() are usually no-ops when there were zero preceding ambulkdelete() calls. In short, VACUUM has generally managed to avoid index scans when there were clearly no index tuples to delete from indexes. But cases with _close to_ no index tuples to delete were another matter -- a round of ambulkdelete() calls took place (one per index), each of which performed a full index scan. VACUUM now behaves just as if there were zero index tuples to delete in cases where there are in fact "virtually zero" such tuples. That is, it can now bypass index vacuuming and heap vacuuming as an optimization (though not index cleanup). Whether or not VACUUM bypasses indexes is determined dynamically, based on the just-observed number of heap pages in the table that have one or more LP_DEAD items (LP_DEAD items in heap pages have a 1:1 correspondence with index tuples that still need to be deleted from each index in the worst case). We only skip index vacuuming when 2% or less of the table's pages have one or more LP_DEAD items -- bypassing index vacuuming as an optimization must not noticeably impede setting bits in the visibility map. As a further condition, the dead_tuples array (i.e. VACUUM's array of LP_DEAD item TIDs) must not exceed 32MB at the point that the first pass over the heap finishes, which is also when the decision to bypass is made. (The VACUUM must also have been able to fit all TIDs in its maintenance_work_mem-bound dead_tuples space, though with a default maintenance_work_mem setting it can't matter.) 
This avoids surprising jumps in the duration and overhead of routine vacuuming with workloads where successive VACUUM operations consistently have almost zero dead index tuples. The number of LP_DEAD items may well accumulate over multiple VACUUM operations, before finally the threshold is crossed and VACUUM performs conventional index vacuuming. Even then, the optimization will have avoided a great deal of largely unnecessary index vacuuming. In the future we may teach VACUUM to skip index vacuuming on a per-index basis, using a much more sophisticated approach. For now we only consider the extreme cases, where we can be quite confident that index vacuuming just isn't worth it using simple heuristics. Also log information about how many heap pages have one or more LP_DEAD items when autovacuum logging is enabled. Author: Masahiko Sawada <sawada.mshk@gmail.com> Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com	2021-04-07 16:14:54 -07:00
Peter Geoghegan	1e55e7d175	Add wraparound failsafe to VACUUM. Add a failsafe mechanism that is triggered by VACUUM when it notices that the table's relfrozenxid and/or relminmxid are dangerously far in the past. VACUUM checks the age of the table dynamically, at regular intervals. When the failsafe triggers, VACUUM takes extraordinary measures to finish as quickly as possible so that relfrozenxid and/or relminmxid can be advanced. VACUUM will stop applying any cost-based delay that may be in effect. VACUUM will also bypass any further index vacuuming and heap vacuuming -- it only completes whatever remaining pruning and freezing is required. Bypassing index/heap vacuuming is enabled by commit `8523492d`, which made it possible to dynamically trigger the mechanism already used within VACUUM when it is run with INDEX_CLEANUP off. It is expected that the failsafe will almost always trigger within an autovacuum to prevent wraparound, long after the autovacuum began. However, the failsafe mechanism can trigger in any VACUUM operation. Even in a non-aggressive VACUUM, where we're likely to not advance relfrozenxid, it still seems like a good idea to finish off remaining pruning and freezing. An aggressive/anti-wraparound VACUUM will be launched immediately afterwards. Note that the anti-wraparound VACUUM that follows will itself trigger the failsafe, usually before it even begins its first (and only) pass over the heap. The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and vacuum_multixact_failsafe_age. There are no equivalent reloptions, since that isn't expected to be useful. The GUCs have rather high defaults (both default to 1.6 billion), and are expected to generally only be used to make the failsafe trigger sooner/more frequently. 
Author: Masahiko Sawada <sawada.mshk@gmail.com> Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzmgH3ySGYeC-m-eOBsa2=sDwa292-CFghV4rESYo39FsQ@mail.gmail.com	2021-04-07 12:37:45 -07:00
Peter Geoghegan	3c3b8a4b26	Truncate line pointer array during VACUUM. Teach VACUUM to truncate the line pointer array of each heap page when a contiguous group of LP_UNUSED line pointers appear at the end of the array -- these unused and unreferenced items are excluded. This process occurs during VACUUM's second pass over the heap, right after LP_DEAD line pointers on the page (those encountered/pruned during the first pass) are marked LP_UNUSED. Truncation avoids line pointer bloat with certain workloads, particularly those involving continual range DELETEs and bulk INSERTs against the same table. Also harden heapam code to check for an out-of-range page offset number in places where we weren't already doing so. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com Discussion: https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com	2021-04-07 08:47:15 -07:00
Tomas Vondra	23607a8156	Don't add non-existent pages to bitmap from BRIN The code in bringetbitmap() simply added the whole matching page range to the TID bitmap, as determined by pages_per_range, even if some of the pages were beyond the end of the heap. The query then might fail with an error like this: ERROR: could not open file "base/20176/20228.2" (target block 262144): previous segment is only 131021 blocks In this case, the relation has 262093 pages (131072 and 131021 pages), but we're trying to access block 262144, i.e. first block of the 3rd segment. At that point _mdfd_getseg() notices the preceding segment is incomplete, and fails. Hitting this in practice is rather unlikely, because: * Most indexes use power-of-two ranges, so segments and page ranges align perfectly (segment end is also a page range end). * The table size has to be just right, with the last segment being almost full - less than one page range from full segment, so that the last page range actually crosses the segment boundary. * Prefetch has to be enabled. The regular page access checks that pages are not beyond heap end, but prefetch does not. On older releases (before 12) the execution stops after hitting the first non-existent page, so the prefetch distance has to be sufficient to reach the first page in the next segment to trigger the issue. Since 12 it's enough to just have prefetch enabled, the prefetch distance does not matter. Fixed by not adding non-existent pages to the TID bitmap. Backpatch all the way back to 9.6 (BRIN indexes were introduced in 9.5, but that release is EOL). Backpatch-through: 9.6	2021-04-07 15:58:36 +02:00
Heikki Linnakangas	d92b1cdbab	Revert "Add sortsupport for gist_btree opclasses, for faster index builds." This reverts commit `9f984ba6d2`. It was making the buildfarm unhappy, apparently setting client_min_messages in a regression test produces different output if log_statement='all'. Another issue is that I now suspect the bit sortsupport function was in fact not correct to call byteacmp(). Revert to investigate both of those issues.	2021-04-07 14:33:21 +03:00
Heikki Linnakangas	9f984ba6d2	Add sortsupport for gist_btree opclasses, for faster index builds. Commit `16fa9b2b30` introduced a faster way to build GiST indexes, by sorting all the data. This commit adds the sortsupport functions needed to make use of that feature for btree_gist. Author: Andrey Borodin Discussion: https://www.postgresql.org/message-id/2F3F7265-0D22-44DB-AD71-8554C743D943@yandex-team.ru	2021-04-07 13:22:05 +03:00
Michael Paquier	4c0239cb7a	Remove redundant memset(0) calls for page init of some index AMs Bloom, GIN, GiST and SP-GiST rely on PageInit() to initialize the contents of a page, and this routine fills entirely a page with zeros for a size of BLCKSZ, including the special space. Those index AMs have been using an extra memset() call to fill with zeros the special page space, or even the whole page, which is not necessary as PageInit() already does this work, so let's remove them. GiST was not doing this extra call, but has commented out a system call that did so since `6236991`. While on it, remove one MAXALIGN() for SP-GiST as PageInit() takes care of that. This makes the whole page initialization logic more consistent across all index AMs. Author: Bharath Rupireddy Reviewed-by: Vignesh C, Mahendra Singh Thalor Discussion: https://postgr.es/m/CALj2ACViOo2qyaPT7krWm4LRyRTw9kOXt+g6PfNmYuGA=YHj9A@mail.gmail.com	2021-04-07 14:35:26 +09:00
Peter Geoghegan	8523492d4e	Remove tupgone special case from vacuumlazy.c. Retry the call to heap_prune_page() in rare cases where there is disagreement between the heap_prune_page() call and the call to HeapTupleSatisfiesVacuum() that immediately follows. Disagreement is possible when a concurrently-aborted transaction makes a tuple DEAD during the tiny window between each step. This was the only case where a tuple considered DEAD by VACUUM still had storage following pruning. VACUUM's definition of dead tuples is now uniformly simple and unambiguous: dead tuples from each page are always LP_DEAD line pointers that were encountered just after we performed pruning (and just before we considered freezing remaining items with tuple storage). Eliminating the tupgone=true special case enables INDEX_CLEANUP=off style skipping of index vacuuming that takes place based on flexible, dynamic criteria. The INDEX_CLEANUP=off case had to know about skipping indexes up-front before now, due to a subtle interaction with the special case (see commit `dd695979`) -- this was a special case unto itself. Now there are no special cases. And so now it won't matter when or how we decide to skip index vacuuming: it won't affect how pruning behaves, and it won't be affected by any of the implementation details of pruning or freezing. Also remove XLOG_HEAP2_CLEANUP_INFO records. These are no longer necessary because we now rely entirely on heap pruning taking care of recovery conflicts. There is no longer any need to generate recovery conflicts for DEAD tuples that pruning just missed. This also means that heap vacuuming now uses exactly the same strategy for recovery conflicts as index vacuuming always has: REDO routines never need to process a latestRemovedXid from the WAL record, since earlier REDO of the WAL record from pruning is sufficient in all cases. 
The generic XLOG_HEAP2_CLEAN record type is now split into two new record types to reflect this new division (these are called XLOG_HEAP2_PRUNE and XLOG_HEAP2_VACUUM). Also stop acquiring a super-exclusive lock for heap pages when they're vacuumed during VACUUM's second heap pass. A regular exclusive lock is enough. This is correct because heap page vacuuming is now strictly a matter of setting the LP_DEAD line pointers to LP_UNUSED. No other backend can have a pointer to a tuple located in a pinned buffer that can be invalidated by a concurrent heap page vacuum operation. Heap vacuuming can now be thought of as conceptually similar to index vacuuming and conceptually dissimilar to heap pruning. Heap pruning now has sole responsibility for anything involving the logical contents of the database (e.g., managing transaction status information, recovery conflicts, considering what to do with HOT chains). Index vacuuming and heap vacuuming are now only concerned with recycling garbage items from physical data structures that back the logical database. Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record changes. Credit for the idea of retrying pruning a page to avoid the tupgone case goes to Andres Freund. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com	2021-04-06 08:49:22 -07:00
Peter Geoghegan	7ab96cf6b3	Refactor lazy_scan_heap() loop. Add a lazy_scan_heap() subsidiary function that handles heap pruning and tuple freezing: lazy_scan_prune(). This is a great deal cleaner. The code that remains in lazy_scan_heap()'s per-block loop can now be thought of as code that either comes before or after the call to lazy_scan_prune(), which is now the clear focal point. This division is enforced by the way in which we now manage state. lazy_scan_prune() outputs state (using its own struct) that describes what to do with the page following pruning and freezing (e.g., visibility map maintenance, recording free space in the FSM). It doesn't get passed any special instructional state from the preamble code, though. Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off from the logic used by single-heap-pass VACUUMs. The former case is now structured as the omission of index and heap vacuuming by a two pass VACUUM. The latter case goes back to being used only when the table happens to have no indexes (just as it was before commit `a96c41fe`). This structure is much more natural, since the whole point of INDEX_CLEANUP=off is to skip the index and heap vacuuming that would otherwise take place. The single-heap-pass case doesn't skip any useful work, though -- it just does heap pruning and heap vacuuming together when the table happens to have no indexes. Both of these changes are preparation for an upcoming patch that generalizes the mechanism used by INDEX_CLEANUP=off. The later patch will allow VACUUM to give up on index and heap vacuuming dynamically, as problems emerge (e.g., with wraparound), so that an affected VACUUM operation can finish up as soon as possible. Also fix a very old bug in single-pass VACUUM VERBOSE output. We were reporting the number of tuples deleted via pruning as a direct substitute for reporting the number of LP_DEAD items removed in a function that deals with the second pass over the heap. 
But that doesn't work at all -- they're two different things. To fix, start tracking the total number of LP_DEAD items encountered during pruning, and use that in the report instead. A single pass VACUUM will always vacuum away whatever LP_DEAD items a heap page has immediately after it is pruned, so the total number of LP_DEAD items encountered during pruning equals the total number vacuumed-away. (They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay because skipping index vacuuming is now a totally orthogonal concept to one-pass VACUUM.) Also stop reporting the count of LP_UNUSED items in VACUUM VERBOSE output. This makes the output of VACUUM VERBOSE more consistent with log_autovacuum's output (because it never showed information about LP_UNUSED items). VACUUM VERBOSE reported LP_UNUSED items left behind by the last VACUUM, and LP_UNUSED items created via pruning HOT chains during the current VACUUM (it never included LP_UNUSED items left behind by the current VACUUM's second pass over the heap). This makes it useless as an indicator of line pointer bloat, which must have been the original intention. (Like the first VACUUM VERBOSE issue, this issue was arguably an oversight in commit `282d2a03`, which added the heap-only tuple optimization.) Finally, stop reporting empty_pages in VACUUM VERBOSE output, and start reporting pages_removed instead. This also makes the output of VACUUM VERBOSE more consistent with log_autovacuum's output (which does not show empty_pages, but does show pages_removed). An empty page isn't meaningfully different to a page that is almost empty, or a page that is empty but for only a small number of remaining LP_UNUSED items. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com	2021-04-06 07:49:39 -07:00
Tom Lane	091e22b2e6	Clean up treatment of missing default and CHECK-constraint records. Andrew Gierth reported that it's possible to crash the backend if no pg_attrdef record is found to match an attribute that has atthasdef set. AttrDefaultFetch warns about this situation, but then leaves behind a relation tupdesc that has null "adbin" pointer(s), which most places don't guard against. We considered promoting the warning to an error, but throwing errors during relcache load is pretty drastic: it effectively locks one out of using the relation at all. What seems better is to leave the load-time behavior as a warning, but then throw an error in any code path that wants to use a default and can't find it. This confines the error to a subset of INSERT/UPDATE operations on the table, and in particular will at least allow a pg_dump to succeed. Also, we should fix AttrDefaultFetch to not leave any null pointers in the tupdesc, because that just creates an untested bug hazard. While at it, apply the same philosophy of "warn at load, throw error only upon use of the known-missing info" to CHECK constraints. CheckConstraintFetch is very nearly the same logic as AttrDefaultFetch, but for reasons lost in the mists of time, it was throwing ERROR for the same cases that AttrDefaultFetch treats as WARNING. Make the two functions more nearly alike. In passing, get rid of potentially-O(N^2) loops in equalTupleDesc by making AttrDefaultFetch sort the entries after fetching them, so that equalTupleDesc can assume that entries in two equal tupdescs must be in matching order. (CheckConstraintFetch already was sorting CHECK constraints, but equalTupleDesc hadn't been told about it.) There's some argument for back-patching this, but with such a small number of field reports, I'm content to fix it in HEAD. Discussion: https://postgr.es/m/87pmzaq4gx.fsf@news-spur.riddles.org.uk	2021-04-06 10:34:39 -04:00
Fujii Masao	9de9294b0c	Stop archive recovery if WAL generated with wal_level=minimal is found. Previously, if hot standby was enabled, archive recovery exited with an error when it found WAL generated with wal_level=minimal. But if hot standby was disabled, it just reported a warning and continued, which could lead to data loss or errors during normal operation. A warning was emitted, but users could easily miss it and not notice this serious situation until they encountered the actual errors. To improve this situation, this commit changes archive recovery so that it exits with a FATAL error when it finds WAL generated with wal_level=minimal, whatever the setting of hot standby. This enables users to notice the serious situation sooner. The FATAL error is thrown if archive recovery starts from a base backup taken before wal_level is changed to minimal. When archive recovery exits with the error, if users have a base backup taken after setting wal_level to higher than minimal, they can recover the database by starting archive recovery from that newer backup. But note that if no such backup exists, there is no easy way to complete archive recovery, which may make the database server unstartable, and users may lose the whole database. The commit adds a note about this risk to the documentation. Previously, even with an unstartable database server, users could avoid the error during archive recovery just by disabling hot standby, then forcibly start up the server and salvage data from it. But note that this commit makes that procedure unavailable. Author: Takamichi Osumi Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com	2021-04-06 22:56:51 +09:00
Peter Geoghegan	f6b8f19a08	Allocate access strategy in parallel VACUUM workers. Commit `49f49def` took entirely the wrong approach to fixing this issue. Just allocate a local buffer access strategy in each individual worker instead of trying to propagate state. This state was never propagated by parallel VACUUM in the first place. It looks like the only reason that this worked following commit `40d964ec` was that it involved static global variables, which are initialized to 0 per the C standard. A more comprehensive fix may be necessary, even on HEAD. This fix should at least get the buildfarm green once again. Thanks once again to Thomas Munro for continued off-list assistance with the issue.	2021-04-05 17:17:40 -07:00
Tom Lane	09c1c6ab4b	Support INCLUDE'd columns in SP-GiST. Not much to say here: does what it says on the tin. We steal a previously-always-zero bit from the nextOffset field of leaf index tuples in order to track whether there is a nulls bitmap. Otherwise it works about like included columns in other index types. Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova, and rather heavily editorialized on by me Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com	2021-04-05 18:41:21 -04:00
Peter Geoghegan	49f49defe7	Propagate parallel VACUUM's buffer access strategy. Parallel VACUUM relied on global variable state from the leader process being propagated to workers on fork(). Commit `b4af70cb` removed most uses of global variables inside vacuumlazy.c, but did not account for the buffer access strategy state. To fix, propagate the state through shared memory instead. Per buildfarm failures on elver, curculio, and morepork. Many thanks to Thomas Munro for off-list assistance with this issue.	2021-04-05 14:56:56 -07:00
Peter Geoghegan	b4af70cb21	Simplify state managed by VACUUM. Reorganize the state struct used by VACUUM -- group related items together to make it easier to understand. Also stop relying on stack variables inside lazy_scan_heap() -- move those into the state struct instead. Doing things this way simplifies large groups of related functions whose function signatures had a lot of unnecessary redundancy. Switch over to using int64 for the struct fields used to count things that are reported to the user via log_autovacuum and VACUUM VERBOSE output. We were using double, but that doesn't seem to have any advantages. Using int64 makes it possible to add assertions that verify that the first pass over the heap (pruning) encounters precisely the same number of LP_DEAD items that get deleted from indexes later on, in the second pass over the heap. These assertions will be added in later commits. Finally, adjust the signatures of functions with IndexBulkDeleteResult pointer arguments in cases where there was ambiguity about whether or not the argument relates to a single index or all indexes. Functions now use the idiom that both ambulkdelete() and amvacuumcleanup() have always used (where appropriate): accept a mutable IndexBulkDeleteResult pointer argument, and return a result IndexBulkDeleteResult pointer to caller. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com	2021-04-05 13:21:44 -07:00
Tom Lane	dfc843d465	Fix more confusion in SP-GiST. spg_box_quad_leaf_consistent unconditionally returned the leaf datum as leafValue, even though in its usage for poly_ops that value is of completely the wrong type. In versions before 12, that was harmless because the core code did nothing with leafValue in non-index-only scans ... but since commit `2a6368343`, if we were doing a KNN-style scan, spgNewHeapItem would unconditionally try to copy the value using the wrong datatype parameters. Said copying is a waste of time and space if we're not going to return the data, but it accidentally failed to fail until I fixed the datatype confusion in `ac9099fc1`. Hence, change spgNewHeapItem to not copy the datum unless we're actually going to return it later. This saves cycles and dodges the question of whether lossy opclasses are returning the right type. Also change spg_box_quad_leaf_consistent to not return data that might be of the wrong type, as insurance against somebody introducing a similar bug into the core code in future. It seems like a good idea to back-patch these two changes into v12 and v13, although I'm afraid to change spgNewHeapItem's mistaken idea of which datatype to use in those branches. Per buildfarm results from `ac9099fc1`. Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us	2021-04-04 17:57:07 -04:00
Tom Lane	ac9099fc1d	Fix confusion in SP-GiST between attribute type and leaf storage type. According to the documentation, the attType passed to the opclass config function (and also relied on by the core code) is the type of the heap column or expression being indexed. But what was actually being passed was the type stored for the index column. This made no difference for user-defined SP-GiST opclasses, because we weren't allowing the STORAGE clause of CREATE OPCLASS to be used, so the two types would be the same. But it's silly not to allow that, seeing that the built-in poly_ops opclass has a different value for opckeytype than opcintype, and that if you want to do lossy storage then the types must really be different. (Thus, user-defined opclasses doing lossy storage had to lie about what type is in the index.) Hence, remove the restriction, and make sure that we use the input column type not opckeytype where relevant. For reasons of backwards compatibility with existing user-defined opclasses, we can't quite insist that the specified leafType match the STORAGE clause; instead just add an amvalidate() warning if they don't match. Also fix some bugs that would only manifest when trying to return index entries when attType is different from attLeafType. It's not too surprising that these have not been reported, because the only usual reason for such a difference is to store the leaf value lossily, rendering index-only scans impossible. Add a src/test/modules module to exercise cases where attType is different from attLeafType and yet index-only scan is supported. Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us	2021-04-04 14:28:57 -04:00
Tomas Vondra	d9c5b9a9ee	Fix bug in brin_minmax_multi_union When calling sort_expanded_ranges() we need to remember the return value, because the function sorts and also deduplicates the ranges. So the number of ranges may decrease. brin_minmax_multi_union failed to do that, which resulted in crashes due to bogus ranges (equal minval/maxval but not marked as compacted). Reported-by: Jaime Casanova Discussion: https://postgr.es/m/20210404052550.GA4376%40ahch-to	2021-04-04 19:36:12 +02:00
Tomas Vondra	1dad2a5ea3	Fix order of parameters in BRIN minmax-multi calls The BRIN minmax-multi consistent function incorrectly assumed it can look up an operator, and then swap the arguments to get the commutator. For example <(a,b) would be called as <(b,a) to get >(a,b). This works when the arguments are of the same type, but with cross-type opclasses this fails. We can't swap <(float4,float8) arguments, for example. Fixed by passing arguments in the right order. Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com	2021-04-04 19:25:41 +02:00
Tomas Vondra	e1fbe1181c	Fix BRIN minmax-multi distance for inet type The distance calculation ignored the mask, unlike the inet comparator, which resulted in negative distance in some cases. Fixed by applying the mask in brin_minmax_multi_distance_inet. I've considered simply calling inetmi() to calculate the delta, but that does not consider mask either. Reviewed-by: Zhihong Yu Discussion: https://postgr.es/m/1a0a7b9d-9bda-e3a2-7fa4-88f15042a051%40enterprisedb.com	2021-04-04 19:23:32 +02:00
Tomas Vondra	7262f2421a	Fix BRIN minmax-multi distance for timetz type The distance calculation ignored the time zone, so the result of (b-a) might have ended negative even if (b > a). Fixed by considering the time zone difference. Reported-by: Jaime Casanova Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com	2021-04-04 19:22:23 +02:00
Tomas Vondra	2b10e0e3c2	Fix BRIN minmax-multi distance for interval type The distance calculation for interval type was treating months as having 31 days, which is inconsistent with the interval comparator (using 30 days). Due to this it was possible to get negative distance (b-a) when (a<b), triggering an assert. Fixed by adopting the same logic as interval_cmp_value. Reported-by: Jaime Casanova Discussion: https://postgr.es/m/CAJKUy5jKH0Xhneau2mNftNPtTy-BVgQfXc8zQkEvRvBHfeUThQ%40mail.gmail.com	2021-04-04 19:19:51 +02:00
Tom Lane	1ebdec8c03	Rethink handling of pass-by-value leaf datums in SP-GiST. The existing convention in SP-GiST is that any pass-by-value datatype is stored in Datum representation, i.e. it's of width sizeof(Datum) even when typlen is less than that. This is okay, or at least it's too late to change it, for prefix datums and node-label datums in inner (upper) tuples. But it's problematic for leaf datums, because we'd prefer those to be stored in Postgres' standard on-disk representation so that we can easily extend leaf tuples to carry additional "included" columns. I believe, however, that we can get away with just up and changing that. This would be an unacceptable on-disk-format break, but there are two big mitigating factors: 1. It seems quite unlikely that there are any SP-GiST opclasses out there that use pass-by-value leaf datatypes. Certainly none of the ones in core do, nor has codesearch.debian.net heard of any. Given what SP-GiST is good for, it's hard to conceive of a use-case where the leaf-level values would be both small and fixed-width. (As an example, if you wanted to index text values with the leaf level being just a byte, then every text string would have to be represented with one level of inner tuple per preceding byte, which would be horrendously space-inefficient and slow to access. You always want to use as few inner-tuple levels as possible, leaving as much as possible in the leaf values.) 2. Even granting that you have such an index, this change only breaks things on big-endian machines. On little-endian, the high order bytes of the Datum format will now just appear to be alignment padding space. So, change the code to store pass-by-value leaf datums in their usual on-disk form. Inner-tuple datums are not touched. This is extracted from a larger patch that intends to add support for "included" columns. I'm committing it separately for visibility in our commit logs. 
Pavel Borisov and Tom Lane, reviewed by Andrey Borodin Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com	2021-04-01 17:55:17 -04:00
Noah Misch	0ff8bbdee1	Accept slightly-filled pages for tuples larger than fillfactor. We always inserted a larger-than-fillfactor tuple into a newly-extended page, even when existing pages were empty or contained nothing but an unused line pointer. This was unnecessary relation extension. Start tolerating page usage up to 1/8 the maximum space that could be taken up by line pointers. This is somewhat arbitrary, but it should allow more cases to reuse pages. This has no effect on tables with fillfactor=100 (the default). John Naylor and Floris van Nee. Reviewed by Matthias van de Meent. Reported by Floris van Nee. Discussion: https://postgr.es/m/6e263217180649339720afe2176c50aa@opammb0562.comp.optiver.com	2021-03-30 18:53:44 -07:00
David Rowley	af527705ed	Adjust design of per-worker parallel seqscan data struct The design of the data structures which allow storage of the per-worker memory during parallel seq scans were not ideal. The work done in `56788d215` required an additional data structure to allow workers to remember the range of pages that had been allocated to them for processing during a parallel seqscan. That commit added a void pointer field to TableScanDescData to allow heapam to store the per-worker allocation information. However putting the field there made very little sense given that we have AM specific structs for that, e.g. HeapScanDescData. Here we remove the void pointer field from TableScanDescData and add a dedicated field for this purpose to HeapScanDescData. Previously we also allocated memory for this parallel per-worker data for all scans, regardless if it was a parallel scan or not. This was just a wasted allocation for non-parallel scans, so here we make the allocation conditional on the scan being parallel. Also, add previously missing pfree() to free the per-worker data in heap_endscan(). Reported-by: Andres Freund Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20210317023101.anvejcfotwka6gaa@alap3.anarazel.de	2021-03-30 10:17:09 +13:00
Tomas Vondra	73b96bad4a	Fix alignment in BRIN minmax-multi deserialization The deserialization failed to ensure correct alignment, as it assumed it can simply point into the serialized value. The serialization however ignores alignment and copies just the significant bytes in order to make the result as small as possible. This caused failures on systems that are sensitive to misaligned addresses, like sparc, or with address sanitizer enabled. Fixed by copying the serialized data to ensure proper alignment. While at it, fix an issue with serialization on big endian machines, using the same store_att_byval/fetch_att trick as extended statistics. Discussion: https://postgr.es/0c8c3304-d3dd-5e29-d5ac-b50589a23c8c%40enterprisedb.com	2021-03-26 16:48:36 +01:00
Tomas Vondra	ab596105b5	BRIN minmax-multi indexes Adds BRIN opclasses similar to the existing minmax, except that instead of summarizing the page range into a single [min,max] range, the summary consists of multiple ranges and/or points, allowing gaps. This allows more efficient handling of data with poor correlation to physical location within the table and/or outlier values, for which the regular minmax opclasses tend to work poorly. It's possible to specify the number of values kept for each page range, either as a single point or an interval boundary. CREATE TABLE t (a int); CREATE INDEX ON t USING brin (a int4_minmax_multi_ops(values_per_range=16)); When building the summary, the values are combined into intervals with the goal to minimize the "covering" (sum of interval lengths), using a support procedure computing distance between two values. Bump catversion, due to various catalog changes. Author: Tomas Vondra <tomas.vondra@postgresql.org> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru> Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com	2021-03-26 13:54:30 +01:00
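
The minmax-multi commit (ab596105b5) describes combining values into intervals so that the "covering" (sum of interval lengths) stays small. The following is a minimal sketch of that idea in C, not the code from the commit: the `Range` struct and the `merge_closest` and `covering` functions are hypothetical names, the distance is plain integer subtraction rather than a per-type support procedure, and the greedy "merge the smallest gap" strategy is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical closed interval [lo, hi]; a single point has lo == hi. */
typedef struct
{
    int lo;
    int hi;
} Range;

/*
 * Reduce n sorted, non-overlapping ranges to at most max_ranges entries by
 * repeatedly merging the two neighbours separated by the smallest gap.
 * The array is modified in place; the new number of ranges is returned.
 * (Illustrative only; assumes integer values and subtraction as distance.)
 */
static size_t
merge_closest(Range *r, size_t n, size_t max_ranges)
{
    assert(max_ranges >= 1);

    while (n > max_ranges)
    {
        size_t best = 0;
        long   best_gap = (long) r[1].lo - (long) r[0].hi;

        /* Find the pair of neighbours with the smallest gap between them. */
        for (size_t i = 1; i + 1 < n; i++)
        {
            long gap = (long) r[i + 1].lo - (long) r[i].hi;

            if (gap < best_gap)
            {
                best_gap = gap;
                best = i;
            }
        }

        /* Merge r[best + 1] into r[best], then close the hole in the array. */
        r[best].hi = r[best + 1].hi;
        for (size_t i = best + 1; i + 1 < n; i++)
            r[i] = r[i + 1];
        n--;
    }
    return n;
}

/* Sum of interval lengths, i.e. the "covering" of the summary. */
static long
covering(const Range *r, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++)
        total += (long) r[i].hi - (long) r[i].lo;
    return total;
}
```

For example, starting from the points 1, 2, 3, 100, 101 and 500 with room for three ranges, this sketch keeps the outlier 500 as its own point and produces [1,3], [100,101] and [500,500], whereas a single [min,max] summary would have to cover everything from 1 to 500.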

4603 Commits