postgresql

Commit Graph

Author	SHA1	Message	Date
Tomas Vondra	c1ec02be1d	Reuse BrinDesc and BrinRevmap in brininsert The brininsert code used to initialize (and destroy) BrinDesc and BrinRevmap for each tuple, which is not free. This patch initializes these structures only once, and reuses them for all inserts in the same command. The data is passed through indexInfo->ii_AmCache. This also introduces an optional AM callback "aminsertcleanup" that allows performing custom cleanup in case simply pfree-ing ii_AmCache is not sufficient (which is the case when the cache contains TupleDesc, Buffers, and so on). Author: Soumyadeep Chakraborty Reviewed-by: Alvaro Herrera, Matthias van de Meent, Tomas Vondra Discussion: https://postgr.es/m/CAE-ML%2B9r2%3DaO1wwji1sBN9gvPz2xRAtFUGfnffpd0ZqyuzjamA%40mail.gmail.com	2023-11-25 20:27:28 +01:00
Peter Eisentraut	611806cd72	Add trailing commas to enum definitions Since C99, there can be a trailing comma after the last value in an enum definition. A lot of new code has been introducing this style on the fly. Some new patches are now taking an inconsistent approach to this. Some add the last comma on the fly if they add a new last value, some are trying to preserve the existing style in each place, some are even dropping the last comma if there was one. We could nudge this all in a consistent direction if we just add the trailing commas everywhere once. I omitted a few places where there was a fixed "last" value that will always stay last. I also skipped the header files of libpq and ecpg, in case people want to use those with older compilers. There were also a small number of cases where the enum type wasn't used anywhere (but the enum values were), which ended up confusing pgindent a bit, so I left those alone. Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org	2023-10-26 09:20:54 +02:00
Jeff Davis	00d7fb5e2e	Assert that buffers are marked dirty before XLogRegisterBuffer(). Enforce the rule from transam/README in XLogRegisterBuffer(), and update callers to follow the rule. Hash indexes sometimes register clean pages as a part of the locking protocol, so provide a REGBUF_NO_CHANGE flag to support that use. Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi Reviewed-by: Heikki Linnakangas	2023-10-23 17:17:46 -07:00
Robert Haas	afd12774ae	During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point. This allows tools that read the WAL sequentially to identify (possible) redo points when they're reached, rather than only being able to detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly much later in the WAL stream. There are other possible applications as well; see the discussion links below. Any redo location that precedes the checkpoint location should now point to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this. While adjusting the code in CreateCheckPoint() for this patch, I made it call WALInsertLockAcquireExclusive a bit later than before, since there appears to be no need for it to be held while checking whether the system is idle, whether this is an end-of-recovery checkpoint, or what the current timeline is. Bump XLOG_PAGE_MAGIC. Patch by me, based in part on earlier work from Dilip Kumar. Review by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com	2023-10-19 14:47:29 -04:00
Nathan Bossart	8d140c5822	Improve the naming in wal_sync_method code. * sync_method is renamed to wal_sync_method. * sync_method_options[] is renamed to wal_sync_method_options[]. * assign_xlog_sync_method() is renamed to assign_wal_sync_method(). * The names of the available synchronization methods are now prefixed with "WAL_SYNC_METHOD_" and have been moved into a WalSyncMethod enum. * PLATFORM_DEFAULT_SYNC_METHOD is renamed to PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is renamed to DEFAULT_WAL_SYNC_METHOD. These more descriptive names help distinguish the code for wal_sync_method from the code for DataDirSyncMethod (e.g., the recovery_init_sync_method configuration parameter and the --sync-method option provided by several frontend utilities). This change also prevents name collisions between the aforementioned sets of code. Since this only improves the naming of internal identifiers, there should be no behavior change. Author: Maxim Orlov Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com	2023-10-13 15:16:45 -05:00
Peter Eisentraut	1d91d24d9a	Add const to values and nulls arguments This excludes any changes that would change the external AM APIs. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org	2023-10-10 07:50:43 +02:00
Alexander Korotkov	e0b1ee17dc	Skip checking of scan keys required for directional scan in B-tree Currently, B-tree code matches every scan key to every item on the page. Imagine the ordered B-tree scan for the query like this. SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col; The (col > 'a') scan key will be always matched once we find the location to start the scan. The (col < 'b') scan key will match every item on the page as long as it matches the last item on the page. This patch implements prechecking of the scan keys required for directional scan on beginning of page scan. If precheck is successful we can skip this scan keys check for the items on the page. That could lead to significant acceleration especially if the comparison operator is expensive. Idea from patch by Konstantin Knizhnik. Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru Reviewed-by: Peter Geoghegan, Pavel Borisov	2023-10-06 10:40:51 +03:00
Peter Eisentraut	04e485273b	Move BuildDescForRelation() from tupdesc.c to tablecmds.c BuildDescForRelation() main job is to convert ColumnDef lists to pg_attribute/tuple descriptor arrays, which is really mostly an internal subroutine of DefineRelation() and some related functions, which is more the remit of tablecmds.c and doesn't have much to do with the basic tuple descriptor interfaces in tupdesc.c. This is also supported by observing the header includes we can remove in tupdesc.c. By moving it over, we can also (in the future) make BuildDescForRelation() use more internals of tablecmds.c that are not sensible to be exposed in tupdesc.c. Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da@eisentraut.org	2023-10-05 16:20:46 +02:00
Robert Haas	1ccc1e05ae	Remove retry loop in heap_page_prune(). The retry loop is needed because heap_page_prune() calls HeapTupleSatisfiesVacuum() and then lazy_scan_prune() does the same thing again, and they might get different answers due to concurrent clog updates. But this patch makes heap_page_prune() return the HeapTupleSatisfiesVacuum() results that it computed back to the caller, which allows lazy_scan_prune() to avoid needing to recompute those values in the first place. That's nice both because it eliminates the need for a retry loop and also because it's cheaper. Melanie Plageman, reviewed by David Geier, Andres Freund, and me. Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com	2023-10-02 11:40:07 -04:00
Heikki Linnakangas	f0bd0b4489	Add rmgrdesc README In the README, briefly explain what rmgrdesc functions are, and why they are in a separate directory. Commit `c03c2eae0a` added some guidelines on the preferred output format; move that to the README too. Reviewed-by: Melanie Plageman, Peter Geoghegan Discussion: https://www.postgresql.org/message-id/9159daf7-f42d-781b-458f-1b2cf32cb256%40iki.fi	2023-10-02 12:18:57 +03:00
Noah Misch	e1f95ec8cf	Correct assertion and comments about XLogRecordMaxSize. The largest allocation, of xl_tot_len+8192, is in allocate_recordbuf(). Discussion: https://postgr.es/m/20230812211327.GB2326466@rfd.leadboat.com	2023-10-01 12:20:55 -07:00
Peter Geoghegan	714780dcdd	Fix btmarkpos/btrestrpos array key wraparound bug. nbtree's mark/restore processing failed to correctly handle an edge case involving array key advancement and related search-type scan key state. Scans with ScalarArrayScalarArrayOpExpr quals requiring mark/restore processing (for a merge join) could incorrectly conclude that an affected array/scan key must not have advanced during the time between marking and restoring the scan's position. As a result of all this, array key handling within btrestrpos could skip a required call to _bt_preprocess_keys(). This confusion allowed later primitive index scans to overlook tuples matching the true current array keys. The scan's search-type scan keys would still have spurious values corresponding to the final array element(s) -- not values matching the first/now-current array element(s). To fix, remember that "array key wraparound" has taken place during the ongoing btrescan in a flag variable stored in the scan's state, and use that information at the point where btrestrpos decides if another call to _bt_preprocess_keys is required. Oversight in commit `70bc5833`, which taught nbtree to handle array keys during mark/restore processing, but missed this subtlety. That commit was itself a bug fix for an issue in commit `9e8da0f7`, which taught nbtree to handle ScalarArrayOpExpr quals natively. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com Backpatch: 11- (all supported branches).	2023-09-28 16:29:37 -07:00
Robert Haas	4e9fc3a976	Return data from heap_page_prune via a struct. Previously, one of the values in the struct was returned as the return value, and another was returned via an output parameter. In preparation for returning more stuff, consolidate both values into a struct returned via an output parameter. Melanie Plageman, reviewed by Andres Freund and by me. Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com	2023-09-28 10:36:34 -04:00
Peter Eisentraut	ebf76f2753	Add TupleDescGetDefault() This unifies some repetitive code. Note: I didn't push the "not found" error message into the new function, even though all existing callers would be able to make use of it. Using the existing error handling as-is would probably require exposing the Relation type via tupdesc.h, which doesn't seem desirable. (Or even if we changed it to just report the OID, it would inject the concept of a relation containing the tuple descriptor into tupdesc.h, which might be a layering violation. Perhaps some further improvements could be considered here separately.) Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da%40eisentraut.org	2023-09-27 18:52:40 +01:00
Peter Eisentraut	b0ae29512c	MergeAttributes() and related variable renaming Mainly, rename "schema" to "columns" and related changes. The previous naming has long been confusing. Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da%40eisentraut.org	2023-09-26 16:08:35 +01:00
Thomas Munro	9f0602539d	Remove some more "snapshot too old" vestiges. Commit `f691f5b8` removed the logic, but left behind some now-useless Snapshot arguments to various AM-internal functions, and missed a couple of comments. Reported-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wznj9qSNXZ1P1uWTUD_FeaTezbUazb416EPwi4Qr_jR_6A%40mail.gmail.com	2023-09-08 17:12:12 +12:00
Thomas Munro	f691f5b80a	Remove the "snapshot too old" feature. Remove the old_snapshot_threshold setting and mechanism for producing the error "snapshot too old", originally added by commit `848ef42b`. Unfortunately it had a number of known problems in terms of correctness and performance, mostly reported by Andres in the course of his work on snapshot scalability. We agreed to remove it, after a long period without an active plan to fix it. This is certainly a desirable feature, and someone might propose a new or improved implementation in the future. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com	2023-09-05 19:53:43 +12:00
Andres Freund	82a4edabd2	hio: Take number of prior relation extensions into account The new relation extension logic, introduced in `00d1e02be2`, could lead to slowdowns in some scenarios. E.g., when loading narrow rows into a table using COPY, the caller of RelationGetBufferForTuple() will only request a small number of pages. Without concurrency, we just extended using pwritev() in that case. However, if there is some concurrency, we switched between extending by a small number of pages and a larger number of pages, depending on the number of waiters for the relation extension logic. However, some filesystems, XFS in particular, do not perform well when switching between extending files using fallocate() and pwritev(). To avoid that issue, remember the number of prior relation extensions in BulkInsertState and extend more aggressively if there were prior relation extensions. That not just avoids the aforementioned slowdown, but also leads to noticeable performance gains in other situations, primarily due to extending more aggressively when there is no concurrency. I should have done it this way from the get go. Reported-by: Masahiko Sawada <sawada.mshk@gmail.com> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com Backpatch: 16-, where the new relation extension code was added	2023-08-14 11:33:09 -07:00
Peter Geoghegan	d088ba5a5a	nbtree: Allocate new pages in separate function. Split nbtree's _bt_getbuf function is two: code that read locks or write locks existing pages remains in _bt_getbuf, while code that deals with allocating new pages is moved to a new, dedicated function called _bt_allocbuf. This simplifies most _bt_getbuf callers, since it is no longer necessary for them to pass a heaprel argument. Many of the changes to nbtree from commit `61b313e4` can be reverted. This minimizes the divergence between HEAD/PostgreSQL 16 and earlier release branches. _bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to _bt_getbuf. There are only 3 affected call sites, all of which continue to pass a heaprel for recovery conflict purposes. Note that nbtree's use of P_NEW was superficial; nbtree never actually relied on the P_NEW code paths in bufmgr.c, so this change is strictly mechanical. GiST already took the same approach; it has a dedicated function for allocating new pages called gistNewBuffer(). That factor allowed commit `61b313e4` to make much more targeted changes to GiST. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com	2023-06-10 14:08:25 -07:00
Tom Lane	0245f8db36	Pre-beta mechanical code beautification. Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql	2023-05-19 17:24:48 -04:00
Tomas Vondra	3581cbdcd6	Fix handling of empty ranges and NULLs in BRIN BRIN indexes did not properly distinguish between summaries for empty (no rows) and all-NULL ranges, treating them as essentially the same thing. Summaries were initialized with allnulls=true, and opclasses simply reset allnulls to false when processing the first non-NULL value. This however produces incorrect results if the range starts with a NULL value (or a sequence of NULL values), in which case we forget the range contains NULL values when adding the first non-NULL value. This happens because the allnulls flag is used for two separate purposes - to mark empty ranges (not representing any rows yet) and ranges containing only NULL values. Opclasses don't know which of these cases it is, and so don't know whether to set hasnulls=true. Setting the flag in both cases would make it correct, but it would also make BRIN indexes useless for queries with IS NULL clauses. All ranges start empty (and thus allnulls=true), so all ranges would end up with either allnulls=true or hasnulls=true. The severity of the issue is somewhat reduced by the fact that it only happens when adding values to an existing summary with allnulls=true. This can happen e.g. for small tables (because a summary for the first range exists for all BRIN indexes), or for tables with large fraction of NULL values in the indexed columns. Bulk summarization (e.g. during CREATE INDEX or automatic summarization) that processes all values at once is not affected by this issue. In this case the flags were updated in a slightly different way, not forgetting the NULL values. To identify empty ranges we use a new flag, stored in an unused bit in the BRIN tuple header so the on-disk format remains the same. A matching flag is added to BrinMemTuple, into a 3B gap after bt_placeholder. That means there's no risk of ABI breakage, although we don't actually pass the BrinMemTuple to any public API. We could also skip storing index tuples for empty summaries, but then we'd have to always process such ranges - even if there are no rows in large parts of the table (e.g. after a bulk DELETE), it would still require reading the pages etc. So we store them, but ignore them when building the bitmap. Backpatch to 11. The issue exists since BRIN indexes were introduced in 9.5, but older releases are already EOL. Backpatch-through: 11 Reviewed-by: Justin Pryzby, Matthias van de Meent, Alvaro Herrera Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80afff3b@enterprisedb.com	2023-05-19 01:29:44 +02:00
Michael Paquier	8961cb9a03	Fix typos in comments The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names. Author: Alexander Lakhin Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com	2023-05-02 12:23:08 +09:00
Peter Geoghegan	50547a3fae	Fix wal_consistency_checking enhanced desc output. Recent enhancements to rmgr desc routines that made the output summarize certain block data (added by commits `7d8219a4` and `1c453cfd`) dealt with records that lack relevant block data (and so have nothing to give a more detailed summary of) by testing !DecodedBkpBlock.has_image. As a result, more detailed descriptions of block data were not output when wal_consistency_checking was enabled. This bug affected records with summarizable block data that also happened to have an FPI that the REDO routine isn't supposed to apply (FPIs used for consistency checking purposes only). The presence of such an FPI was incorrectly taken to indicate the absence of block data. To fix, test DecodedBkpBlock.has_data, not !DecodedBkpBlock.has_image. This is the exact condition that we care about, not an inexact proxy. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wzm5Sc9cBg1qWV_cEBfLNJCrW9FjS-SoHVt8FLA7Ldn8yg@mail.gmail.com	2023-04-19 10:42:39 -07:00
Peter Geoghegan	06e0652750	Remove useless argument from nbtree dedup function. _bt_dedup_pass()'s heapRel argument hasn't been needed or used since commit `cf2acaf4dc` made deleting any existing LP_DEAD index tuples the caller's responsibility.	2023-04-18 10:33:15 -07:00
David Rowley	eef231e816	Fix some typos and some incorrectly duplicated words Author: Justin Pryzby Reviewed-by: David Rowley Discussion: https://postgr.es/m/ZD3D1QxoccnN8A1V@telsasoft.com	2023-04-18 14:03:49 +12:00
David Rowley	b4dbf3e924	Fix various typos This fixes many spelling mistakes in comments, but a few references to invalid parameter names, function names and option names too in comments and also some in string constants Also, fix an #undef that was undefining the incorrect definition Author: Alexander Lakhin Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/d5f68d19-c0fc-91a9-118d-7c6a5a3f5fad@gmail.com	2023-04-18 13:23:23 +12:00
Peter Geoghegan	cd7cdc550c	Fix incorrect comment about nbtree WAL record. The nbtree VACUUM WAL record stores its page offset number payload in blk 0 (just like the closely related nbtree DELETE WAL record). Commit `ebd551f5` fixed a similar issue with the DELETE WAL record, but missed this one.	2023-04-17 09:58:18 -07:00
Peter Geoghegan	c03c2eae0a	Refine the guidelines for rmgrdesc authors. Clarify the goals of the recently added guidelines for rmgrdesc authors: to avoid gratuitous inconsistencies across resource managers, and to make it reasonably easy to write a reusable custom parser. Beyond that, the guidelines leave rmgrdesc authors with a significant amount of leeway. This even includes the leeway to invent custom conventions (in cases where it's warranted). Follow-up to commit `7d8219a4`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com	2023-04-11 15:26:24 -07:00
Peter Geoghegan	e944063294	Fix xl_heap_lock WAL record field's data type. Make xl_heap_lock's infobits_set field of type uint8, not int8. Using int8 isn't appropriate given that the field just holds status bits. This fixes an oversight in commit `0ac5ad5134`. In passing rename the nearby TransactionId field to "xmax" to make things consistency with related records, such as xl_heap_lock_updated. Deliberately avoid a bump in XLOG_PAGE_MAGIC. No backpatch, either. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkCd3kOS8b7Rfxw7Mh1_6jvX=Nzo-CWR1VBTiOtVZkWHA@mail.gmail.com	2023-04-11 14:07:54 -07:00
Peter Geoghegan	5d6728e588	Fix nbtree posting list update desc output. We cannot use the generic array_desc approach with per-tuple nbtree posting list update metadata because array_desc can only deal with fixed width elements (e.g., page offset numbers). Using array_desc led to incorrect rmgr descriptions for updates from nbtree DELETE/VACUUM WAL records. To fix, add specialized code to describe the update metadata as array elements in desc output. We now iterate over the update metadata using an approach that matches related REDO routines. Also stop showing the updates offset number array separately in nbtree DELETE/VACUUM desc output. It's redundant information, since the same page offset numbers appear in the description of each individual update element. Also make some small tweaks to the way that we format arrays in all desc routines (not just nbtree desc routines) to make arrays a little less verbose. Oversight in commit `1c453cfd`, which enhanced the nbtree rmgr desc routines. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com	2023-04-10 11:15:41 -07:00
Andres Freund	0fdab27ad6	Allow logical decoding on standbys Unsurprisingly, this requires wal_level = logical to be set on the primary and standby. The infrastructure added in `26669757b6` ensures that slots are invalidated if the primary's wal_level is lowered. Creating a slot on a standby waits for a xl_running_xact record to be processed. If the primary is idle (and thus not emitting xl_running_xact records), that can take a while. To make that faster, this commit also introduces the pg_log_standby_snapshot() function. By executing it on the primary, completion of slot creation on the standby can be accelerated. Note that logical decoding on a standby does not itself enforce that required catalog rows are not removed. The user has to use physical replication slots + hot_standby_feedback or other measures to prevent that. If catalog rows required for a slot are removed, the slot is invalidated. See `6af1793954` for an overall design of logical decoding on a standby. Bumps catversion, for the addition of the pg_log_standby_snapshot() function. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> (in an older version) Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: FabrÌzio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com>	2023-04-08 02:20:05 -07:00
Peter Geoghegan	1c453cfd89	Show more detail in nbtree rmgr descriptions. Show a detailed description of the page offset number arrays that appear in certain nbtree WAL records. Also brings nbtree desc routines in line with the guidelines established by recent commit `7d8219a4`. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de	2023-04-07 16:46:23 -07:00
Peter Geoghegan	7d8219a444	Show more detail in heapam rmgr descriptions. Add helper functions that output arrays in a standard format, and use the functions inside heapdesc routines. This allows tools like pg_walinspect to show a detailed description of the page offset number arrays for records like PRUNE and VACUUM (unless there was an FPI). Also document the conventions that desc routines should follow. Only the heapdesc routines follow the conventions for now, so they're just guidelines for the time being. Based on a suggestion from Andres Freund. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de	2023-04-07 16:08:52 -07:00
Michael Paquier	8fcb32db98	Add more protections in WAL record APIs against overflows This commit adds a limit to the size of an XLogRecord at 1020MB, based on a suggestion by Heikki Linnakangas. This counts for the overhead needed by the XLogReader when allocating the memory it needs to read a record in DecodeXLogRecordRequiredSpace(), based on the record size. An assertion based on that is added to detect that any additions in the XLogReader facilities would not cause any overflows. If that's ever the case, the upper bound allowed would need to be adjusted. Before this, it was possible for an external module to create WAL records large enough to be assembled but not replayable, causing failures when replaying such WAL records on standbys. One case mentioned where this is possible is the in-core function pg_logical_emit_message() (wrapper for LogLogicalMessage), that allows to emit WAL records with an arbitrary amount of data potentially higher than the replay limit of approximately 1GB (limit of a palloc, minus the overhead needed by a XLogReader). This commit is a follow-up of `ffd1b6b` that has added similar protections for the block-level data. Here, the checks are extended to the whole record length, mainrdata_len being extended from uint32 to uint64 with the routines registering buffer and record data still limited to uint32 to minimize the checks when assembling a record. All the error messages related to overflow checks are improved to provide more context about the error happening. Author: Matthias van de Meent Reviewed-by: Andres Freund, Heikki Linnakangas, Michael Paquier Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com	2023-04-07 10:10:17 +09:00
Andres Freund	00d1e02be2	hio: Use ExtendBufferedRelBy() to extend tables more efficiently While we already had some form of bulk extension for relations, it was fairly limited. It only amortized the cost of acquiring the extension lock, the relation itself was still extended one-by-one. Bulk extension was also solely triggered by contention, not by the amount of data inserted. To address this, use ExtendBufferedRelBy(), introduced in `31966b151e`, to extend the relation. We try to extend the relation by multiple blocks in two situations: 1) The caller tells RelationGetBufferForTuple() that it will need multiple pages. For now that's only used by heap_multi_insert(), see commit FIXME. 2) If there is contention on the extension lock, use the number of waiters for the lock as a multiplier for the number of blocks to extend by. This is similar to what we already did. Previously we additionally multiplied the numbers of waiters by 20, but with the new relation extension infrastructure I could not see a benefit in doing so. Using the freespacemap to provide empty pages can cause significant contention, and adds measurable overhead, even if there is no contention. To reduce that, remember the blocks the relation was extended by in the BulkInsertState, in the extending backend. In case 1) from above, the blocks the extending backend needs are not entered into the FSM, as we know that we will need those blocks. One complication with using the FSM to record empty pages, is that we need to insert blocks into the FSM, when we already hold a buffer content lock. To avoid doing IO while holding a content lock, release the content lock before recording free space. Currently that opens a small window in which another backend could fill the block, if a concurrent VACUUM records the free space. If that happens, we retry, similar to the already existing case when otherBuffer is provided. In the future it might be worth closing the race by preventing VACUUM from recording the space in newly extended pages. This change provides very significant wins (3x at 16 clients, on my workstation) for concurrent COPY into a single relation. Even single threaded COPY is measurably faster, primarily due to not dirtying pages while extending, if supported by the operating system (see commit `4d330a61bb`). Even single-row INSERTs benefit, although to a much smaller degree, as the relation extension lock rarely is the primary bottleneck. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de	2023-04-06 16:53:17 -07:00
Andres Freund	5279e9db8e	heapam: Pass number of required pages to RelationGetBufferForTuple() A future commit will use this information to determine how aggressively to extend the relation by. In heap_multi_insert() we know accurately how many pages we need once we need to extend the relation, providing an accurate lower bound for how much to extend. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de	2023-04-06 16:17:16 -07:00
Peter Geoghegan	e48c817395	Recycle deleted nbtree pages more aggressively. Commit `61b313e4` made nbtree consistently pass down a heaprel to low level routines like _bt_getbuf(). Although this was primarily intended as preparation for logical decoding on standbys, it also made it easy to correct an old deficiency in how nbtree VACUUM determines whether or not it's now safe to recycle deleted pages. Pass the heaprel to GlobalVisTestFor() in nbtree routines that deal with recycle safety. nbtree now makes less pessimistic assumptions about recycle safety within non-catalog relations. This enhancement complements the recycling enhancement added by commit `9dd963ae25`. nbtree remains just as pessimistic as ever when it comes to recycle safety within indexes on catalog relations. There is no fundamental reason why we need to treat catalog relations differently, though. The behavioral inconsistency is a consequence of the way that nbtree uses nextXID values to implement what Lanin and Shasha call "the drain technique". Note in particular that it has nothing to do with whether or not index tuples might still be required for an older MVCC snapshot. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkaiDxCje0yPuH=3Uh2p1V_2pFGY==xfbZoZu7Ax_NB8g@mail.gmail.com	2023-04-03 11:31:43 -07:00
Peter Geoghegan	a349b86603	Move heaprel struct field next to index rel field. Commit `61b313e4` added a heaprel struct member to IndexVacuumInfo, but placed it last. Move the heaprel struct member next to the index struct member to improve the code's readability. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WznG=TV6S9d3VA=y0vBHbXwnLs9_LLdiML=aNJuHeriwxg@mail.gmail.com	2023-04-03 11:01:11 -07:00
Alexander Korotkov	2b65bf046d	Revert `11470f544e` Discussion: https://postgr.es/m/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de	2023-04-03 16:54:31 +03:00
Andres Freund	6af1793954	Add info in WAL records in preparation for logical slot conflict handling This commit only implements one prerequisite part for allowing logical decoding. The commit message contains an explanation of the overall design, which later commits will refer back to. Overall design: 1. We want to enable logical decoding on standbys, but replay of WAL from the primary might remove data that is needed by logical decoding, causing error(s) on the standby. To prevent those errors, a new replication conflict scenario needs to be addressed (as much as hot standby does). 2. Our chosen strategy for dealing with this type of replication slot is to invalidate logical slots for which needed data has been removed. 3. To do this we need the latestRemovedXid for each change, just as we do for physical replication conflicts, but we also need to know whether any particular change was to data that logical replication might access. That way, during WAL replay, we know when there is a risk of conflict and, if so, if there is a conflict. 4. We can't rely on the standby's relcache entries for this purpose in any way, because the startup process can't access catalog contents. 5. Therefore every WAL record that potentially removes data from the index or heap must carry a flag indicating whether or not it is one that might be accessed during logical decoding. Why do we need this for logical decoding on standby? First, let's forget about logical decoding on standby and recall that on a primary database, any catalog rows that may be needed by a logical decoding replication slot are not removed. This is done thanks to the catalog_xmin associated with the logical replication slot. But, with logical decoding on standby, in the following cases: - hot_standby_feedback is off - hot_standby_feedback is on but there is no a physical slot between the primary and the standby. Then, hot_standby_feedback will work, but only while the connection is alive (for example a node restart would break it) Then, the primary may delete system catalog rows that could be needed by the logical decoding on the standby (as it does not know about the catalog_xmin on the standby). So, it’s mandatory to identify those rows and invalidate the slots that may need them if any. Identifying those rows is the purpose of this commit. Implementation: When a WAL replay on standby indicates that a catalog table tuple is to be deleted by an xid that is greater than a logical slot's catalog_xmin, then that means the slot's catalog_xmin conflicts with the xid, and we need to handle the conflict. While subsequent commits will do the actual conflict handling, this commit adds a new field isCatalogRel in such WAL records (and a new bit set in the xl_heap_visible flags field), that is true for catalog tables, so as to arrange for conflict handling. The affected WAL records are the ones that already contain the snapshotConflictHorizon field, namely: - gistxlogDelete - gistxlogPageReuse - xl_hash_vacuum_one_page - xl_heap_prune - xl_heap_freeze_page - xl_heap_visible - xl_btree_reuse_page - xl_btree_delete - spgxlogVacuumRedirect Due to this new field being added, xl_hash_vacuum_one_page and gistxlogDelete do now contain the offsets to be deleted as a FLEXIBLE_ARRAY_MEMBER. This is needed to ensure correct alignment. It's not needed on the others struct where isCatalogRel has been added. This commit just introduces the WAL format changes mentioned above. Handling the actual conflicts will follow in future commits. Bumps XLOG_PAGE_MAGIC as the several WAL records are changed. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> (in an older version) Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>	2023-04-02 12:32:19 -07:00
Andres Freund	61b313e47e	Pass down table relation into more index relation functions This is done in preparation for logical decoding on standby, which needs to include whether visibility affecting WAL records are about a (user) catalog table. Which is only known for the table, not the indexes. It's also nice to be able to pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder(). Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com	2023-04-01 20:18:29 -07:00
Alexander Korotkov	11470f544e	Allow locking updated tuples in tuple_update() and tuple_delete() Currently, in read committed transaction isolation mode (default), we have the following sequence of actions when tuple_update()/tuple_delete() finds the tuple updated by concurrent transaction. 1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which returns TM_Updated. 2. Lock tuple with tuple_lock(). 3. Re-evaluate plan qual (recheck if we still need to update/delete and calculate the new tuple for update). 4. Second attempt to update/delete tuple with tuple_update()/tuple_delete(). This attempt should be successful, since the tuple was previously locked. This patch eliminates step 2 by taking the lock during first tuple_update()/tuple_delete() call. Heap table access method saves some efforts by checking the updated tuple once instead of twice. Future undo-based table access methods, which will start from the latest row version, can immediately place a lock there. The code in nodeModifyTable.c is simplified by removing the nested switch/case. Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp Reviewed-by: Andres Freund, Chris Travers	2023-03-23 00:26:59 +03:00
Tomas Vondra	19d8e2308b	Ignore BRIN indexes when checking for HOT updates When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed by block summarizing indexes without references to individual tuples that need to be cleaned up. A new type TU_UpdateIndexes provides a signal to the executor to determine which indexes to update - no indexes, all indexes, or only the summarizing indexes. This also removes rd_indexattr list, and replaces it with rd_attrsvalid flag. The list was not used anywhere, and a simple flag is sufficient. This was originally committed as `5753d4ee32`, but then got reverted by `e3fcca0d0d` because of correctness issues. Original patch by Josef Simanek, various fixes and improvements by Tomas Vondra and me. Authors: Matthias van de Meent, Josef Simanek, Tomas Vondra Reviewed-by: Tomas Vondra, Alvaro Herrera Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com	2023-03-20 11:02:42 +01:00
Robert Haas	ebd551f586	Update some incorrect comments about xlog records. The comments claim that certain pieces of data are part of the main WAL record data when in reality they are part of the data for block 0. Repair. Bertrand Drouvot, reviewed by Amit Kapila. Originally reported by me. Discussion: http://postgr.es/m/80db7836-4415-d54a-64c3-66b88b1430e7@gmail.com	2023-03-03 12:52:04 -05:00
Amit Kapila	a6cd1fc692	Change xl_hash_vacuum_one_page.ntuples from int to uint16. This will create two bytes of padding space in xl_hash_vacuum_one_page which can be used for future patches. This makes the datatype of xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which is advisable as both are used for the same purpose. Author: Bertrand Drouvot Reviewed-by: Nathan Bossart Discussion: https://postgr.es/m/b0e20c40-cb7a-fc1c-c607-2a78dac5021e@gmail.com	2023-02-27 08:32:45 +05:30
David Rowley	9ed50ab349	Remove stray duplicated comment in heapam.h This is just the same as what's written under the rs_numblocks field. Reported-by: Melanie Plageman Discussion: https://postgr.es/m/20230207204127.7vs6krqjqn5farr7@liskov	2023-02-08 16:03:26 +13:00
Michael Paquier	2f6e15ac93	Revert refactoring of restore command code to shell_restore.c This reverts commits 24c35ec and 57169ad. PreRestoreCommand() and PostRestoreCommand() need to be put closer to the system() call calling a restore_command, as they enable in_restore_command for the startup process which would in turn trigger an immediate proc_exit() in the SIGTERM handler. Perhaps we could get rid of this behavior entirely, but 24c35ec has made the window where the flag is enabled much larger than it was, and any Postgres-like actions (palloc, etc.) taken by code paths while the flag is enabled could lead to more severe issues in the shutdown processing. Note that curculio has showed that there are much more problems in this area, unrelated to this change, actually, hence the issues related to that had better be addressed first. Keeping the code of HEAD in line with the stable branches should make that a bit easier. Per discussion with Andres Freund and Nathan Bossart. Discussion: https://postgr.es/m/Y979NR3U5VnWrTwB@paquier.xyz	2023-02-06 08:28:42 +09:00
David Rowley	f9bc34fcb6	Further refactor of heapgettup and heapgettup_pagemode Backward and forward scans share much of the same page acquisition code. Here we consolidate that code to reduce some duplication. Additionally, add a new rs_coffset field to HeapScanDescData to track the offset of the current tuple. The new field fits nicely into the padding between a bool and BlockNumber field and saves having to look at the last returned tuple to figure out which offset we should be looking at for the current tuple. Author: Melanie Plageman Reviewed-by: David Rowley Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com	2023-02-03 11:48:39 +13:00
David Rowley	e9aaf06328	Remove dead NoMovementScanDirection code Here remove some dead code from heapgettup() and heapgettup_pagemode() which was trying to support NoMovementScanDirection scans. This code can never be reached as standard_ExecutorRun() never calls ExecutePlan with NoMovementScanDirection. Additionally, plans which were scanning an unordered index would use NoMovementScanDirection rather than ForwardScanDirection. There was no real need for this, so here we adjust this so we use ForwardScanDirection for unordered index scans. A comment in pathnodes.h claimed that NoMovementScanDirection was used for PathKey reasons, but if that was true, it no longer is, per code in build_index_paths(). This does change the non-text format of the EXPLAIN output so that unordered index scans now have a "Forward" scan direction rather than "NoMovement". The text format of EXPLAIN has not changed. Author: Melanie Plageman Reviewed-by: Tom Lane, David Rowley Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com	2023-02-01 10:52:41 +13:00
Michael Paquier	9a740f81eb	Refactor code in charge of running shell-based recovery commands The code specific to the execution of archive_cleanup_command, recovery_end_command and restore_command is moved to a new file named shell_restore.c. The code is split into three functions: - shell_restore(), that attempts the execution of a shell-based restore_command. - shell_archive_cleanup(), for archive_cleanup_command. - shell_recovery_end(), for recovery_end_command. This introduces no functional changes, with failure patterns and logs generated in consequence being the same as before (one case actually generates one less DEBUG2 message "could not restore" when a restore command succeeds but the follow-up stat() to check the size fails, but that only matters with a elevel high enough). This is preparatory work for allowing recovery modules, a facility similar to archive modules, with callbacks shaped similarly to the functions introduced here. Author: Nathan Bossart Reviewed-by: Andres Freund, Michael Paquier Discussion: https://postgr.es/m/20221227192449.GA3672473@nathanxps13	2023-01-16 16:31:43 +09:00

1 2 3 4 5 ...

1731 Commits