postgresql

Commit Graph

Author	SHA1	Message	Date
Peter Eisentraut	3e0242b24c	Message fixes and style improvements	2020-09-14 06:42:30 +02:00
Alvaro Herrera	9f1cf97bb5	Print WAL logical message contents in pg_waldump This helps debuggability when looking at WAL streams containing logical messages. Author: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CAExHW5sWx49rKmXbg5H1Xc1t+nRv9PaYKQmgw82HPt6vWDVmDg@mail.gmail.com	2020-09-10 19:37:02 -03:00
Tom Lane	a5cc4dab6d	Yet more elimination of dead stores and useless initializations. I'm not sure what tool Ranier was using, but the ones I contributed were found by using a newer version of scan-build than I tried before. Ranier Vilela and Tom Lane Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com	2020-09-05 13:17:32 -04:00
Alvaro Herrera	f43e295f68	Report expected contrecord length on mismatch When reading a WAL record fails to find continuation record(s) of the proper length, report what it expects, for clarity. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20200903212152.GA15319@alvherre.pgsql	2020-09-04 14:58:32 -04:00
Tom Lane	38a2d70329	Remove some more useless assignments. Found with clang's scan-build tool. It also whines about a lot of other dead stores that we should not change IMO, either as a matter of style or future-proofing. But these places seem like clear oversights. Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com	2020-09-04 14:32:19 -04:00
Bruce Momjian	e36e936e0e	remove redundant initializations Reported-by: Ranier Vilela Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com Author: Ranier Vilela Backpatch-through: master	2020-09-03 22:57:35 -04:00
Michael Paquier	1d65416661	Improve handling of dropped relations for REINDEX DATABASE/SCHEMA/SYSTEM When multiple relations are reindexed, a scan of pg_class is done first to build the list of relations to work on. However the REINDEX logic has never checked if a relation listed still exists when beginning the work on it, causing for example sudden cache lookup failures. This commit adds safeguards against dropped relations for REINDEX, similarly to VACUUM or CLUSTER where we try to open the relation, ignoring it if it is missing. A new option is added to the REINDEX routines to control if a missed relation is OK to ignore or not. An isolation test, based on REINDEX SCHEMA, is added for the concurrent and non-concurrent cases. Author: Michael Paquier Reviewed-by: Anastasia Lubennikova Discussion: https://postgr.es/m/20200813043805.GE11663@paquier.xyz	2020-09-02 09:08:12 +09:00
Tom Lane	a7212be8b9	Set cutoff xmin more aggressively when vacuuming a temporary table. Since other sessions aren't allowed to look into a temporary table of our own session, we do not need to worry about the global xmin horizon when setting the vacuum XID cutoff. Indeed, if we're not inside a transaction block, we may set oldestXmin to be the next XID, because there cannot be any in-doubt tuples in a temp table, nor any tuples that are dead but still visible to some snapshot of our transaction. (VACUUM, of course, is never inside a transaction block; but we need to test that because CLUSTER shares the same code.) This approach allows us to always clean out a temp table completely during VACUUM, independently of concurrent activity. Aside from being useful in its own right, that simplifies building reproducible test cases. Discussion: https://postgr.es/m/3490536.1598629609@sss.pgh.pa.us	2020-09-01 18:40:43 -04:00
Tom Lane	3d351d916b	Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE. Historically, we've considered the state with relpages and reltuples both zero as indicating that we do not know the table's tuple density. This is problematic because it's impossible to distinguish "never yet vacuumed" from "vacuumed and seen to be empty". In particular, a user cannot use VACUUM or ANALYZE to override the planner's normal heuristic that an empty table should not be believed to be empty because it is probably about to get populated. That heuristic is a good safety measure, so I don't care to abandon it, but there should be a way to override it if the table is indeed intended to stay empty. Hence, represent the initial state of ignorance by setting reltuples to -1 (relpages is still set to zero), and apply the minimum-ten-pages heuristic only when reltuples is still -1. If the table is empty, VACUUM or ANALYZE (but not CREATE INDEX) will override that to reltuples = relpages = 0, and then we'll plan on that basis. This requires a bunch of fiddly little changes, but we can get rid of some ugly kluges that were formerly needed to maintain the old definition. One notable point is that FDWs' GetForeignRelSize methods will see baserel->tuples = -1 when no ANALYZE has been done on the foreign table. That seems like a net improvement, since those methods were formerly also in the dark about what baserel->tuples = 0 really meant. Still, it is an API change. I bumped catversion because code predating this change would get confused by seeing reltuples = -1. Discussion: https://postgr.es/m/F02298E0-6EF4-49A1-BCB6-C484794D9ACC@thebuild.com	2020-08-30 12:21:51 -04:00
Tom Lane	10564ee02c	Fix code for re-finding scan position in a multicolumn GIN index. collectMatchBitmap() needs to re-find the index tuple it was previously looking at, after transiently dropping lock on the index page it's on. The tuple should still exist and be at its prior position or somewhere to the right of that, since ginvacuum never removes tuples but concurrent insertions could add one. However, there was a thinko in that logic, to the effect of expecting any inserted tuples to have the same index "attnum" as what we'd been scanning. Since there's no physical separation of tuples with different attnums, it's not terribly hard to devise scenarios where this fails, leading to transient "lost saved point in index" errors. (While I've duplicated this with manual testing, it seems impossible to make a reproducible test case with our available testing technology.) Fix by just continuing the scan when the attnum doesn't match. While here, improve the error message used if we do fail, so that it matches the wording used in btree for a similar case. collectMatchBitmap()'s posting-tree code path was previously not exercised at all by our regression tests. While I can't make a regression test that exhibits the bug, I can at least improve the code coverage here, so do that. The test case I made for this is an extension of one added by `4b754d6c1`, so it only works in HEAD and v13; didn't seem worth trying hard to back-patch it. Per bug #16595 from Jesse Kinkead. This has been broken since multicolumn capability was added to GIN (commit `27cb66fdf`), so back-patch to all supported branches. Discussion: https://postgr.es/m/16595-633118be8eef9ce2@postgresql.org	2020-08-27 17:36:13 -04:00
Amit Kapila	7e453634bb	Add additional information in the vacuum error context. The additional information added will be an offset number for heap operations. This information will help us in finding the exact tuple due to which the error has occurred. Author: Mahendra Singh Thalor and Amit Kapila Reviewed-by: Sawada Masahiko, Justin Pryzby and Amit Kapila Discussion: https://postgr.es/m/CAKYtNApK488TDF4bMbw+1QH8HJf9cxdNDXquhU50TK5iv_FtCQ@mail.gmail.com	2020-08-26 09:40:52 +05:30
Amit Kapila	a3c66de6c5	Improve the vacuum error context phase information. We were displaying the wrong phase information for 'info' message in the index clean up phase because we were switching to the previous phase a bit early. We were also not displaying context information for heap phase unless the block number is valid which is fine for error cases but for messages at 'info' or lower error level it appears to be inconsistent with index phase information. Reported-by: Sawada Masahiko Author: Sawada Masahiko Reviewed-by: Amit Kapila Backpatch-through: 13, where it was introduced Discussion: https://postgr.es/m/CA+fd4k4HcbhPnCs7paRTw1K-AHin8y4xKomB9Ru0ATw0UeTy2w@mail.gmail.com	2020-08-24 08:16:19 +05:30
Andres Freund	c62a0a49f3	Revert "Make vacuum a bit more verbose to debug BF failure." This reverts commit `49967da65a`. Enough time has passed that we can be confident that `07f32fcd23` resolved the issue. Therefore we can remove the temporary debugging aids. Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/E1k7tGP-0005V0-5k@gemulon.postgresql.org	2020-08-20 12:59:00 -07:00
Heikki Linnakangas	a28d731a11	Mark commit and abort WAL records with XLR_SPECIAL_REL_UPDATE. If a commit or abort record includes "dropped relfilenodes", then replaying the record will remove data files. That is surely a "special rel update", but the records were not marked as such. Fix that, teach pg_rewind to expect and ignore them, and add a test case to cover it. It's always been like this, but no backporting for fear of breaking existing applications. If an application parsed the WAL but was not handling commit/abort records, it would stop working. That might be a good thing if it really needed to handle the dropped rels, but it will be caught when the application is updated to work with PostgreSQL v14 anyway. Discussion: https://www.postgresql.org/message-id/07b33e2c-46a6-86a1-5f9e-a7da73fddb95%40iki.fi Reviewed-by: Amit Kapila, Michael Paquier	2020-08-17 10:52:58 +03:00
Andres Freund	49967da65a	Make vacuum a bit more verbose to debug BF failure. This is temporary. While possibly some more error checking / debugging in this path would be a good thing, it'll not look exactly like this. Discussion: https://postgr.es/m/20200816181604.l54m6kss5ntd6xow@alap3.anarazel.de	2020-08-16 12:57:01 -07:00
Noah Misch	676a9c3cc4	Correct several behavior descriptions in comments. Reuse cautionary language from src/test/ssl/README in src/test/kerberos/README. SLRUs have had access to six-character segments names since commit `73c986adde`, and recovery stopped calling HeapTupleHeaderAdvanceLatestRemovedXid() in commit `558a9165e0`. The other corrections are more self-evident.	2020-08-15 20:21:52 -07:00
Noah Misch	566372b3d6	Prevent concurrent SimpleLruTruncate() for any given SLRU. The SimpleLruTruncate() header comment states the new coding rule. To achieve this, add locktype "frozenid" and two LWLocks. This closes a rare opportunity for data loss, which manifested as "apparent wraparound" or "could not access status of transaction" errors. Data loss is more likely in pg_multixact, due to released branches' thin margin between multiStopLimit and multiWrapLimit. If a user's physical replication primary logged ": apparent wraparound" messages, the user should rebuild standbys of that primary regardless of symptoms. At less risk is a cluster having emitted "not accepting commands" errors or "must be vacuumed" warnings at some point. One can test a cluster for this data loss by running VACUUM FREEZE in every database. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com	2020-08-15 10:15:53 -07:00
Andres Freund	73487a60fc	snapshot scalability: Move subxact info to ProcGlobal, remove PGXACT. Similar to the previous changes this increases the chance that data frequently needed by GetSnapshotData() stays in l2 cache. In many workloads subtransactions are very rare, and this makes the check for that considerably cheaper. As this removes the last member of PGXACT, there is no need to keep it around anymore. On a larger 2 socket machine this and the two preceding commits result in a ~1.07x performance increase in read-only pgbench. For read-heavy mixed r/w workloads without row level contention, I see about 1.1x. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-14 15:33:35 -07:00
Andres Freund	5788e258bb	snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags. Similar to the previous commit this increases the chance that data frequently needed by GetSnapshotData() stays in l2 cache. As we now take care to not unnecessarily write to ProcGlobal->vacuumFlags, there should be very few modifications to the ProcGlobal->vacuumFlags array. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-14 15:33:35 -07:00
Andres Freund	941697c3c1	snapshot scalability: Introduce dense array of in-progress xids. The new array contains the xids for all connected backends / in-use PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT arrays which can have unused entries interspersed). This improves performance because GetSnapshotData() always needs to scan the xids of all live procarray entries and now there's no need to go through the procArray->pgprocnos indirection anymore. As the set of running top-level xids changes rarely, compared to the number of snapshots taken, this substantially increases the likelihood of most data required for a snapshot being in l2 cache. In read-mostly workloads scanning the xids[] array will sufficient to build a snapshot, as most backends will not have an xid assigned. To keep the xid array dense ProcArrayRemove() needs to move entries behind the to-be-removed proc's one further up in the array. Obviously moving array entries cannot happen while a backend sets it xid. I.e. locking needs to prevent that array entries are moved while a backend modifies its xid. To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray entry is not a very frequent operation, even taking 2PC into account. Due to the above, the dense array entries can only be read or modified while holding ProcArrayLock and/or XidGenLock. This prevents a concurrent ProcArrayRemove() from shifting the dense array while it is accessed concurrently. While the new dense array is very good when needing to look at all xids it is less suitable when accessing a single backend's xid. In particular it would be problematic to have to acquire a lock to access a backend's own xid. Therefore a backend's xid is not just stored in the dense array, but also in PGPROC. This also allows a backend to only access the shared xid value when the backend had acquired an xid. The infrastructure added in this commit will be used for the remaining PGXACT fields in subsequent commits. They are kept separate to make review easier. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-14 15:33:35 -07:00
Peter Geoghegan	914140e85a	Fix obsolete comment in xlogutils.c. Oversight in commit `2c03216d83`.	2020-08-14 11:09:08 -07:00
Andres Freund	1f51c17c68	snapshot scalability: Move PGXACT->xmin back to PGPROC. Now that xmin isn't needed for GetSnapshotData() anymore, it leads to unnecessary cacheline ping-pong to have it in PGXACT, as it is updated considerably more frequently than the other PGXACT members. After the changes in `dc7420c2c9`, this is a very straight-forward change. For highly concurrent, snapshot acquisition heavy, workloads this change alone can significantly increase scalability. E.g. plain pgbench on a smaller 2 socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench when submitting queries in batches of 100, and 2.85x for batches of 100 'SELECT';. The latter numbers are obviously not to be expected in the real-world, but micro-benchmark the snapshot computation scalability (previously spending ~80% of the time in GetSnapshotData()). Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-13 16:25:21 -07:00
Alvaro Herrera	a811ea5bde	Handle new HOT chains in index-build table scans When a table is scanned by heapam_index_build_range_scan (née IndexBuildHeapScan) and the table lock being held allows concurrent data changes, it is possible for new HOT chains to sprout in a page that were unknown when the scan of a page happened. This leads to an error such as ERROR: failed to find parent tuple for heap-only tuple at (X,Y) in table "tbl" because the root tuple was not present when we first obtained the list of the page's root tuples. This can be fixed by re-obtaining the list of root tuples, if we see that a heap-only tuple appears to point to a non-existing root. This was reported by Anastasia as occurring for BRIN summarization (which exists since 9.5), but I think it could theoretically also happen with CREATE INDEX CONCURRENTLY (much older) or REINDEX CONCURRENTLY (very recent). It seems a happy coincidence that BRIN forces us to backpatch this all the way to 9.5. Reported-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Diagnosed-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Co-authored-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/602d8487-f0b2-5486-0088-0f372b2549fa@postgrespro.ru Backpatch: 9.5 - master	2020-08-13 17:33:49 -04:00
Andres Freund	b8443eae72	Fix out-of-date version reference, grammar. Time appears to be passing fast. Reported-By: Peter Geoghegan <pg@bowt.ie>	2020-08-12 17:04:51 -07:00
Andres Freund	dc7420c2c9	snapshot scalability: Don't compute global horizons while building snapshots. To make GetSnapshotData() more scalable, it cannot not look at at each proc's xmin: While snapshot contents do not need to change whenever a read-only transaction commits or a snapshot is released, a proc's xmin is modified in those cases. The frequency of xmin modifications leads to, particularly on higher core count systems, many cache misses inside GetSnapshotData(), despite the data underlying a snapshot not changing. That is the most significant source of GetSnapshotData() scaling poorly on larger systems. Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons / thresholds as it has so far. But we don't really have to: The horizons don't actually change that much between GetSnapshotData() calls. Nor are the horizons actually used every time a snapshot is built. The trick this commit introduces is to delay computation of accurate horizons until there use and using horizon boundaries to determine whether accurate horizons need to be computed. The use of RecentGlobal[Data]Xmin to decide whether a row version could be removed has been replaces with new GlobalVisTest* functions. These use two thresholds to determine whether a row can be pruned: 1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed are definitely still visible. 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can definitely be removed GetSnapshotData() updates definitely_needed to be the xmin of the computed snapshot. When testing whether a row can be removed (with GlobalVisTestIsRemovableXid()) and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID < definitely_needed) the boundaries can be recomputed to be more accurate. As it is not cheap to compute accurate boundaries, we limit the number of times that happens in short succession. As the boundaries used by GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by GetSnapshotData()), it is likely that further test can benefit from an earlier computation of accurate horizons. To avoid regressing performance when old_snapshot_threshold is set (as that requires an accurate horizon to be computed), heap_page_prune_opt() doesn't unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the computation of the limited horizon, and the triggering of errors (with SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove tuples. This commit just removes the accesses to PGXACT->xmin from GetSnapshotData(), but other members of PGXACT residing in the same cache line are accessed. Therefore this in itself does not result in a significant improvement. Subsequent commits will take advantage of the fact that GetSnapshotData() now does not need to access xmins anymore. Note: This contains a workaround in heap_page_prune_opt() to keep the snapshot_too_old tests working. While that workaround is ugly, the tests currently are not meaningful, and it seems best to address them separately. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-12 16:03:49 -07:00
Alvaro Herrera	1f42d35a1d	BRIN: Handle concurrent desummarization properly If a page range is desummarized at just the right time concurrently with an index walk, BRIN would raise an error indicating index corruption. This is scary and unhelpful; silently returning that the page range is not summarized is sufficient reaction. This bug was introduced by commit `975ad4e602` as additional protection against a bug whose actual fix was elsewhere. Backpatch equally. Reported-By: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Diagnosed-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/2588667e-d07d-7e10-74e2-7e1e46194491@postgrespro.ru Backpatch: 9.5 - master	2020-08-12 15:33:36 -04:00
Andres Freund	3bd7f9969a	Track latest completed xid as a FullTransactionId. The reason for doing so is that a subsequent commit will need that to avoid wraparound issues. As the subsequent change is large this was split out for easier review. The reason this is not a perfect straight-forward change is that we do not want track 64bit xids in the procarray or the WAL. Therefore we need to advance lastestCompletedXid in relation to 32 bit xids. The code for that is now centralized in MaintainLatestCompletedXid*. Author: Andres Freund Reviewed-By: Thomas Munro, Robert Haas, David Rowley Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-11 17:41:18 -07:00
Andres Freund	fea10a6434	Rename VariableCacheData.nextFullXid to nextXid. Including Full in variable names duplicates the type information and leads to overly long names. As FullTransactionId cannot accidentally be casted to TransactionId that does not seem necessary. Author: Andres Freund Discussion: https://postgr.es/m/20200724011143.jccsyvsvymuiqfxu@alap3.anarazel.de	2020-08-11 12:07:14 -07:00
Peter Eisentraut	1784f278a6	Replace remaining StrNCpy() by strlcpy() They are equivalent, except that StrNCpy() zero-fills the entire destination buffer instead of providing just one trailing zero. For all but a tiny number of callers, that's just overhead rather than being desirable. Remove StrNCpy() as it is now unused. In some cases, namestrcpy() is the more appropriate function to use. While we're here, simplify the API of namestrcpy(): Remove the return value, don't check for NULL input. Nothing was using that anyway. Also, remove a few unused name-related functions. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com	2020-08-10 23:20:37 +02:00
Peter Geoghegan	d129c07499	Correct nbtree page split lock coupling comment. There is no reason to distinguish between readers and writers here.	2020-08-09 12:01:15 -07:00
Amit Kapila	7259736a6e	Implement streaming mode in ReorderBuffer. Instead of serializing the transaction to disk after reaching the logical_decoding_work_mem limit in memory, we consume the changes we have in memory and invoke stream API methods added by commit `45fdc9738b`. However, sometimes if we have incomplete toast or speculative insert we spill to the disk because we can't generate the complete tuple and stream. And, as soon as we get the complete tuple we stream the transaction including the serialized changes. We can do this incremental processing thanks to having assignments (associating subxact with toplevel xacts) in WAL right away, and thanks to logging the invalidation messages at each command end. These features are added by commits `0bead9af48` and `c55040ccd0` respectively. Now that we can stream in-progress transactions, the concurrent aborts may cause failures when the output plugin consults catalogs (both system and user-defined). We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to the backend or WALSender decoding a specific uncommitted transaction. The decoding logic on the receipt of such a sqlerrcode aborts the decoding of the current transaction and continue with the decoding of other transactions. We have ReorderBufferTXN pointer in each ReorderBufferChange by which we know which xact it belongs to. The output plugin can use this to decide which changes to discard in case of stream_abort_cb (e.g. when a subxact gets discarded). We also provide a new option via SQL APIs to fetch the changes being streamed. Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com	2020-08-08 07:47:06 +05:30
Peter Geoghegan	0a7d771f0f	Make nbtree split REDO locking match original execution. Make the nbtree page split REDO routine consistent with original execution in its approach to acquiring and releasing buffer locks (at least for pages on the tree level of the page being split). This brings btree_xlog_split() in line with btree_xlog_unlink_page(), which was taught to couple buffer locks by commit `9a9db08a`. Note that the precise order in which we both acquire and release sibling buffer locks in btree_xlog_split() now matches original execution exactly (the precise order in which the locks are released probably doesn't matter much, but we might as well be consistent about it). The rule for nbtree REDO routines from here on is that same-level locks should be acquired in an order that's consistent with original execution. It's not practical to have a similar rule for cross-level page locks, since for the most part original execution holds those locks for a period that spans multiple atomic actions/WAL records. It's also not necessary, because clearly the cross-level lock coupling is only truly needed during original execution because of the presence of concurrent inserters. This is not a bug fix (unlike the similar aforementioned commit, commit `9a9db08a`). The immediate reason to tighten things up in this area is to enable an upcoming enhancement to contrib/amcheck that allows it to verify that sibling links are in agreement with only an AccessShareLock (this check produced false positives when run on a replica server on account of the inconsistency fixed by this commit). But that's not the only reason to be stricter here. It is generally useful to make locking on replicas be as close to what happens during original execution as practically possible. It makes it less likely that hard to catch bugs will slip in in the future. The previous state of affairs seems to be a holdover from before the introduction of Hot Standby, when buffer lock acquisitions during recovery were totally unnecessary. See also: commit `3bbf668d`, which tightened things up in this area a few years after the introduction of Hot Standby. Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com	2020-08-07 15:27:56 -07:00
Peter Geoghegan	3df92bbd1d	Rename nbtree split REDO routine variables. Make the nbtree page split REDO routine variable names consistent with _bt_split() (which handles the original execution of page splits). These names make the code easier to follow by making the distinction between the original page and the left half of the split clear. (The left half of the split page is a temp page that REDO creates to replace the origpage contents.) Also reduce the elevel used when adding a new high key to the temp page from PANIC to ERROR to be consistent. We already only raise an ERROR when data item PageAddItem() temp page calls fail.	2020-08-07 09:53:27 -07:00
Alexander Korotkov	f47b5e1395	Remove btree page items after page unlink Currently, page unlink leaves remaining items "as is", but replay of corresponding WAL-record re-initializes page leaving it with no items. For the sake of consistency, this commit makes primary delete all the items during page unlink as well. Thanks to this change, we now don't mask contents of deleted btree page for WAL consistency checking. Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Peter Geoghegan	2020-08-05 02:16:13 +03:00
Peter Geoghegan	9a9db08ae4	Fix replica backward scan race condition. It was possible for the logic used by backward scans (which must reason about concurrent page splits/deletions in its own peculiar way) to become confused when running on a replica. Concurrent replay of a WAL record that describes the second phase of page deletion could cause _bt_walk_left() to get confused. btree_xlog_unlink_page() simply failed to adhere to the same locking protocol that we use on the primary, which is obviously wrong once you consider these two disparate functions together. This bug is present in all stable branches. More concretely, the problem was that nothing stopped _bt_walk_left() from observing inconsistencies between the deletion's target page and its original sibling pages when running on a replica. This is true even though the second phase of page deletion is supposed to work as a single atomic action. Queries running on replicas raised "could not find left sibling of block %u in index %s" can't-happen errors when they went back to their scan's "original" page and observed that the page has not been marked deleted (even though it really was concurrently deleted). There is no evidence that this actually happened in the real world. The issue came to light during unrelated feature development work. Note that _bt_walk_left() is the only code that cares about the difference between a half-dead page and a fully deleted page that isn't also exclusively used by nbtree VACUUM (unless you include contrib/amcheck code). It seems very likely that backward scans are the only thing that could become confused by the inconsistency. Even amcheck's complex bt_right_page_check_scankey() dance was unaffected. To fix, teach btree_xlog_unlink_page() to lock the left sibling, target, and right sibling pages in that order before releasing any locks (just like _bt_unlink_halfdead_page()). This is the simplest possible approach. There doesn't seem to be any opportunity to be more clever about lock acquisition in the REDO routine, and it hardly seems worth the trouble in any case. This fix might enable contrib/amcheck verification of leaf page sibling links with only an AccessShareLock on the relation. An amcheck patch from Andrey Borodin was rejected back in January because it clashed with btree_xlog_unlink_page()'s lax approach to locking pages. It now seems likely that the real problem was with btree_xlog_unlink_page(), not the patch. This is a low severity, low likelihood bug, so no backpatch. Author: Michail Nikolaev Diagnosed-By: Michail Nikolaev Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com	2020-08-03 15:54:38 -07:00
Peter Geoghegan	a451b7d442	Add nbtree page deletion assertion. Add a documenting assertion that's similar to the nearby assertion added by commit `cd8c73a3`. This conveys that the entire call to _bt_pagedel() does no work if it isn't possible to get a descent stack for the initial scanblkno page.	2020-08-03 13:04:42 -07:00
Noah Misch	cd5e82256d	Change XID and mxact limits to warn at 40M and stop at 3M. We have edge-case bugs when assigning values in the last few dozen pages before the wrap limit. We may introduce similar bugs in the future. At default BLCKSZ, this makes such bugs unreachable outside of single-user mode. Also, when VACUUM began to consume mxacts, multiStopLimit did not change to compensate. pg_upgrade may fail on a cluster that was already printing "must be vacuumed" warnings. Follow the warning's instructions to clear the warning, then run pg_upgrade again. One can still, peacefully consume 98% of XIDs or mxacts, so DBAs need not change routine VACUUM settings. Discussion: https://postgr.es/m/20200621083513.GA3074645@rfd.leadboat.com	2020-08-01 15:31:01 -07:00
Tom Lane	9f9682783b	Invent "amadjustmembers" AM method for validating opclass members. This allows AM-specific knowledge to be applied during creation of pg_amop and pg_amproc entries. Specifically, the AM knows better than core code which entries to consider as required or optional. Giving the latter entries the appropriate sort of dependency allows them to be dropped without taking out the whole opclass or opfamily; which is something we'd like to have to correct obsolescent entries in extensions. This callback also opens the door to performing AM-specific validity checks during opclass creation, rather than hoping than an opclass developer will remember to test with "amvalidate". For the most part I've not actually added any such checks yet; that can happen in a follow-on patch. (Note that we shouldn't remove any tests from "amvalidate", as those are still needed to cross-check manually constructed entries in the initdb data. So adding tests to "amadjustmembers" will be somewhat duplicative, but it seems like a good idea anyway.) Patch by me, reviewed by Alexander Korotkov, Hamid Akhtar, and Anastasia Lubennikova. Discussion: https://postgr.es/m/4578.1565195302@sss.pgh.pa.us	2020-08-01 17:12:47 -04:00
Thomas Munro	e2b37d9e7c	Use pg_pread() and pg_pwrite() in slru.c. This avoids lseek() system calls at every SLRU I/O, as was done for relation files in commit `c24dcd0c`. Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2Biqke4uTRFj8D8uEUUgj%2BRokPSp%2BCWM6YYzaaamG9Wvg%40mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com	2020-08-02 00:23:35 +12:00
Thomas Munro	c5315f4f44	Cache smgrnblocks() results in recovery. Avoid repeatedly calling lseek(SEEK_END) during recovery by caching the size of each fork. For now, we can't use the same technique in other processes, because we lack a shared invalidation mechanism. Do this by generalizing the pre-existing caching used by FSM and VM to support all forks. Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com	2020-07-31 14:29:52 +12:00
Michael Paquier	e3931d01f3	Use multi-inserts for pg_attribute and pg_shdepend For pg_attribute, this allows to insert at once a full set of attributes for a relation (roughly 15% of WAL reduction in extreme cases). For pg_shdepend, this reduces the work done when creating new shared dependencies from a database template. The number of slots used for the insertion is capped at 64kB of data inserted for both, depending on the number of items to insert and the length of the rows involved. More can be done for other catalogs, like pg_depend. This part requires a different approach as the number of slots to use depends also on the number of entries discarded as pinned dependencies. This is also related to the rework or dependency handling for ALTER TABLE and CREATE TABLE, mainly. Author: Daniel Gustafsson Reviewed-by: Andres Freund, Michael Paquier Discussion: https://postgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de	2020-07-31 10:54:26 +09:00
Fujii Masao	b5310e4ff6	Remove non-fast promotion. When fast promotion was supported in 9.3, non-fast promotion became undocumented feature and it's basically not available for ordinary users. However we decided not to remove non-fast promotion at that moment, to leave it for a release or two for debugging purpose or as an emergency method because fast promotion might have some issues, and then to remove it later. Now, several versions were released since that decision and there is no longer reason to keep supporting non-fast promotion. Therefore this commit removes non-fast promotion. Author: Fujii Masao Reviewed-by: Hamid Akhtar, Kyotaro Horiguchi Discussion: https://postgr.es/m/76066434-648f-f567-437b-54853b43398f@oss.nttdata.com	2020-07-29 21:24:26 +09:00
Thomas Munro	cb04ad4985	Move syncscan.c to src/backend/access/common. Since the tableam.c code needs to make use of the syncscan.c routines itself, and since other block-oriented AMs might also want to use it one day, it didn't make sense for it to live under src/backend/access/heap. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGLCnG%3DNEAByg6bk%2BCT9JZD97Y%3DAxKhh27Su9FeGWOKvDg%40mail.gmail.com	2020-07-29 16:59:33 +12:00
David Rowley	56788d2156	Allocate consecutive blocks during parallel seqscans Previously we would allocate blocks to parallel workers during a parallel sequential scan 1 block at a time. Since other workers were likely to request a block before a worker returns for another block number to work on, this could lead to non-sequential I/O patterns in each worker which could cause the operating system's readahead to perform poorly or not at all. Here we change things so that we allocate consecutive "chunks" of blocks to workers and have them work on those until they're done, at which time we allocate another chunk for the worker. The size of these chunks is based on the size of the relation. Initial patch here was by Thomas Munro which showed some good improvements just having a fixed chunk size of 64 blocks with a simple ramp-down near the end of the scan. The revisions of the patch to make the chunk size based on the relation size and the adjusted ramp-down in powers of two was done by me, along with quite extensive benchmarking to determine the optimal chunk sizes. For the most part, benchmarks have shown significant performance improvements for large parallel sequential scans on Linux, FreeBSD and Windows using SSDs. It's less clear how this affects the performance of cloud providers. Tests done so far are unable to obtain stable enough performance to provide meaningful benchmark results. It is possible that this could cause some performance regressions on more obscure filesystems, so we may need to later provide users with some ability to get something closer to the old behavior. For now, let's leave that until we see that it's really required. Author: Thomas Munro, David Rowley Reviewed-by: Ranier Vilela, Soumyadeep Chakraborty, Robert Haas Reviewed-by: Amit Kapila, Kirk Jamison Discussion: https://postgr.es/m/CA+hUKGJ_EErDv41YycXcbMbCBkztA34+z1ts9VQH+ACRuvpxig@mail.gmail.com	2020-07-26 21:02:45 +12:00
Amit Kapila	c55040ccd0	WAL Log invalidations at command end with wal_level=logical. When wal_level=logical, write invalidations at command end into WAL so that decoding can use this information. This patch is required to allow the streaming of in-progress transactions in logical decoding. The actual work to allow streaming will be committed as a separate patch. We still add the invalidations to the cache and write them to WAL at commit time in RecordTransactionCommit(). This uses the existing XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource manager (see LogStandbyInvalidations for details). So existing code relying on those invalidations (e.g. redo) does not need to be changed. The invalidations written at command end uses a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See LogLogicalInvalidations for details. These new xlog records are ignored by existing redo procedures, which still rely on the invalidations written to commit records. The invalidations are decoded and accumulated in top-transaction, and then executed during replay. This obviates the need to decode the invalidations as part of a commit record. Bump XLOG_PAGE_MAGIC, since this introduces XLOG_XACT_INVALIDATIONS. Author: Dilip Kumar, Tomas Vondra, Amit Kapila Reviewed-by: Amit Kapila Tested-by: Neha Sharma and Mahendra Singh Thalor Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com	2020-07-23 08:34:48 +05:30
Peter Geoghegan	4a70f829d8	Add nbtree Valgrind buffer lock checks. Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page provides very weak guarantees, especially compared to heapam, where it's often safe to read a page while only holding a buffer pin. This commit has Valgrind enforce the following rule: it is never okay to access an nbtree buffer without holding both a pin and a lock on the buffer. A draft version of this patch detected questionable code that was cleaned up by commits `fa7ff642` and `7154aa16`. The code in question used to access an nbtree buffer page's special/opaque area with no buffer lock (only a buffer pin). This practice (which isn't obviously unsafe) is hereby formally disallowed in nbtree. There doesn't seem to be any reason to allow it, and banning it keeps things simple for Valgrind. The new checks are implemented by adding custom nbtree client requests (located in LockBuffer() wrapper functions); these requests are "superimposed" on top of the generic bufmgr.c Valgrind client requests added by commit `1e0dfd16`. No custom resource management cleanup code is needed to undo the effects of marking buffers as non-accessible under this scheme. Author: Peter Geoghegan Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com	2020-07-21 15:50:58 -07:00
Fujii Masao	c3fe108c02	Rename wal_keep_segments to wal_keep_size. max_slot_wal_keep_size that was added in v13 and wal_keep_segments are the GUC parameters to specify how much WAL files to retain for the standby servers. While max_slot_wal_keep_size accepts the number of bytes of WAL files, wal_keep_segments accepts the number of WAL files. This difference of setting units between those similar parameters could be confusing to users. To alleviate this situation, this commit renames wal_keep_segments to wal_keep_size, and make users specify the WAL size in it instead of the number of WAL files. There was also the idea to rename max_slot_wal_keep_size to max_slot_wal_keep_segments, in the discussion. But we have been moving away from measuring in segments, for example, checkpoint_segments was replaced by max_wal_size. So we concluded to rename wal_keep_segments to wal_keep_size. Back-patch to v13 where max_slot_wal_keep_size was added. Author: Fujii Masao Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, David Steele Discussion: https://postgr.es/m/574b4ea3-e0f9-b175-ead2-ebea7faea855@oss.nttdata.com	2020-07-20 13:30:18 +09:00
Amit Kapila	0bead9af48	Immediately WAL-log subtransaction and top-level XID association. The logical decoding infrastructure needs to know which top-level transaction the subxact belongs to, in order to decode all the changes. Until now that might be delayed until commit, due to the caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring incremental decoding. So we also write the assignment info into WAL immediately, as part of the next WAL record (to minimize overhead) only when wal_level=logical. We can not remove the existing XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in the hot standby snapshot. Bump XLOG_PAGE_MAGIC, since this introduces XLR_BLOCK_ID_TOPLEVEL_XID. Author: Tomas Vondra, Dilip Kumar, Amit Kapila Reviewed-by: Amit Kapila Tested-by: Neha Sharma and Mahendra Singh Thalor Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com	2020-07-20 08:48:26 +05:30
Peter Geoghegan	a766d6ca22	Avoid harmless Valgrind no-buffer-pin errors. Valgrind builds with assertions enabled sometimes perform a theoretically unsafe page access inside an assertion in heapam_tuple_lock(). This happened when the eval-plan-qual isolation test ran one of the permutations added by commit `a2418f9e23`. Avoid complaints from Valgrind by moving the assertion ever so slightly. This is minor cleanup for commit `1e0dfd16`, which added Valgrind buffer access instrumentation. No backpatch, since this only happens within an assertion, and seems very unlikely to cause any real problems even with assert-enabled builds.	2020-07-19 16:12:51 -07:00
Peter Geoghegan	5da8bf8bbb	Avoid CREATE INDEX unique index deduplication. There is no advantage to attempting deduplication for a unique index during CREATE INDEX, since there cannot possibly be any duplicates. Doing so wastes cycles due to unnecessary copying. Make sure that we avoid it consistently. We already avoided unique index deduplication in the case where there were some spool2 tuples to merge. That didn't account for the fact that spool2 is removed early/unset in the common case where it has no tuples that need to be merged (i.e. it failed to account for the "spool2 turns out to be unnecessary" optimization in _bt_spools_heapscan()). Oversight in commit `0d861bbb`, which added nbtree deduplication Backpatch: 13-, where nbtree deduplication was introduced.	2020-07-17 09:50:48 -07:00

1 2 3 4 5 ...

3898 Commits