postgresql

mirror of https://git.postgresql.org/git/postgresql.git synced 2024-07-17 12:21:09 +02:00

Author	SHA1	Message	Date
Alvaro Herrera	450c8230b1	Don't hold ProcArrayLock longer than needed in rare cases While cancelling an autovacuum worker, we hold ProcArrayLock while formatting a debugging log string. We can make this shorter by saving the data we need to produce the message and doing the formatting outside the locked region. This isn't terribly critical, as it only occurs pretty rarely: when a backend runs deadlock detection and it happens to be blocked by a autovacuum running autovacuum. Still, there's no need to cause a hiccup in ProcArrayLock processing, which can be very high-traffic in some cases. While at it, rework code so that we only print the string when it is really going to be used, as suggested by Michael Paquier. Discussion: https://postgr.es/m/20201118214127.GA3179@alvherre.pgsql Reviewed-by: Michael Paquier <michael@paquier.xyz>	2020-11-23 18:55:23 -03:00
Thomas Munro	7888b09994	Add BarrierArriveAndDetachExceptLast(). Provide a way for one process to continue the remaining phases of a (previously) parallel computation alone. Later patches will use this to extend Parallel Hash Join. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com	2020-11-19 18:13:46 +13:00
Alvaro Herrera	27838981be	Relax lock level for setting PGPROC->statusFlags We don't actually need a lock to set PGPROC->statusFlags itself; what we do need is a shared lock on either XidGenLock or ProcArrayLock in order to ensure MyProc->pgxactoff keeps still while we modify the mirror array in ProcGlobal->statusFlags. Some places were using an exclusive lock for that, which is excessive. Relax those to use shared lock only. procarray.c has a couple of places with somewhat brittle assumptions about PGPROC changes: ProcArrayEndTransaction uses only shared lock, so it's permissible to change MyProc only. On the other hand, ProcArrayEndTransactionInternal also changes other procs, so it must hold exclusive lock. Add asserts to ensure those assumptions continue to hold. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/20201117155501.GA13805@alvherre.pgsql	2020-11-18 13:24:22 -03:00
Tom Lane	2bd49b493a	Don't Insert() a VFD entry until it's fully built. Otherwise, if FDDEBUG is enabled, the debugging output fails because it tries to read the fileName, which isn't set up yet (and should in fact always be NULL). AFAICT, this has been wrong since Berkeley. Before `96bf88d52`, it would accidentally fail to crash on platforms where snprintf() is forgiving about being passed a NULL pointer for %s; but the file name intended to be included in the debug output wouldn't ever have shown up. Report and fix by Greg Nancarrow. Although this is only visibly broken in custom-made builds, it still seems worth back-patching to all supported branches, as the FDDEBUG code is pretty useless as it stands. Discussion: https://postgr.es/m/CAJcOf-cUDgm9qYtC_B6XrC6MktMPNRby2p61EtSGZKnfotMArw@mail.gmail.com	2020-11-16 20:32:55 -05:00
Alvaro Herrera	cd9c1b3e19	Rename PGPROC->vacuumFlags to statusFlags With more flags associated to a PGPROC entry that are not related to vacuum (currently existing or planned), the name "statusFlags" describes its purpose better. (The same is done to the mirroring PROC_HDR->vacuumFlags.) No functional changes in this commit. This was suggested first by Hari Babu Kommi in [1] and then by Michael Paquier at [2]. [1] https://postgr.es/m/CAJrrPGcsDC-oy1AhqH0JkXYa0Z2AgbuXzHPpByLoBGMxfOZMEQ@mail.gmail.com [2] https://postgr.es/m/20200820060929.GB3730@paquier.xyz Author: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20201116182446.qcg3o6szo2zookyr@localhost	2020-11-16 19:42:55 -03:00
Michael Paquier	788dd0b839	Fix some typos Author: Daniel Gustafsson Discussion: https://postgr.es/m/C36ADFDF-D09A-4EE5-B186-CB46C3653F4C@yesql.se	2020-11-14 11:43:10 +09:00
Michael Paquier	e152506ade	Revert pg_relation_check_pages() This reverts the following set of commits, following complaints about the lack of portability of the central part of the code in bufmgr.c as well as the use of partition mapping locks during page reads: `c780a7a9` `f2b88396` `b787d4ce` `ce7f772c` `60a51c6b` Per discussion with Andres Freund, Robert Haas and myself. Bump catalog version. Discussion: https://postgr.es/m/20201029181729.2nrub47u7yqncsv7@alap3.anarazel.de	2020-11-04 10:21:46 +09:00
Amit Kapila	8c2d8f6cc4	Fix typos. Author: Hou Zhijie Discussion: https://postgr.es/m/855a9421839d402b8b351d273c89a8f8@G08CNEXMBPEKD05.g08.fujitsu.local	2020-11-03 08:38:27 +05:30
Michael Paquier	8a15e735be	Fix some grammar and typos in comments and docs The documentation fixes are backpatched down to where they apply. Author: Justin Pryzby Discussion: https://postgr.es/m/20201031020801.GD3080@telsasoft.com Backpatch-through: 9.6	2020-11-02 15:14:41 +09:00
Andres Freund	1c7675a7a4	Fix wrong data table horizon computation during backend startup. When ComputeXidHorizons() was called before MyDatabaseOid is set, e.g. because a dead row in a shared relation is encountered during InitPostgres(), the horizon for normal tables was computed too aggressively, ignoring all backends connected to a database. During subsequent pruning in a data table the too aggressive horizon could end up still being used, possibly leading to still needed tuples being removed. Not good. This is a bug in `dc7420c2c9`, which the test added in `94bc27b576` made visible, if run with force_parallel_mode set to regress. In that case the bug is reliably triggered, because "pruning_query" is run in a parallel worker and the start of that parallel worker is likely to encounter a dead row in pg_database. The fix is trivial: Compute a more pessimistic data table horizon if MyDatabaseId is not yet known. Author: Andres Freund Discussion: https://postgr.es/m/20201029040030.p4osrmaywhqaesd4@alap3.anarazel.de	2020-10-28 21:49:07 -07:00
Andres Freund	94bc27b576	Centralize horizon determination for temp tables, fixing bug due to skew. This fixes a bug in the edge case where, for a temp table, heap_page_prune() can end up with a different horizon than heap_vacuum_rel(). Which can trigger errors like "ERROR: cannot freeze committed xmax ...". The bug was introduced due to interaction of `a7212be8b9` "Set cutoff xmin more aggressively when vacuuming a temporary table." with `dc7420c2c9` "snapshot scalability: Don't compute global horizons while building snapshots.". The problem is caused by lazy_scan_heap() assuming that the only reason its HeapTupleSatisfiesVacuum() call would return HEAPTUPLE_DEAD is if the tuple is a HOT tuple, or if the tuple's inserting transaction has aborted since the heap_page_prune() call. But after `a7212be8b9` that was also possible in other cases for temp tables, because heap_page_prune() uses a different visibility test after `dc7420c2c9`. The fix is fairly simple: Move the special case logic for temp tables from vacuum_set_xid_limits() to the infrastructure introduced in `dc7420c2c9`. That ensures that the horizon used for pruning is at least as aggressive as the one used by lazy_scan_heap(). The concrete horizon used for temp tables is slightly different than the logic in `dc7420c2c9`, but should always be as aggressive as before (see comments). A significant benefit to centralizing the logic procarray.c is that now the more aggressive horizons for temp tables does not just apply to VACUUM but also to e.g. HOT pruning and the nbtree killtuples logic. Because isTopLevel is not needed by vacuum_set_xid_limits() anymore, I undid the the related changes from `a7212be8b9`. This commit also adds an isolation test ensuring that the more aggressive vacuuming and pruning of temp tables keeps working. Debugged-By: Amit Kapila <amit.kapila16@gmail.com> Debugged-By: Tom Lane <tgl@sss.pgh.pa.us> Debugged-By: Ashutosh Sharma <ashu.coek88@gmail.com> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvxshp@alap3.anarazel.de	2020-10-28 18:02:31 -07:00
Michael Paquier	c780a7a90a	Add CheckBuffer() to check on-disk pages without shared buffer loading CheckBuffer() is designed to be a concurrent-safe function able to run sanity checks on a relation page without loading it into the shared buffers. The operation is done using a lock on the partition involved in the shared buffer mapping hashtable and an I/O lock for the buffer itself, preventing the risk of false positives due to any concurrent activity. The primary use of this function is the detection of on-disk corruptions for relation pages. If a page is found in shared buffers, the on-disk page is checked if not dirty (a follow-up checkpoint would flush a valid version of the page if dirty anyway), as it could be possible that a page was present for a long time in shared buffers with its on-disk version corrupted. Such a scenario could lead to a corrupted cluster if a host is plugged off for example. If the page is not found in shared buffers, its on-disk state is checked. PageIsVerifiedExtended() is used to apply the same sanity checks as when a page gets loaded into shared buffers. This function will be used by an upcoming patch able to check the state of on-disk relation pages using a SQL function. Author: Julien Rouhaud, Michael Paquier Reviewed-by: Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com	2020-10-28 11:12:46 +09:00
Michael Paquier	d401c5769e	Extend PageIsVerified() to handle more custom options This is useful for checks of relation pages without having to load the pages into the shared buffers, and two cases can make use of that: page verification in base backups and the online, lock-safe, flavor. Compatibility is kept with past versions using a macro that calls the new extended routine with the set of options compatible with the original version. Extracted from a larger patch by the same author. Author: Anastasia Lubennikova Reviewed-by: Michael Paquier, Julien Rouhaud Discussion: https://postgr.es/m/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru	2020-10-26 09:55:28 +09:00
Peter Eisentraut	26ec6b5948	Avoid invalid alloc size error in shm_mq In shm_mq_receive(), a huge payload could trigger an unjustified "invalid memory alloc request size" error due to the way the buffer size is increased. Add error checks (documenting the upper limit) and avoid the error by limiting the allocation size to MaxAllocSize. Author: Markus Wanner <markus.wanner@2ndquadrant.com> Discussion: https://www.postgresql.org/message-id/flat/3bb363e7-ac04-0ac4-9fe8-db1148755bfa%402ndquadrant.com	2020-10-19 08:52:25 +02:00
Thomas Munro	70516a178a	Handle EACCES errors from kevent() better. While registering for postmaster exit events, we have to handle a couple of edge cases where the postmaster is already gone. Commit `815c2f09` missed one: EACCES must surely imply that PostmasterPid no longer belongs to our postmaster process (or alternatively an unexpected permissions model has been imposed on us). Like ESRCH, this should be treated as a WL_POSTMASTER_DEATH event, rather than being raised with ereport(). No known problems reported in the wild. Per code review from Tom Lane. Back-patch to 13. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3624029.1602701929%40sss.pgh.pa.us	2020-10-15 18:34:21 +13:00
Thomas Munro	b94109ce37	Make WL_POSTMASTER_DEATH level-triggered on kqueue builds. If WaitEventSetWait() reports that the postmaster has gone away, later calls to WaitEventSetWait() should continue to report that. Otherwise further waits that occur in the proc_exit() path after we already noticed the postmaster's demise could block forever. Back-patch to 13, where the kqueue support landed. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3624029.1602701929%40sss.pgh.pa.us	2020-10-15 11:41:58 +13:00
Andres Freund	7b28913bca	Fix and test snapshot behavior on standby. I (Andres) broke this in 623a9CA79bx, because I didn't think about the way snapshots are built on standbys sufficiently. Unfortunately our existing tests did not catch this, as they are all just querying with psql (therefore ending up with fresh snapshots). The fix is trivial, we just need to increment the transaction completion counter in ExpireTreeKnownAssignedTransactionIds(), which is the equivalent of ProcArrayEndTransaction() during recovery. This commit also adds a new test doing some basic testing of the correctness of snapshots built on standbys. To avoid the aforementioned issue of one-shot psql's not exercising the snapshot caching, the test uses a long lived psqls, similar to 013_crash_restart.pl. It'd be good to extend the test further. Reported-By: Ian Barwick <ian.barwick@2ndquadrant.com> Author: Andres Freund <andres@anarazel.de> Author: Ian Barwick <ian.barwick@2ndquadrant.com> Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb18f@2ndquadrant.com	2020-09-30 17:28:51 -07:00
Thomas Munro	dee663f784	Defer flushing of SLRU files. Previously, we called fsync() after writing out individual pg_xact, pg_multixact and pg_commit_ts pages due to cache pressure, leading to regular I/O stalls in user backends and recovery. Collapse requests for the same file into a single system call as part of the next checkpoint, as we already did for relation files, using the infrastructure developed by commit `3eb77eba`. This can cause a significant improvement to recovery performance, especially when it's otherwise CPU-bound. Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer that it applies to all the SLRU mini-buffer-pools as well as the main buffer pool. Rearrange things so that data collected in CheckpointStats includes SLRU activity. Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions, because they were redundant after the shutdown checkpoint that immediately precedes them. (I'm not sure if they were ever needed, but they aren't now.) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts) Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com	2020-09-25 19:00:15 +12:00
Thomas Munro	733fa9aa51	Allow WaitLatch() to be used without a latch. Due to flaws in commit `3347c982ba`, using WaitLatch() without WL_LATCH_SET could cause an assertion failure or crash. Repair. While here, also add a check that the latch we're switching to belongs to this backend, when changing from one latch to another. Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com	2020-09-23 15:17:30 +12:00
Peter Eisentraut	80fc96eceb	Standardize order of use strict and use warnings in Perl code The standard order in PostgreSQL and other code is use strict first, but some code was uselessly inconsistent about this.	2020-09-21 17:04:36 +02:00
Peter Eisentraut	3d13867a2c	Fix whitespace	2020-09-20 14:42:54 +02:00
David Rowley	19c60ad69a	Optimize compactify_tuples function This function could often be seen in profiles of vacuum and could often be a significant bottleneck during recovery. The problem was that a qsort was performed in order to sort an array of item pointers in reverse offset order so that we could use that to safely move tuples up to the end of the page without overwriting the memory of yet-to-be-moved tuples. i.e. we used to compact the page starting at the back of the page and move towards the front. The qsort that this required could be expensive for pages with a large number of tuples. In this commit, we take another approach to tuple compactification. Now, instead of sorting the remaining item pointers array we first check if the array is presorted and only memmove() the tuples that need to be moved. This presorted check can be done very cheaply in the calling functions when the array is being populated. This presorted case is very fast. When the item pointer array is not presorted we must copy tuples that need to be moved into a temp buffer before copying them back into the page again. This differs from what we used to do here as we're now copying the tuples back into the page in reverse line pointer order. Previously we left the existing order alone. Reordering the tuples results in an increased likelihood of hitting the pre-sorted case the next time around. Any newly added tuple which consumes a new line pointer will also maintain the correct sort order of tuples in the page which will also result in the presorted case being hit the next time. Only consuming an unused line pointer can cause the order of tuples to go out again, but that will be corrected next time the function is called for the page. Benchmarks have shown that the non-presorted case is at least equally as fast as the original qsort method even when the page just has a few tuples. As the number of tuples becomes larger the new method maintains its performance whereas the original qsort method became much slower when the number of tuples on the page became large. Author: David Rowley Reviewed-by: Thomas Munro Tested-by: Jakub Wartak Discussion: https://postgr.es/m/CA+hUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw@mail.gmail.com	2020-09-16 13:22:20 +12:00
Fujii Masao	95233011a0	Fix typos. Author: Naoki Nakamichi Discussion: https://postgr.es/m/b6919d145af00295a8e86ce4d034b7cd@oss.nttdata.com	2020-09-14 14:16:07 +09:00
Peter Eisentraut	3e0242b24c	Message fixes and style improvements	2020-09-14 06:42:30 +02:00
Tom Lane	6693a96b32	Don't run atexit callbacks during signal exits from ProcessStartupPacket. Although `58c6feccf` fixed the case for SIGQUIT, we were still calling proc_exit() from signal handlers for SIGTERM and timeout failures in ProcessStartupPacket. Fortunately, at the point where that code runs, we haven't yet connected to shared memory in any meaningful way, so there is nothing we need to undo in shared memory. This means it should be safe to use _exit(1) here, ie, not run any atexit handlers but also inform the postmaster that it's not a crash exit. To make sure nobody breaks the "nothing to undo" expectation, add a cross-check that no on-shmem-exit or before-shmem-exit handlers have been registered yet when we finish using these signal handlers. This change is simple enough that maybe it could be back-patched, but I won't risk that right now. Discussion: https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us	2020-09-11 12:20:16 -04:00
Michael Paquier	aad546bd0a	doc: Fix some grammar and inconsistencies Some comments are fixed while on it. Author: Justin Pryzby Discussion: https://postgr.es/m/20200818171702.GK17022@telsasoft.com Backpatch-through: 9.6	2020-09-10 15:50:19 +09:00
Tom Lane	c9ae5cbb88	Install an error check into cancel_before_shmem_exit(). Historically, cancel_before_shmem_exit() just silently did nothing if the specified callback wasn't the top-of-stack. The folly of ignoring this case was exposed by the bugs fixed in `303640199` and `bab150045`, so let's make it throw elog(ERROR) instead. There is a decent argument to be made that PG_ENSURE_ERROR_CLEANUP should use some separate infrastructure, so it wouldn't break if something inside the guarded code decides to register a new before_shmem_exit callback. However, a survey of the surviving uses of before_shmem_exit() and PG_ENSURE_ERROR_CLEANUP doesn't show any plausible conflicts of that sort today, so for now we'll forgo the extra complexity. (It will almost certainly become necessary if anyone ever wants to wrap PG_ENSURE_ERROR_CLEANUP around arbitrary user-defined actions, though.) No backpatch, since this is developer support not a production issue. Bharath Rupireddy, per advice from Andres Freund, Robert Haas, and myself Discussion: https://postgr.es/m/CALj2ACWk7j4F2v2fxxYfrroOF=AdFNPr1WsV+AGtHAFQOqm_pw@mail.gmail.com	2020-09-08 15:54:25 -04:00
Andres Freund	5871f09c98	Fix autovacuum cancellation. The problem is caused by me (Andres) having ProcSleep() look at the wrong PGPROC entry in `5788e258bb`. Unfortunately it seems hard to write a reliable test for autovacuum cancellations. Perhaps somebody will come up with a good approach, but it seems worth fixing the issue even without a test. Reported-By: Jeff Janes <jeff.janes@gmail.com> Author: Jeff Janes <jeff.janes@gmail.com> Discussion: https://postgr.es/m/CAMkU=1wH2aUy+wDRDz+5RZALdcUnEofV1t9PzXS_gBJO9vZZ0Q@mail.gmail.com	2020-09-08 11:25:34 -07:00
Thomas Munro	861c6e7c8e	Skip unnecessary stat() calls in walkdir(). Some kernels can tell us the type of a "dirent", so we can avoid a call to stat() or lstat() in many cases. Define a new function get_dirent_type() to contain that logic, for use by the backend and frontend versions of walkdir(), and perhaps other callers in future. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKG%2BFzxupGGN4GpUdbzZN%2Btn6FQPHo8w0Q%2BAPH5Wz8RG%2Bww%40mail.gmail.com	2020-09-07 18:28:06 +12:00
Tom Lane	695de5d1ed	Split Makefile symbol CFLAGS_VECTOR into two symbols. Replace CFLAGS_VECTOR with CFLAGS_UNROLL_LOOPS and CFLAGS_VECTORIZE, allowing us to distinguish whether we want to apply -funroll-loops, -ftree-vectorize, or both to a particular source file. Up to now the only consumer of the symbol has been checksum.c which wants both, so that there was no need to distinguish; but that's about to change. Amit Khandekar, reviewed and edited a little by me Discussion: https://postgr.es/m/CAJ3gD9evtA_vBo+WMYMyT-u=keHX7-r8p2w7OSRfXf42LTwCZQ@mail.gmail.com	2020-09-06 21:28:16 -04:00
Magnus Hagander	2a093355aa	Fix typo in comment Author: Hou, Zhijie	2020-09-06 19:26:55 +02:00
Tom Lane	a5cc4dab6d	Yet more elimination of dead stores and useless initializations. I'm not sure what tool Ranier was using, but the ones I contributed were found by using a newer version of scan-build than I tried before. Ranier Vilela and Tom Lane Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com	2020-09-05 13:17:32 -04:00
Bruce Momjian	e36e936e0e	remove redundant initializations Reported-by: Ranier Vilela Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com Author: Ranier Vilela Backpatch-through: master	2020-09-03 22:57:35 -04:00
Amit Kapila	4ab77697f6	Fix the SharedFileSetUnregister API. Commit `808e13b282` introduced a few APIs to extend the existing Buffile interface. In SharedFileSetDeleteOnProcExit, it tries to delete the list element while traversing the list with 'foreach' construct which makes the behavior of list traversal unpredictable. Author: Amit Kapila Reviewed-by: Dilip Kumar Tested-by: Dilip Kumar and Neha Sharma Discussion: https://postgr.es/m/CAA4eK1JhLatVcQ2OvwA_3s0ih6Hx9+kZbq107cXVsSWWukH7vA@mail.gmail.com	2020-09-01 08:11:39 +05:30
Michael Paquier	77c7267c37	Fix comment in procarray.c The description of GlobalVisDataRels was missing, GlobalVisCatalogRels being mentioned instead. Author: Jim Nasby Discussion: https://postgr.es/m/8e06c883-2858-1fd4-07c5-560c28b08dcd@amazon.com	2020-08-27 16:40:34 +09:00
Tom Lane	e942af7b82	Suppress compiler warning in non-cassert builds. Oversight in `808e13b28`, reported by Bruce Momjian. Discussion: https://postgr.es/m/20200826160251.GB21909@momjian.us	2020-08-26 17:08:11 -04:00
Amit Kapila	808e13b282	Extend the BufFile interface. Allow BufFile to support temporary files that can be used by the single backend when the corresponding files need to be survived across the transaction and need to be opened and closed multiple times. Such files need to be created as a member of a SharedFileSet. Additionally, this commit implements the interface for BufFileTruncate to allow files to be truncated up to a particular offset and extends the BufFileSeek API to support the SEEK_END case. This also adds an option to provide a mode while opening the shared BufFiles instead of always opening in read-only mode. These enhancements in BufFile interface are required for the upcoming patch to allow the replication apply worker, to handle streamed in-progress transactions. Author: Dilip Kumar, Amit Kapila Reviewed-by: Amit Kapila Tested-by: Neha Sharma Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com	2020-08-26 07:36:43 +05:30
Fujii Masao	d259afa736	Fix typos in comments. Author: Masahiko Sawada Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/CA+fd4k4m9hFSrRLB3etPWO5_v5=MujVZWRtz63q+55hM0Dz25Q@mail.gmail.com	2020-08-21 12:35:22 +09:00
Andres Freund	1fe1f42e3e	Acquire ProcArrayLock exclusively in ProcArrayClearTransaction. This corrects an oversight by me in `2072932407`, which made ProcArrayClearTransaction() increment xactCompletionCount. That requires an exclusive lock, obviously. There's other approaches that avoid the exclusive acquisition, but given that a 2PC commit is fairly heavyweight, it doesn't seem worth doing so. I've not been able to measure a performance difference, unsurprisingly. I did add a comment documenting that we could do so, should it ever become a bottleneck. Reported-By: Tom Lane <tgl@sss.pgh.pa.us> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/1355915.1597794204@sss.pgh.pa.us	2020-08-19 18:24:33 -07:00
Andres Freund	07f32fcd23	Fix race condition in snapshot caching when 2PC is used. When preparing a transaction xactCompletionCount needs to be incremented, even though the transaction has not committed yet. Otherwise the snapshot used within the transaction otherwise can get reused outside of the prepared transaction. As GetSnapshotData() does not include the current xid when building a snapshot, reuse would not be correct. Somewhat surprisingly the regression tests only rarely show incorrect results without the fix. The reason for that is that often the snapshot's xmax will be >= the backend xid, yielding a snapshot that is correct, despite the bug. I'm working on a reliable test for the bug, but it seems worth seeing whether this fixes all the BF failures while I do. Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/E1k7tGP-0005V0-5k@gemulon.postgresql.org	2020-08-18 16:31:12 -07:00
Andres Freund	623a9ba79b	snapshot scalability: cache snapshots using a xact completion counter. Previous commits made it faster/more scalable to compute snapshots. But not building a snapshot is still faster. Now that GetSnapshotData() does not maintain RecentGlobal* anymore, that is actually not too hard: This commit introduces xactCompletionCount, which tracks the number of top-level transactions with xids (i.e. which may have modified the database) that completed in some form since the start of the server. We can avoid rebuilding the snapshot's contents whenever the current xactCompletionCount is the same as it was when the snapshot was originally built. Currently this check happens while holding ProcArrayLock. While it's likely possible to perform the check without acquiring ProcArrayLock, it seems better to do that separately / later, some careful analysis is required. Even with the lock this is a significant win on its own. On a smaller two socket machine this gains another ~1.03x, on a larger machine the effect is roughly double (earlier patch version tested though). If we were able to safely avoid the lock there'd be another significant gain on top of that. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-17 21:08:30 -07:00
Andres Freund	f6661d3df2	Fix use of wrong index in ComputeXidHorizons(). This bug, recently introduced in `941697c3c1`, at least lead to vacuum failing because it found tuples inserted by a running transaction, but below the freeze limit. The freeze limit in turn is directly affected by the aforementioned bug. Thanks to Tom Lane figuring how to make the bug reproducible. We should add a few more assertions to make sure this type of bug isn't as hard to notice, but it's not yet clear how to best do so. Co-Diagnosed-By: Tom Lane <tgl@sss.pgh.pa.us> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/1013484.1597609043@sss.pgh.pa.us	2020-08-16 14:21:37 -07:00
Noah Misch	676a9c3cc4	Correct several behavior descriptions in comments. Reuse cautionary language from src/test/ssl/README in src/test/kerberos/README. SLRUs have had access to six-character segments names since commit `73c986adde`, and recovery stopped calling HeapTupleHeaderAdvanceLatestRemovedXid() in commit `558a9165e0`. The other corrections are more self-evident.	2020-08-15 20:21:52 -07:00
Noah Misch	566372b3d6	Prevent concurrent SimpleLruTruncate() for any given SLRU. The SimpleLruTruncate() header comment states the new coding rule. To achieve this, add locktype "frozenid" and two LWLocks. This closes a rare opportunity for data loss, which manifested as "apparent wraparound" or "could not access status of transaction" errors. Data loss is more likely in pg_multixact, due to released branches' thin margin between multiStopLimit and multiWrapLimit. If a user's physical replication primary logged ": apparent wraparound" messages, the user should rebuild standbys of that primary regardless of symptoms. At less risk is a cluster having emitted "not accepting commands" errors or "must be vacuumed" warnings at some point. One can test a cluster for this data loss by running VACUUM FREEZE in every database. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com	2020-08-15 10:15:53 -07:00
Andres Freund	73487a60fc	snapshot scalability: Move subxact info to ProcGlobal, remove PGXACT. Similar to the previous changes this increases the chance that data frequently needed by GetSnapshotData() stays in l2 cache. In many workloads subtransactions are very rare, and this makes the check for that considerably cheaper. As this removes the last member of PGXACT, there is no need to keep it around anymore. On a larger 2 socket machine this and the two preceding commits result in a ~1.07x performance increase in read-only pgbench. For read-heavy mixed r/w workloads without row level contention, I see about 1.1x. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-14 15:33:35 -07:00
Andres Freund	5788e258bb	snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags. Similar to the previous commit this increases the chance that data frequently needed by GetSnapshotData() stays in l2 cache. As we now take care to not unnecessarily write to ProcGlobal->vacuumFlags, there should be very few modifications to the ProcGlobal->vacuumFlags array. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-14 15:33:35 -07:00
Andres Freund	941697c3c1	snapshot scalability: Introduce dense array of in-progress xids. The new array contains the xids for all connected backends / in-use PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT arrays which can have unused entries interspersed). This improves performance because GetSnapshotData() always needs to scan the xids of all live procarray entries and now there's no need to go through the procArray->pgprocnos indirection anymore. As the set of running top-level xids changes rarely, compared to the number of snapshots taken, this substantially increases the likelihood of most data required for a snapshot being in l2 cache. In read-mostly workloads scanning the xids[] array will sufficient to build a snapshot, as most backends will not have an xid assigned. To keep the xid array dense ProcArrayRemove() needs to move entries behind the to-be-removed proc's one further up in the array. Obviously moving array entries cannot happen while a backend sets it xid. I.e. locking needs to prevent that array entries are moved while a backend modifies its xid. To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray entry is not a very frequent operation, even taking 2PC into account. Due to the above, the dense array entries can only be read or modified while holding ProcArrayLock and/or XidGenLock. This prevents a concurrent ProcArrayRemove() from shifting the dense array while it is accessed concurrently. While the new dense array is very good when needing to look at all xids it is less suitable when accessing a single backend's xid. In particular it would be problematic to have to acquire a lock to access a backend's own xid. Therefore a backend's xid is not just stored in the dense array, but also in PGPROC. This also allows a backend to only access the shared xid value when the backend had acquired an xid. The infrastructure added in this commit will be used for the remaining PGXACT fields in subsequent commits. They are kept separate to make review easier. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-14 15:33:35 -07:00
Andres Freund	1f51c17c68	snapshot scalability: Move PGXACT->xmin back to PGPROC. Now that xmin isn't needed for GetSnapshotData() anymore, it leads to unnecessary cacheline ping-pong to have it in PGXACT, as it is updated considerably more frequently than the other PGXACT members. After the changes in `dc7420c2c9`, this is a very straight-forward change. For highly concurrent, snapshot acquisition heavy, workloads this change alone can significantly increase scalability. E.g. plain pgbench on a smaller 2 socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench when submitting queries in batches of 100, and 2.85x for batches of 100 'SELECT';. The latter numbers are obviously not to be expected in the real-world, but micro-benchmark the snapshot computation scalability (previously spending ~80% of the time in GetSnapshotData()). Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-13 16:25:21 -07:00
Andres Freund	dc7420c2c9	snapshot scalability: Don't compute global horizons while building snapshots. To make GetSnapshotData() more scalable, it cannot not look at at each proc's xmin: While snapshot contents do not need to change whenever a read-only transaction commits or a snapshot is released, a proc's xmin is modified in those cases. The frequency of xmin modifications leads to, particularly on higher core count systems, many cache misses inside GetSnapshotData(), despite the data underlying a snapshot not changing. That is the most significant source of GetSnapshotData() scaling poorly on larger systems. Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons / thresholds as it has so far. But we don't really have to: The horizons don't actually change that much between GetSnapshotData() calls. Nor are the horizons actually used every time a snapshot is built. The trick this commit introduces is to delay computation of accurate horizons until there use and using horizon boundaries to determine whether accurate horizons need to be computed. The use of RecentGlobal[Data]Xmin to decide whether a row version could be removed has been replaces with new GlobalVisTest* functions. These use two thresholds to determine whether a row can be pruned: 1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed are definitely still visible. 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can definitely be removed GetSnapshotData() updates definitely_needed to be the xmin of the computed snapshot. When testing whether a row can be removed (with GlobalVisTestIsRemovableXid()) and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID < definitely_needed) the boundaries can be recomputed to be more accurate. As it is not cheap to compute accurate boundaries, we limit the number of times that happens in short succession. As the boundaries used by GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by GetSnapshotData()), it is likely that further test can benefit from an earlier computation of accurate horizons. To avoid regressing performance when old_snapshot_threshold is set (as that requires an accurate horizon to be computed), heap_page_prune_opt() doesn't unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the computation of the limited horizon, and the triggering of errors (with SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove tuples. This commit just removes the accesses to PGXACT->xmin from GetSnapshotData(), but other members of PGXACT residing in the same cache line are accessed. Therefore this in itself does not result in a significant improvement. Subsequent commits will take advantage of the fact that GetSnapshotData() now does not need to access xmins anymore. Note: This contains a workaround in heap_page_prune_opt() to keep the snapshot_too_old tests working. While that workaround is ugly, the tests currently are not meaningful, and it seems best to address them separately. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-12 16:03:49 -07:00
Andres Freund	3bd7f9969a	Track latest completed xid as a FullTransactionId. The reason for doing so is that a subsequent commit will need that to avoid wraparound issues. As the subsequent change is large this was split out for easier review. The reason this is not a perfect straight-forward change is that we do not want track 64bit xids in the procarray or the WAL. Therefore we need to advance lastestCompletedXid in relation to 32 bit xids. The code for that is now centralized in MaintainLatestCompletedXid*. Author: Andres Freund Reviewed-By: Thomas Munro, Robert Haas, David Rowley Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-08-11 17:41:18 -07:00
Andres Freund	fea10a6434	Rename VariableCacheData.nextFullXid to nextXid. Including Full in variable names duplicates the type information and leads to overly long names. As FullTransactionId cannot accidentally be casted to TransactionId that does not seem necessary. Author: Andres Freund Discussion: https://postgr.es/m/20200724011143.jccsyvsvymuiqfxu@alap3.anarazel.de	2020-08-11 12:07:14 -07:00
Thomas Munro	84b1c63ad4	Preallocate some DSM space at startup. Create an optional region in the main shared memory segment that can be used to acquire and release "fast" DSM segments, and can benefit from huge pages allocated at cluster startup time, if configured. Fall back to the existing mechanisms when that space is full. The size is controlled by a new GUC min_dynamic_shared_memory, defaulting to 0. Main region DSM segments initially contain whatever garbage the memory held last time they were used, rather than zeroes. That change revealed that DSA areas failed to initialize themselves correctly in memory that wasn't zeroed first, so fix that problem. Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com	2020-07-31 17:49:58 +12:00
Thomas Munro	c5315f4f44	Cache smgrnblocks() results in recovery. Avoid repeatedly calling lseek(SEEK_END) during recovery by caching the size of each fork. For now, we can't use the same technique in other processes, because we lack a shared invalidation mechanism. Do this by generalizing the pre-existing caching used by FSM and VM to support all forks. Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com	2020-07-31 14:29:52 +12:00
Thomas Munro	e2d394df5d	Use WaitLatch() for condition variables. Previously, condition_variable.c created a long lived WaitEventSet to avoid extra system calls. WaitLatch() now uses something similar internally, so there is no point in wasting an extra kernel descriptor. Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com	2020-07-30 17:42:45 +12:00
Thomas Munro	3347c982ba	Use a long lived WaitEventSet for WaitLatch(). Create LatchWaitSet at backend startup time, and use it to implement WaitLatch(). This avoids repeated epoll/kqueue setup and teardown system calls. Reorder SubPostmasterMain() slightly so that we restore the postmaster pipe and Windows signal emulation before we reach InitPostmasterChild(), to make this work in EXEC_BACKEND builds. Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com	2020-07-30 17:40:00 +12:00
Thomas Munro	cb04ad4985	Move syncscan.c to src/backend/access/common. Since the tableam.c code needs to make use of the syncscan.c routines itself, and since other block-oriented AMs might also want to use it one day, it didn't make sense for it to live under src/backend/access/heap. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGLCnG%3DNEAByg6bk%2BCT9JZD97Y%3DAxKhh27Su9FeGWOKvDg%40mail.gmail.com	2020-07-29 16:59:33 +12:00
Thomas Munro	42dee8b8e3	Fix error message. Remove extra space. Back-patch to all releases, like commit `7897e3bb`. Author: Lu, Chenyang <lucy.fnst@cn.fujitsu.com> Discussion: https://postgr.es/m/795d03c6129844d3803e7eea48f5af0d%40G08CNEXMBPEKD04.g08.fujitsu.local	2020-07-23 21:10:49 +12:00
Peter Geoghegan	4a70f829d8	Add nbtree Valgrind buffer lock checks. Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page provides very weak guarantees, especially compared to heapam, where it's often safe to read a page while only holding a buffer pin. This commit has Valgrind enforce the following rule: it is never okay to access an nbtree buffer without holding both a pin and a lock on the buffer. A draft version of this patch detected questionable code that was cleaned up by commits `fa7ff642` and `7154aa16`. The code in question used to access an nbtree buffer page's special/opaque area with no buffer lock (only a buffer pin). This practice (which isn't obviously unsafe) is hereby formally disallowed in nbtree. There doesn't seem to be any reason to allow it, and banning it keeps things simple for Valgrind. The new checks are implemented by adding custom nbtree client requests (located in LockBuffer() wrapper functions); these requests are "superimposed" on top of the generic bufmgr.c Valgrind client requests added by commit `1e0dfd16`. No custom resource management cleanup code is needed to undo the effects of marking buffers as non-accessible under this scheme. Author: Peter Geoghegan Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com	2020-07-21 15:50:58 -07:00
Peter Geoghegan	6ca7cd89a2	Assert that buffer is pinned in LockBuffer(). Strengthen the LockBuffer() assertion that verifies BufferIsValid() by making it verify BufferIsPinned() instead. Do the same in nearby related functions. There is probably not much chance that anybody will try to lock a buffer that is not already pinned, but we might as well make sure of that.	2020-07-20 16:03:38 -07:00
Peter Geoghegan	46ef520b95	Mark buffers as defined to Valgrind consistently. Make PinBuffer() mark buffers as defined to Valgrind unconditionally, including when the buffer header spinlock must be acquired. Failure to handle that case could lead to false positive reports from Valgrind. This theoretically creates a risk that we'll mark buffers defined even when external callers don't end up with a buffer pin. That seems perfectly acceptable, though, since in general we make no guarantees about buffers that are unsafe to access being reliably marked as unsafe. Oversight in commit `1e0dfd16`, which added valgrind buffer access instrumentation.	2020-07-19 09:46:44 -07:00
Peter Geoghegan	1e0dfd166b	Add Valgrind buffer access instrumentation. Teach Valgrind memcheck to maintain the "defined-ness" of each shared buffer based on whether the backend holds at least one pin at the point it is accessed by access method code. Bugs like the one fixed by commit `b0229f26` can be detected using this new instrumentation. Note that backends running with Valgrind naturally have their own independent ideas about whether any given byte in shared memory is safe or unsafe to access. There is no risk that concurrent access by multiple backends to the same shared memory will confuse Valgrind's instrumentation, because everything already works at the process level (or at the memory mapping level, if you prefer). Author: Álvaro Herrera, Peter Geoghegan Reviewed-By: Anastasia Lubennikova Discussion: https://postgr.es/m/20150723195349.GW5596@postgresql.org Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com	2020-07-17 17:49:45 -07:00
Andres Freund	5e7bbb5286	code: replace 'master' with 'primary' where appropriate. Also changed "in the primary" to "on the primary", and added a few "the" before "primary". Author: Andres Freund Reviewed-By: David Steele Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de	2020-07-08 12:57:23 -07:00
Tom Lane	f7b5988cc0	Fix temporary tablespaces for shared filesets some more. Commit `ecd9e9f0b` fixed the problem in the wrong place, causing unwanted side-effects on the behavior of GetNextTempTableSpace(). Instead, let's make SharedFileSetInit() responsible for subbing in the value of MyDatabaseTableSpace when the default tablespace is called for. The convention about what is in the tempTableSpaces[] array is evidently insufficiently documented, so try to improve that. It also looks like SharedFileSetInit() is doing the wrong thing in the case where temp_tablespaces is empty. It was hard-wiring use of the pg_default tablespace, but it seems like using MyDatabaseTableSpace is more consistent with what happens for other temp files. Back-patch the reversion of PrepareTempTablespaces()'s behavior to 9.5, as `ecd9e9f0b` was. The changes in SharedFileSetInit() go back to v11 where that was introduced. (Note there is net zero code change before v11 from these two patch sets, so nothing to release-note.) Magnus Hagander and Tom Lane Discussion: https://postgr.es/m/CABUevExg5YEsOvqMxrjoNvb3ApVyH+9jggWGKwTDFyFCVWczGQ@mail.gmail.com	2020-07-03 17:01:34 -04:00
Andres Freund	cf1234a10e	Fix deadlock danger when atomic ops are done under spinlock. This was a danger only for --disable-spinlocks in combination with atomic operations unsupported by the current platform. While atomics.c was careful to signal that a separate semaphore ought to be used when spinlock emulation is active, spin.c didn't actually implement that mechanism. That's my (Andres') fault, it seems to have gotten lost during the development of the atomic operations support. Fix that issue and add test for nesting atomic operations inside a spinlock. Author: Andres Freund Discussion: https://postgr.es/m/20200605023302.g6v3ydozy5txifji@alap3.anarazel.de Backpatch: 9.5-	2020-06-18 14:08:32 -07:00
Andres Freund	4d4ca24efe	spinlock emulation: Fix bug when more than INT_MAX spinlocks are initialized. Once the counter goes negative we ended up with spinlocks that errored out on first use (due to check in tas_sema). Author: Andres Freund Reviewed-By: Robert Haas Discussion: https://postgr.es/m/20200606023103.avzrctgv7476xj7i@alap3.anarazel.de Backpatch: 9.5-	2020-06-17 12:50:54 -07:00
Andres Freund	fd49d53807	Avoid potential spinlock in a signal handler as part of global barriers. On platforms without support for 64bit atomic operations where we also cannot rely on 64bit reads to have single copy atomicity, such atomics are implemented using a spinlock based fallback. That means it's not safe to even read such atomics from within a signal handler (since the signal handler might run when the spinlock already is held). To avoid this issue defer global barrier processing out of the signal handler. Instead of checking local / shared barrier generation to determine whether to set ProcSignalBarrierPending, introduce PROCSIGNAL_BARRIER and always set ProcSignalBarrierPending when receiving such a signal. Additionally avoid redundant work in ProcessProcSignalBarrier if ProcSignalBarrierPending is unnecessarily. Also do a small amount of other polishing. Author: Andres Freund Reviewed-By: Robert Haas Discussion: https://postgr.es/m/20200609193723.eu5ilsjxwdpyxhgz@alap3.anarazel.de Backpatch: 13-, where the code was introduced.	2020-06-17 12:41:45 -07:00
Peter Eisentraut	a513f1dfbf	Remove STATUS_WAITING Add a separate enum for use in the locking APIs, which were the only user. Discussion: https://www.postgresql.org/message-id/flat/a6f91ead-0ce4-2a34-062b-7ab9813ea308%402ndquadrant.com	2020-06-17 09:14:37 +02:00
Thomas Munro	4dd804a99c	Remove useless variable.	2020-06-16 17:40:06 +12:00
Thomas Munro	f5d18862bb	Make BufFileWrite() void. It now either returns after it wrote all the data you gave it, or raises an error. Not done in back-branches, because it might cause problems for external code. Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com	2020-06-16 17:33:04 +12:00
Thomas Munro	7897e3bb90	Fix buffile.c error handling. Convert buffile.c error handling to use ereport. This fixes cases where I/O errors were indistinguishable from EOF or not reported. Also remove "%m" from error messages where errno would be bogus. While we're modifying those strings, add block numbers and short read byte counts where appropriate. Back-patch to all supported releases. Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com	2020-06-16 16:59:07 +12:00
Thomas Munro	7aa4fb5925	Improve comments for [Heap]CheckForSerializableConflictOut(). Rewrite the documentation of these functions, in light of recent bug fix commit `5940ffb2`. Back-patch to 13 where the check-for-conflict-out code was split up into AM-specific and generic parts, and new documentation was added that now looked wrong. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io	2020-06-12 10:55:38 +12:00
Peter Eisentraut	0fd2a79a63	Spelling adjustments	2020-06-07 15:06:51 +02:00
Peter Eisentraut	42181b1015	Use correct and consistent unit abbreviation	2020-06-01 21:18:36 +02:00
Noah Misch	3350fb5d1f	Clear some style deviations.	2020-05-21 08:31:16 -07:00
Tom Lane	14a9101091	Drop the redundant "Lock" suffix from LWLock wait event names. This was mostly confusing, especially since some wait events in this class had the suffix and some did not. While at it, stop exposing MainLWLockNames[] as a globally visible name; any code using that directly is almost certainly wrong, as its name has been misleading for some time. (GetLWLockIdentifier() is what to use instead.) Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us	2020-05-15 19:55:56 -04:00
Tom Lane	36ac359d36	Rename assorted LWLock tranches. Choose names that fit into the conventions for wait event names (particularly, that multi-word names are in the style MultiWordName) and hopefully convey more information to non-hacker users than the previous names did. Also rename SerializablePredicateLockListLock to SerializablePredicateListLock; the old name was long enough to cause table formatting problems, plus the double occurrence of "Lock" seems confusing/error-prone. Also change a couple of particularly opaque LWLock field names. Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us	2020-05-15 18:11:07 -04:00
Tom Lane	5da14938f7	Rename SLRU structures and associated LWLocks. Originally, the names assigned to SLRUs had no purpose other than being shmem lookup keys, so not a lot of thought went into them. As of v13, though, we're exposing them in the pg_stat_slru view and the pg_stat_reset_slru function, so it seems advisable to take a bit more care. Rename them to names based on the associated on-disk storage directories (which fortunately we did think about, to some extent; since those are also visible to DBAs, consistency seems like a good thing). Also rename the associated LWLocks, since those names are likewise user-exposed now as wait event names. For the most part I only touched symbols used in the respective modules' SimpleLruInit() calls, not the names of other related objects. This renaming could have been taken further, and maybe someday we will do so. But for now it seems undesirable to change the names of any globally visible functions or structs, so some inconsistency is unavoidable. (But I did terminate "oldserxid" with prejudice, as I found that name both unreadable and not descriptive of the SLRU's contents.) Table 27.12 needs re-alphabetization now, but I'll leave that till after the other LWLock renamings I have in mind. Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us	2020-05-15 14:28:25 -04:00
Tom Lane	5cbfce562f	Initial pgindent and pgperltidy run for v13. Includes some manual cleanup of places that pgindent messed up, most of which weren't per project style anyway. Notably, it seems some people didn't absorb the style rules of commit `c9d297751`, because there were a bunch of new occurrences of function calls with a newline just after the left paren, all with faulty expectations about how the rest of the call would get indented.	2020-05-14 13:06:50 -04:00
Tom Lane	29c3e2dd5a	Collect built-in LWLock tranche names statically, not dynamically. There is little point in using the LWLockRegisterTranche mechanism for built-in tranche names. It wastes cycles, it creates opportunities for bugs (since failing to register a tranche name is a very hard-to-detect problem), and the lack of any centralized list of names encourages sloppy nonconformity in name choices. Moreover, since we have a centralized list of the tranches anyway in enum BuiltinTrancheIds, we're certainly not buying any flexibility in return for these disadvantages. Hence, nuke all the backend-internal LWLockRegisterTranche calls, and instead provide a const array of the builtin tranche names. (I have in mind to change a bunch of these names shortly, but this patch is just about getting them into one place.) Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us	2020-05-14 11:10:31 -04:00
Heikki Linnakangas	e8abf585ab	Move check for fsync=off so that pendingOps still gets cleared. Commit `3eb77eba5a` moved the loop and refactored it, and inadvertently changed the effect of fsync=off so that it also skipped removing entries from the pendingOps table. That was not intentional, and leads to an assertion failure if you turn fsync on while the server is running and reload the config. Backpatch-through: 12- Reviewed-By: Thomas Munro Discussion: https://www.postgresql.org/message-id/3cbc7f4b-a5fa-56e9-9591-c886deb07513%40iki.fi	2020-05-14 08:39:26 +03:00
Michael Paquier	e111c9f90a	Remove smgrdounlink() in smgr.c from the code tree The last caller of this routine was removed in `b416691`, and as a wise man said one day, dead code tends to silently break. Per discussion between Fujii Masao, Peter Geoghegan, Vignesh C and me. Reported-by: Peter Geoghegan Discussion: https://postgr.es/m/CAH2-Wz=sg5H8-vG4d5UmAofdcRMpeTDt2K-NUWp4GSfhenRGAQ@mail.gmail.com	2020-05-10 10:58:54 +09:00
Alexander Korotkov	1aac32df89	Revert `0f5ca02f53` `0f5ca02f53` introduces 3 new keywords. It appears to be too much for relatively small feature. Given now we past feature freeze, it's already late for discussion of the new syntax. So, revert. Discussion: https://postgr.es/m/28209.1586294824%40sss.pgh.pa.us	2020-04-08 11:37:27 +03:00
Thomas Munro	3985b600f5	Support PrefetchBuffer() in recovery. Provide PrefetchSharedBuffer(), a variant that takes SMgrRelation, for use in recovery. Rename LocalPrefetchBuffer() to PrefetchLocalBuffer() for consistency. Add a return value to all of these. In recovery, tolerate and report missing files, so we can handle relations unlinked before crash recovery began. Also report cache hits and misses, so that callers can do faster buffer lookups and better I/O accounting. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com	2020-04-08 14:56:57 +12:00
Andres Freund	75848bc744	snapshot scalability: Move delayChkpt from PGXACT to PGPROC. The goal of separating hotly accessed per-backend data from PGPROC into PGXACT is to make accesses fast (GetSnapshotData() in particular). But delayChkpt is not actually accessed frequently; only when starting a checkpoint. As it is frequently modified (multiple times in the course of a single transaction), storing it in the same cacheline as hotly accessed data unnecessarily dirties a contended cacheline. Therefore move delayChkpt to PGPROC. This is part of a larger series of patches intending to improve GetSnapshotData() scalability. It is committed and pushed separately, as it is independently beneficial (small but measurable win, limited by the other frequent modifications of PGXACT). Author: Andres Freund Reviewed-By: Robert Haas, Thomas Munro, David Rowley Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-04-07 17:36:23 -07:00
Alexander Korotkov	0f5ca02f53	Implement waiting for given lsn at transaction start This commit adds following optional clause to BEGIN and START TRANSACTION commands. WAIT FOR LSN lsn [ TIMEOUT timeout ] New clause pospones transaction start till given lsn is applied on standby. This clause allows user be sure, that changes previously made on primary would be visible on standby. New shared memory struct is used to track awaited lsn per backend. Recovery process wakes up backend once required lsn is applied. Author: Ivan Kartyshov, Anna Akenteva Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs Reviewed-by: Amit Kapila, Alexander Korotkov Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru	2020-04-07 23:51:10 +03:00
Noah Misch	c6b92041d3	Skip WAL for new relfilenodes, under wal_level=minimal. Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN. Future servers accept older WAL, so this bump is discretionary. Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org	2020-04-04 12:25:34 -07:00
Peter Eisentraut	552fcebff0	Revert "Improve handling of parameter differences in physical replication" This reverts commit `246f136e76`. That patch wasn't quite complete enough. Discussion: https://www.postgresql.org/message-id/flat/E1jIpJu-0007Ql-CL%40gemulon.postgresql.org	2020-04-04 09:08:12 +02:00
Fujii Masao	18808f8c89	Add wait events for recovery conflicts. This commit introduces new wait events RecoveryConflictSnapshot and RecoveryConflictTablespace. The former is reported while waiting for recovery conflict resolution on a vacuum cleanup. The latter is reported while waiting for recovery conflict resolution on dropping tablespace. Also this commit changes the code so that the wait event Lock is reported while waiting in ResolveRecoveryConflictWithVirtualXIDs() for recovery conflict resolution on a lock. Basically the wait event Lock is reported during that wait, but previously was not reported only when that wait happened in ResolveRecoveryConflictWithVirtualXIDs(). Author: Masahiko Sawada Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/CA+fd4k4mXWTwfQLS3RPwGr4xnfAEs1ysFfgYHvmmoUgv6Zxvmg@mail.gmail.com	2020-04-03 12:15:56 +09:00
Magnus Hagander	087d3d0583	Fix assorted typos Author: Daniel Gustafsson <daniel@yesql.se>	2020-03-31 16:00:06 +02:00
Fujii Masao	64638ccba3	Report waiting via PS while recovery is waiting for buffer pin in hot standby. Previously while the startup process was waiting for the recovery conflict with snapshot, tablespace or lock to be resolved, waiting was reported in PS display, but not in the case of recovery conflict with buffer pin. This commit makes the startup process in hot standby report waiting via PS while waiting for the conflicts with other backends holding buffer pins to be resolved. Author: Masahiko Sawada Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/CA+fd4k4mXWTwfQLS3RPwGr4xnfAEs1ysFfgYHvmmoUgv6Zxvmg@mail.gmail.com	2020-03-30 17:35:03 +09:00
Peter Eisentraut	246f136e76	Improve handling of parameter differences in physical replication When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. The correspondence of settings between primary and standby is required because those settings influence certain shared memory sizings that are required for processing WAL records that the primary might send. For example, if the primary sends a prepared transaction, the standby must have had max_prepared_transaction set appropriately or it won't be able to process those WAL records. However, fatally shutting down the standby immediately upon receipt of the parameter change record might be a bit of an overreaction. The resources related to those settings are not required immediately at that point, and might never be required if the activity on the primary does not exhaust all those resources. If we just let the standby roll on with recovery, it will eventually produce an appropriate error when those resources are used. So this patch relaxes this a bit. Upon receipt of XLOG_PARAMETER_CHANGE, we still check the settings but only issue a warning and set a global flag if there is a problem. Then when we actually hit the resource issue and the flag was set, we issue another warning message with relevant information. At that point we pause recovery, so a hot standby remains usable. We also repeat the last warning message once a minute so it is harder to miss or ignore. Reviewed-by: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com	2020-03-30 09:53:45 +02:00
Magnus Hagander	eff5b245df	Document that pg_checksums exists in checksums README Author: Daniel Gustafsson <daniel@yesql.se>	2020-03-26 15:05:54 +01:00
Tom Lane	bda6dedbea	Go back to returning int from ereport auxiliary functions. This reverts the parts of commit `17a28b0364` that changed ereport's auxiliary functions from returning dummy integer values to returning void. It turns out that a minority of compilers complain (not entirely unreasonably) about constructs such as (condition) ? errdetail(...) : 0 if errdetail() returns void rather than int. We could update those call sites to say "(void) 0" perhaps, but the expectation for this patch set was that ereport callers would not have to change anything. And this aspect of the patch set was already the most invasive and least compelling part of it, so let's just drop it. Per buildfarm. Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com	2020-03-25 11:57:36 -04:00
Tom Lane	17a28b0364	Improve the internal implementation of ereport(). Change all the auxiliary error-reporting routines to return void, now that we no longer need to pretend they are passing something useful to errfinish(). While this probably doesn't save anything significant at the machine-code level, it allows detection of some additional types of mistakes. Pass the error location details (__FILE__, __LINE__, PG_FUNCNAME_MACRO) to errfinish not errstart. This shaves a few cycles off the case where errstart decides we're not going to emit anything. Re-implement elog() as a trivial wrapper around ereport(), removing the separate support infrastructure it used to have. Aside from getting rid of some now-surplus code, this means that elog() now really does have exactly the same semantics as ereport(), in particular that it can skip evaluation work if the message is not to be emitted. Andres Freund and Tom Lane Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com	2020-03-24 12:08:48 -04:00
Noah Misch	de9396326e	Revert "Skip WAL for new relfilenodes, under wal_level=minimal." This reverts commit `cb2fd7eac2`. Per numerous buildfarm members, it was incompatible with parallel query, and a test case assumed LP64. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com	2020-03-22 09:24:09 -07:00
Noah Misch	cb2fd7eac2	Skip WAL for new relfilenodes, under wal_level=minimal. Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Back-patch to 9.5 (all supported versions). This introduces a new WAL record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As always, update standby systems before master systems. This changes sizeof(RelationData) and sizeof(IndexStmt), breaking binary compatibility for affected extensions. (The most recent commit to affect the same class of extensions was 089e4d405d0f3b94c74a2c6a54357a84a681754b.) Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org	2020-03-21 09:38:26 -07:00
Amit Kapila	3ba59ccc89	Allow page lock to conflict among parallel group members. This is required as it is no safer for two related processes to perform clean up in gin indexes at a time than for unrelated processes to do the same. After acquiring page locks, we can acquire relation extension lock but reverse never happens which means these will also not participate in deadlock. So, avoid checking wait edges from this lock. Currently, the parallel mode is strictly read-only, but after this patch we have the infrastructure to allow parallel inserts and parallel copy. Author: Dilip Kumar, Amit Kapila Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com	2020-03-21 08:48:06 +05:30
Amit Kapila	85f6b49c2c	Allow relation extension lock to conflict among parallel group members. This is required as it is no safer for two related processes to extend the same relation at a time than for unrelated processes to do the same. We don't acquire a heavyweight lock on any other object after relation extension lock which means such a lock can never participate in the deadlock cycle. So, avoid checking wait edges from this lock. This provides an infrastructure to allow parallel operations like insert, copy, etc. which were earlier not possible as parallel group members won't conflict for relation extension lock. Author: Dilip Kumar, Amit Kapila Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com	2020-03-20 08:20:56 +05:30
Amit Kapila	72e78d831a	Add assert to ensure that page locks don't participate in deadlock cycle. Assert that we don't acquire any other heavyweight lock while holding the page lock except for relation extension. However, these locks are never taken in reverse order which implies that page locks will never participate in the deadlock cycle. Similar to relation extension, page locks are also held for a short duration, so imposing such a restriction won't hurt. Author: Dilip Kumar, with few changes by Amit Kapila Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com	2020-03-19 08:11:45 +05:30
Amit Kapila	b4f140869f	Add missing errcode() in a few ereport calls. This will allow to specifying SQLSTATE error code for the errors in the missing places. Reported-by: Sawada Masahiko Author: Sawada Masahiko Backpatch-through: 9.5 Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com	2020-03-18 09:27:14 +05:30
Amit Kapila	15ef6ff4b9	Assert that we don't acquire a heavyweight lock on another object after relation extension lock. The only exception to the rule is that we can try to acquire the same relation extension lock more than once. This is allowed as we are not creating any new lock for this case. This restriction implies that the relation extension lock won't ever participate in the deadlock cycle because we can never wait for any other heavyweight lock after acquiring this lock. Such a restriction is okay for relation extension locks as unlike other heavyweight locks these are not held till the transaction end. These are taken for a short duration to extend a particular relation and then released. Author: Dilip Kumar, with few changes by Amit Kapila Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com	2020-03-18 07:20:17 +05:30
Thomas Munro	9b8aa09293	Don't use EV_CLEAR for kqueue events. For the semantics to match the epoll implementation, we need a socket to continue to appear readable/writable if you wait multiple times without doing I/O in between (in Linux terminology: level-triggered rather than edge-triggered). This distinction will be important for later commits. Similar to commit `3b790d256f` for Windows. Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com	2020-03-18 13:05:09 +13:00
Thomas Munro	7bc84a1f30	Fix kqueue support under debugger on macOS. While running under a debugger, macOS's getppid() can return the debugger's PID. That could cause a backend to exit because it falsely believed that the postmaster had died, since commit `815c2f09`. Continue to use getppid() as a fast postmaster check after adding the postmaster's PID to a kqueue, to close a PID-reuse race, but double check that it actually exited by trying to read the pipe. The new check isn't reached in the common case. Reported-by: Alexander Korotkov <a.korotkov@postgrespro.ru> Discussion: https://postgr.es/m/CA%2BhUKGKhAxJ8V8RVwCo6zJaeVrdOG1kFBHGZOOjf6DzW_omeMA%40mail.gmail.com	2020-03-18 13:05:07 +13:00
Thomas Munro	fc34b0d9de	Introduce a maintenance_io_concurrency setting. Introduce a GUC and a tablespace option to control I/O prefetching, much like effective_io_concurrency, but for work that is done on behalf of many client sessions. Use the new setting in heapam.c instead of the hard-coded formula effective_io_concurrency + 10 introduced by commit `558a9165e0`. Go with a default value of 10 for now, because it's a round number pretty close to the value used for that existing case. Discussion: https://postgr.es/m/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com	2020-03-16 17:14:26 +13:00
Thomas Munro	b09ff53667	Simplify the effective_io_concurrency setting. The effective_io_concurrency GUC and equivalent tablespace option were previously passed through a formula based on a theory about RAID spindles and probabilities, to arrive at the number of pages to prefetch in bitmap heap scans. Tomas Vondra, Andres Freund and others argued that it was anachronistic and hard to justify, and commit `558a9165e0` already started down the path of bypassing it in new code. We agreed to drop that logic and use the value directly. For the default setting of 1, there is no change in effect. Higher settings can be converted from the old meaning to the new with: select round(sum(OLD / n::float)) from generate_series(1, OLD) s(n); We might want to consider renaming the GUC before the next release given the change in meaning, but it's not clear that many users had set it very carefully anyway. That decision is deferred for now. Discussion: https://postgr.es/m/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com	2020-03-16 17:14:26 +13:00
Peter Eisentraut	bf68b79e50	Refactor ps_status.c API The init_ps_display() arguments were mostly lies by now, so to match typical usage, just use one argument and let the caller assemble it from multiple sources if necessary. The only user of the additional arguments is BackendInitialize(), which was already doing string assembly on the caller side anyway. Remove the second argument of set_ps_display() ("force") and just handle that in init_ps_display() internally. BackendInitialize() also used to set the initial status as "authentication", but that was very far from where authentication actually happened. So now it's set to "initializing" and then "authentication" just before the actual call to ClientAuthentication(). Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Discussion: https://www.postgresql.org/message-id/flat/c65e5196-4f04-4ead-9353-6088c19615a3@2ndquadrant.com	2020-03-11 16:38:31 +01:00
Peter Eisentraut	aaa3aeddee	Remove HAVE_WORKING_LINK Previously, hard links were not used on Windows and Cygwin, but they support them just fine in currently supported OS versions, so we can use them there as well. Since all supported platforms now support hard links, we can remove the alternative code paths. Rename durable_link_or_rename() to durable_rename_excl() to make the purpose more clear without referencing the implementation details. Discussion: https://www.postgresql.org/message-id/flat/72fff73f-dc9c-4ef4-83e8-d2e60c98df48%402ndquadrant.com	2020-03-11 11:23:04 +01:00
Peter Eisentraut	3c173a53a8	Remove utils/acl.h from catalog/objectaddress.h The need for this was removed by `8b9e9644dc`. A number of files now need to include utils/acl.h or parser/parse_node.h explicitly where they previously got it indirectly somehow. Since parser/parse_node.h already includes nodes/parsenodes.h, the latter is then removed where the former was added. Also, remove nodes/pg_list.h from objectaddress.h, since that's included via nodes/parsenodes.h. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Discussion: https://www.postgresql.org/message-id/flat/7601e258-26b2-8481-36d0-dc9dca6f28f1%402ndquadrant.com	2020-03-10 10:27:00 +01:00
Fujii Masao	17d3fcdc39	Fix bug that causes to report waiting in PS display twice, in hot standby. Previously "waiting" could appear twice via PS in case of lock conflict in hot standby mode. Specifically this issue happend when the delay in WAL application determined by max_standby_archive_delay and max_standby_streaming_delay had passed but it took more than 500 msec to cancel all the conflicting transactions. Especially we can observe this easily by setting those delay parameters to -1. The cause of this issue was that WaitOnLock() and ResolveRecoveryConflictWithVirtualXIDs() added "waiting" to the process title in that case. This commit prevents ResolveRecoveryConflictWithVirtualXIDs() from reporting waiting in case of lock conflict, to fix the bug. Back-patch to all back branches. Author: Masahiko Sawada Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/CA+fd4k4mXWTwfQLS3RPwGr4xnfAEs1ysFfgYHvmmoUgv6Zxvmg@mail.gmail.com	2020-03-10 00:14:43 +09:00
Robert Haas	05d8449e73	Move src/backend/utils/hash/hashfn.c to src/common This also involves renaming src/include/utils/hashutils.h, which becomes src/include/common/hashfn.h. Perhaps an argument can be made for keeping the hashutils.h name, but it seemed more consistent to make it match the name of the file, and also more descriptive of what is actually going on here. Patch by me, reviewed by Suraj Kharage and Mark Dilger. Off-list advice on how not to break the Windows build from Davinder Singh and Amit Kapila. Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com	2020-02-27 09:25:41 +05:30
Peter Geoghegan	0d861bbb70	Add deduplication to nbtree. Deduplication reduces the storage overhead of duplicates in indexes that use the standard nbtree index access method. The deduplication process is applied lazily, after the point where opportunistic deletion of LP_DEAD-marked index tuples occurs. Deduplication is only applied at the point where a leaf page split would otherwise be required. New posting list tuples are formed by merging together existing duplicate tuples. The physical representation of the items on an nbtree leaf page is made more space efficient by deduplication, but the logical contents of the page are not changed. Even unique indexes make use of deduplication as a way of controlling bloat from duplicates whose TIDs point to different versions of the same logical table row. The lazy approach taken by nbtree has significant advantages over a GIN style eager approach. Most individual inserts of index tuples have exactly the same overhead as before. The extra overhead of deduplication is amortized across insertions, just like the overhead of page splits. The key space of indexes works in the same way as it has since commit `dd299df8` (the commit that made heap TID a tiebreaker column). Testing has shown that nbtree deduplication can generally make indexes with about 10 or 15 tuples for each distinct key value about 2.5X - 4X smaller, even with single column integer indexes (e.g., an index on a referencing column that accompanies a foreign key). The final size of single column nbtree indexes comes close to the final size of a similar contrib/btree_gin index, at least in cases where GIN's posting list compression isn't very effective. This can significantly improve transaction throughput, and significantly reduce the cost of vacuuming indexes. A new index storage parameter (deduplicate_items) controls the use of deduplication. The default setting is 'on', so all new B-Tree indexes automatically use deduplication where possible. This decision will be reviewed at the end of the Postgres 13 beta period. There is a regression of approximately 2% of transaction throughput with synthetic workloads that consist of append-only inserts into a table with several non-unique indexes, where all indexes have few or no repeated values. The underlying issue is that cycles are wasted on unsuccessful attempts at deduplicating items in non-unique indexes. There doesn't seem to be a way around it short of disabling deduplication entirely. Note that deduplication of items in unique indexes is fairly well targeted in general, which avoids the problem there (we can use a special heuristic to trigger deduplication passes in unique indexes, since we're specifically targeting "version bloat"). Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed. No bump in BTREE_VERSION, since the representation of posting list tuples works in a way that's backwards compatible with version 4 indexes (i.e. indexes built on PostgreSQL 12). However, users must still REINDEX a pg_upgrade'd index to use deduplication, regardless of the Postgres version they've upgraded from. This is the only way to set the new nbtree metapage flag indicating that deduplication is generally safe. Author: Anastasia Lubennikova, Peter Geoghegan Reviewed-By: Peter Geoghegan, Heikki Linnakangas Discussion: https://postgr.es/m/55E4051B.7020209@postgrespro.ru https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru	2020-02-26 13:05:30 -08:00
Tom Lane	3d475515a1	Account explicitly for long-lived FDs that are allocated outside fd.c. The comments in fd.c have long claimed that all file allocations should go through that module, but in reality that's not always practical. fd.c doesn't supply APIs for invoking some FD-producing syscalls like pipe() or epoll_create(); and the APIs it does supply for non-virtual FDs are mostly insistent on releasing those FDs at transaction end; and in some cases the actual open() call is in code that can't be made to use fd.c, such as libpq. This has led to a situation where, in a modern server, there are likely to be seven or so long-lived FDs per backend process that are not known to fd.c. Since NUM_RESERVED_FDS is only 10, that meant we had very few spare FDs if max_files_per_process is >= the system ulimit and fd.c had opened all the files it thought it safely could. The contrib/postgres_fdw regression test, in particular, could easily be made to fall over by running it under a restrictive ulimit. To improve matters, invent functions Acquire/Reserve/ReleaseExternalFD that allow outside callers to tell fd.c that they have or want to allocate a FD that's not directly managed by fd.c. Add calls to track all the fixed FDs in a standard backend session, so that we are honestly guaranteeing that NUM_RESERVED_FDS FDs remain unused below the EMFILE limit in a backend's idle state. The coding rules for these functions say that there's no need to call them in code that just allocates one FD over a fairly short interval; we can dip into NUM_RESERVED_FDS for such cases. That means that there aren't all that many places where we need to worry. But postgres_fdw and dblink must use this facility to account for long-lived FDs consumed by libpq connections. There may be other places where it's worth doing such accounting, too, but this seems like enough to solve the immediate problem. Internally to fd.c, "external" FDs are limited to max_safe_fds/3 FDs. (Callers can choose to ignore this limit, but of course it's unwise to do so except for fixed file allocations.) I also reduced the limit on "allocated" files to max_safe_fds/3 FDs (it had been max_safe_fds/2). Conceivably a smarter rule could be used here --- but in practice, on reasonable systems, max_safe_fds should be large enough that this isn't much of an issue, so KISS for now. To avoid possible regression in the number of external or allocated files that can be opened, increase FD_MINFREE and the lower limit on max_files_per_process a little bit; we now insist that the effective "ulimit -n" be at least 64. This seems like pretty clearly a bug fix, but in view of the lack of field complaints, I'll refrain from risking a back-patch. Discussion: https://postgr.es/m/E1izCmM-0005pV-Co@gemulon.postgresql.org	2020-02-24 17:28:33 -05:00
Amit Kapila	3dfba9fdf5	Fix typos. Reported-by: Justin Pryzby Author: Justin Pryzby Discussion: https://postgr.es/m/20200206021432.GA24549@telsasoft.com	2020-02-10 09:31:18 +05:30
Michael Paquier	5ac4e9a12c	Fix typo in proc.c Author: Julien Rouhaud Discussion: https://postgr.es/m/20200206082333.GA95343@nol	2020-02-07 12:41:10 +09:00
Fujii Masao	3ccc66dac6	Fix bug in LWLock statistics mechanism. Previously PostgreSQL built with -DLWLOCK_STATS could report more than one LWLock statistics entries for the same backend process and the same LWLock. This is strange and only one statistics should be output in that case, instead. The cause of this issue is that the key variable used for LWLock stats hash table was not fully initialized. The key consists of two fields and they were initialized. But the following 4 bytes allocated in the key variable for the alignment was not initialized. So even if the same key was specified, hash_search(HASH_ENTER) could not find the existing entry for that key and created new one. This commit fixes this issue by initializing the key variable with zero. As the side effect of this commit, the volume of LWLock statistics output would be reduced very much. Back-patch to v10, where commit `3761fe3c20` introduced the issue. Author: Fujii Masao Reviewed-by: Julien Rouhaud, Kyotaro Horiguchi Discussion: https://postgr.es/m/26359edb-798a-568f-d93a-6aafac49752d@oss.nttdata.com	2020-02-06 14:43:21 +09:00
Thomas Munro	815c2f0972	Add kqueue(2) support to the WaitEventSet API. Use kevent(2) to wait for events on the BSD family of operating systems and macOS. This is similar to the epoll(2) support added for Linux by commit `98a64d0bd`. Author: Thomas Munro Reviewed-by: Andres Freund, Marko Tiikkaja, Tom Lane Tested-by: Mateusz Guzik, Matteo Beccati, Keith Fiske, Heikki Linnakangas, Michael Paquier, Peter Eisentraut, Rui DeSousa, Tom Lane, Mark Wong Discussion: https://postgr.es/m/CAEepm%3D37oF84-iXDTQ9MrGjENwVGds%2B5zTr38ca73kWR7ez_tA%40mail.gmail.com	2020-02-05 17:35:57 +13:00
Michael Paquier	f1f10a1ba9	Add declaration-level assertions for compile-time checks Those new assertions can be used at file scope, outside of any function for compilation checks. This commit provides implementations for C and C++, and fallback implementations. Author: Peter Smith Reviewed-by: Andres Freund, Kyotaro Horiguchi, Dagfinn Ilmari Mannsåker, Michael Paquier Discussion: https://postgr.es/m/201DD0641B056142AC8C6645EC1B5F62014B8E8030@SYD1217	2020-02-03 14:48:42 +09:00
Thomas Munro	93745f1e01	Fix memory leak on DSM slot exhaustion. If we attempt to create a DSM segment when no slots are available, we should return the memory to the operating system. Previously we did that if the DSM_CREATE_NULL_IF_MAXSEGMENTS flag was passed in, but we didn't do it if an error was raised. Repair. Back-patch to 9.4, where DSM segments arrived. Author: Thomas Munro Reviewed-by: Robert Haas Reported-by: Julian Backes Discussion: https://postgr.es/m/CA%2BhUKGKAAoEw-R4om0d2YM4eqT1eGEi6%3DQot-3ceDR-SLiWVDw%40mail.gmail.com	2020-02-01 14:29:13 +13:00
Thomas Munro	ef02fb15a3	Report time spent in posix_fallocate() as a wait event. When allocating DSM segments with posix_fallocate() on Linux (see commit `899bd785`), report this activity as a wait event exactly as we would if we were using file-backed DSM rather than shm_open()-backed DSM. Author: Thomas Munro Discussion: https://postgr.es/m/CA%2BhUKGKCSh4GARZrJrQZwqs5SYp0xDMRr9Bvb%2BHQzJKvRgL6ZA%40mail.gmail.com	2020-01-31 17:29:41 +13:00
Thomas Munro	d061ea21fc	Adjust DSM and DSA slot usage constants. When running a lot of large parallel queries concurrently, or a plan with a lot of separate Gather nodes, it is possible to run out of DSM slots. There are better solutions to these problems requiring architectural redesign work, but for now, let's adjust the constants so that it's more difficult to hit the limit. 1. Previously, a DSA area would create up to four segments at each size before doubling the size. After this commit, it will create only two at each size, so it ramps up faster and therefore needs fewer slots. 2. Previously, the total limit on DSM slots allowed for 2 per connection. Switch to 5 per connection. Also remove an obsolete nearby comment. Author: Thomas Munro Reviewed-by: Robert Haas, Andres Freund Discussion: https://postre.es/m/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com	2020-01-31 17:29:38 +13:00
Alvaro Herrera	4e89c79a52	Remove excess parens in ereport() calls Cosmetic cleanup, not worth backpatching. Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql Reviewed-by: Tom Lane, Michael Paquier	2020-01-30 13:32:04 -03:00
Thomas Munro	78aaa0e823	Don't reset latch in ConditionVariablePrepareToSleep(). It's not OK to do that without calling CHECK_FOR_INTERRUPTS(). Let the next wait loop deal with it, following the usual pattern. One consequence of this bug was that a SIGTERM delivered in a very narrow timing window could leave a parallel worker process waiting forever for a condition variable that will never be signaled, after an error was raised in other process. The code is a bit different in the stable branches due to commit `1321509f`, making problems less likely there. No back-patch for now, but we may finish up deciding to make a similar change after more discussion. Author: Thomas Munro Reviewed-by: Shawn Debnath Reported-by: Tomas Vondra Discussion: https://postgr.es/m/CA%2BhUKGJOm8zZHjVA8svoNT3tHY0XdqmaC_kHitmgXDQM49m1dA%40mail.gmail.com	2020-01-28 15:28:36 +13:00
Thomas Munro	6f38d4dac3	Remove dependency on HeapTuple from predicate locking functions. The following changes make the predicate locking functions more generic and suitable for use by future access methods: - PredicateLockTuple() is renamed to PredicateLockTID(). It takes ItemPointer and inserting transaction ID instead of HeapTuple. - CheckForSerializableConflictIn() takes blocknum instead of buffer. - CheckForSerializableConflictOut() no longer takes HeapTuple or buffer. Author: Ashwin Agrawal Reviewed-by: Andres Freund, Kuntal Ghosh, Thomas Munro Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com	2020-01-28 13:13:04 +13:00
Thomas Munro	f37ff03478	Refactor confusing code in _mdfd_openseg(). As reported independently by a couple of people, _mdfd_openseg() is coded in a way that seems to imply that the segments could be opened in an order that isn't strictly sequential. Even if that were true, it's also using the wrong comparison. It's not an active bug, since the condition is always true anyway, but it's confusing, so replace it with an assertion. Author: Thomas Munro Reviewed-by: Andres Freund, Kyotaro Horiguchi, Noah Misch Discussion: https://postgr.es/m/CA%2BhUKG%2BNBw%2BuSzxF1os-SO6gUuw%3DcqO5DAybk6KnHKzgGvxhxA%40mail.gmail.com Discussion: https://postgr.es/m/20191222091930.GA1280238%40rfd.leadboat.com	2020-01-27 09:12:56 +13:00
Fujii Masao	d694e0bb79	Add pg_file_sync() to adminpack extension. This function allows us to fsync the specified file or directory. It's useful, for example, when we want to sync the file that pg_file_write() writes out or that COPY TO exports the data into, for durability. Author: Fujii Masao Reviewed-By: Julien Rouhaud, Arthur Zakirov, Michael Paquier, Atsushi Torikoshi Discussion: https://www.postgresql.org/message-id/CAHGQGwGY8uzZ_k8dHRoW1zDcy1Z7=5GQ+So4ZkVy2u=nLsk=hA@mail.gmail.com	2020-01-24 20:42:52 +09:00
Peter Eisentraut	c096a804d9	Remove STATUS_FOUND Replace the solitary use with a bool. Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/flat/a6f91ead-0ce4-2a34-062b-7ab9813ea308%402ndquadrant.com	2020-01-11 07:48:57 +01:00
Noah Misch	38fc056074	Maintain valid md.c state when FileClose() fails. FileClose() failure ordinarily causes a PANIC. Suppose the user disables that PANIC via data_sync_retry=on. After mdclose() issued a FileClose() that failed, calls into md.c raised SIGSEGV. This fix adds repalloc() calls during mdclose(); update a comment about ignoring repalloc() cost. The rate of relation segment count change is a minor factor; more relevant to overall performance is the rate of mdclose() and subsequent re-opening of segments. Back-patch to v10, where commit `45e191e3aa` introduced the bug. Reviewed by Kyotaro Horiguchi. Discussion: https://postgr.es/m/20191222091930.GA1280238@rfd.leadboat.com	2020-01-10 18:31:22 -08:00
Robert Haas	ed10f32e37	Add pg_shmem_allocations view. This tells you about allocations that have been made from the main shared memory segment. The original patch also tried to show information about dynamic shared memory allocation as well, but I decided to leave that problem for another time. Andres Freund and Robert Haas, reviewed by Michael Paquier, Marti Raudsepp, Tom Lane, Álvaro Herrera, and Kyotaro Horiguchi. Discussion: http://postgr.es/m/20140504114417.GM12715@awork2.anarazel.de	2020-01-09 10:59:07 -05:00
Bruce Momjian	7559d8ebfa	Update copyrights for 2020 Backpatch-through: update all files in master, backpatch legal files through 9.4	2020-01-01 12:21:45 -05:00
Michael Paquier	7854e07f25	Revert "Rename files and headers related to index AM" This follows multiple complains from Peter Geoghegan, Andres Freund and Alvaro Herrera that this issue ought to be dug more before actually happening, if it happens. Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql	2019-12-27 08:09:00 +09:00
Michael Paquier	8ce3aa9b59	Rename files and headers related to index AM The following renaming is done so as source files related to index access methods are more consistent with table access methods (the original names used for index AMs ware too generic, and could be confused as including features related to table AMs): - amapi.h -> indexam.h. - amapi.c -> indexamapi.c. Here we have an equivalent with backend/access/table/tableamapi.c. - amvalidate.c -> indexamvalidate.c. - amvalidate.h -> indexamvalidate.h. - genam.c -> indexgenam.c. - genam.h -> indexgenam.h. This has been discussed during the development of v12 when table AM was worked on, but the renaming never happened. Author: Michael Paquier Reviewed-by: Fabien Coelho, Julien Rouhaud Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz	2019-12-25 10:23:39 +09:00
Robert Haas	16a4e4aecd	Extend the ProcSignal mechanism to support barriers. A new function EmitProcSignalBarrier() can be used to emit a global barrier which all backends that participate in the ProcSignal mechanism must absorb, and a new function WaitForProcSignalBarrier() can be used to wait until all relevant backends have in fact absorbed the barrier. This can be used to coordinate global state changes, such as turning checksums on while the system is running. There's no real client of this mechanism yet, although two are proposed, but an enum has to have at least one element, so this includes a placeholder type (PROCSIGNAL_BARRIER_PLACEHOLDER) which should be replaced by the first real client of this mechanism to get committed. Andres Freund and Robert Haas, reviewed by Daniel Gustafsson and, in earlier versions, by Magnus Hagander. Discussion: http://postgr.es/m/CA+TgmoZwDk=BguVDVa+qdA6SBKef=PKbaKDQALTC_9qoz1mJqg@mail.gmail.com	2019-12-19 14:56:20 -05:00
Thomas Munro	7c85be08a2	Fix mdsyncfiletag(), take II. The previous commit failed to consider that FileGetRawDesc() might not return a valid fd, as discovered on the build farm. Switch to using the File interface only. Back-patch to 12, like the previous commit.	2019-12-14 18:35:58 +13:00
Thomas Munro	7bb3102cea	Don't use _mdfd_getseg() in mdsyncfiletag(). _mdfd_getseg() opens all segments up to the requested one. That causes problems for mdsyncfiletag(), if mdunlinkfork() has already unlinked other segment files. Open the file we want directly by name instead, if it's not already open. The consequence of this bug was a rare panic in the checkpointer, made more likely if you saturated the sync request queue so that the SYNC_FORGET_REQUEST messages for a given relation were more likely to be absorbed in separate cycles by the checkpointer. Back-patch to 12. Defect in commit `3eb77eba`. Author: Thomas Munro Reported-by: Justin Pryzby Discussion: https://postgr.es/m/20191119115759.GI30362%40telsasoft.com	2019-12-14 16:32:03 +13:00
Michael Paquier	12198239c0	Add safeguards for pg_fsync() called with incorrectly-opened fds On some platforms, fsync() returns EBADFD when opening a file descriptor with O_RDONLY (read-only), leading ultimately now to a PANIC to prevent data corruption. This commit adds a new sanity check in pg_fsync() based on fcntl() to make sure that we don't repeat again mistakes with incorrectly-set file descriptors so as problems are detected at an early stage. Without that, such errors could only be detected after running Postgres on a specific supported platform for the culprit code path, which could take some time before being found. `b8e19b93` was a fix for such a problem, which got undetected for more than 5 years, and `a586cc4b` fixed another similar issue. Note that the new check added works as well when fsync=off is configured, so as all regression tests would detect problems as long as assertions are enabled. fcntl() being not available on Windows, the new checks do not happen there. Author: Michael Paquier Reviewed-by: Mark Dilger Discussion: https://postgr.es/m/20191009062640.GB21379@paquier.xyz	2019-11-26 13:32:52 +09:00
Amit Kapila	1379fd537f	Introduce the 'force' option for the Drop Database command. This new option terminates the other sessions connected to the target database and then drop it. To terminate other sessions, the current user must have desired permissions (same as pg_terminate_backend()). We don't allow to terminate the sessions if prepared transactions, active logical replication slots or subscriptions are present in the target database. Author: Pavel Stehule with changes by me Reviewed-by: Dilip Kumar, Vignesh C, Ibrar Ahmed, Anthony Nowocien, Ryan Lambert and Amit Kapila Discussion: https://postgr.es/m/CAP_rwwmLJJbn70vLOZFpxGw3XD7nLB_7+NKz46H5EOO2k5H7OQ@mail.gmail.com	2019-11-13 08:25:33 +05:30
Amit Kapila	14aec03502	Make the order of the header file includes consistent in backend modules. Similar to commits `7e735035f2` and `dddf4cdc33`, this commit makes the order of header file inclusion consistent for backend modules. In the passing, removed a couple of duplicate inclusions. Author: Vignesh C Reviewed-by: Kuntal Ghosh and Amit Kapila Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com	2019-11-12 08:30:16 +05:30
Thomas Munro	db2687d1f3	Optimize PredicateLockTuple(). PredicateLockTuple() has a fast exit if tuple was written by the current transaction, as in that case it already has a lock. This check can be performed using TransactionIdIsCurrentTransactionId() instead of SubTransGetTopmostTransaction(), to avoid any chance of having to hit the disk. Author: Ashwin Agrawal, based on a suggestion from Andres Freund Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com	2019-11-11 17:06:59 +13:00
Andres Freund	01368e5d9d	Split all OBJS style lines in makefiles into one-line-per-entry style. When maintaining or merging patches, one of the most common sources for conflicts are the list of objects in makefiles. Especially when the split across lines has been changed on both sides, which is somewhat common due to attempting to stay below 80 columns, those conflicts are unnecessarily laborious to resolve. By splitting, and alphabetically sorting, OBJS style lines into one object per line, conflicts should be less frequent, and easier to resolve when they still occur. Author: Andres Freund Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de	2019-11-05 14:41:07 -08:00
Michael Paquier	6ca86bb7e9	Fix typos in the code Author: Vignesh C Reviewed-by: Dilip Kumar, Michael Paquier Discussion: https://postgr.es/m/CALDaNm0ni+GAOe4+fbXiOxNrVudajMYmhJFtXGX-zBPoN8ixhw@mail.gmail.com	2019-10-30 10:03:00 +09:00
Peter Eisentraut	5d3587d14b	Fix most -Wundef warnings In some cases #if was used instead of #ifdef in an inconsistent style. Cleaning this up also helps when analyzing cases like `38d8dce61f` where this makes a difference. There are no behavior changes here, but the change in pg_bswap.h would prevent possible accidental misuse by third-party code. Discussion: https://www.postgresql.org/message-id/flat/3b615ca5-c595-3f1d-fdf7-a429e564f614%402ndquadrant.com	2019-10-19 18:31:38 +02:00
Noah Misch	48cc59ed24	Use standard compare_exchange loop style in ProcArrayGroupClearXid(). Besides style, this might improve performance in the contended case. Reviewed by Amit Kapila. Discussion: https://postgr.es/m/20191015035348.GA4166224@rfd.leadboat.com	2019-10-18 20:21:10 -07:00
Alvaro Herrera	0d21f919eb	Fix crash when reporting CREATE INDEX progress A race condition can make us try to dereference a NULL pointer to the PGPROC struct of a process that's already finished. That results in crashes during REINDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY. This was introduced in `ab0dfc961b`, so backpatch to pg12. Reported by: Justin Pryzby Reviewed-by: Michaël Paquier Discussion: https://postgr.es/m/20191012004446.GT10470@telsasoft.com	2019-10-16 14:51:34 +02:00
Robert Haas	2e8b6bfa90	Rename some toasting functions based on whether they are heap-specific. The old names for the attribute-detoasting functions names included the word "heap," which seems outdated now that the heap is only one of potentially many table access methods. On the other hand, toast_insert_or_update and toast_delete are heap-specific, so rename them by adding "heap_" as a prefix. Not all of the work of making the TOAST system fully accessible to AMs other than the heap is done yet, but there seems to be little harm in getting this renaming out of the way now. Commit `8b94dab066` already divided up the functions among various files partially according to whether it was intended that they should be heap-specific or AM-agnostic, so this is just clarifying the division contemplated by that commit. Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro, Andres Freund, and Álvaro Herrera. Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com	2019-10-04 14:24:46 -04:00
Fujii Masao	6d05086c0a	Speedup truncations of relation forks. When a relation is truncated, shared_buffers needs to be scanned so that any buffers for the relation forks are invalidated in it. Previously, shared_buffers was scanned for each relation forks, i.e., MAIN, FSM and VM, when VACUUM truncated off any empty pages at the end of relation or TRUNCATE truncated the relation in place. Since shared_buffers needed to be scanned multiple times, it could take a long time to finish those commands especially when shared_buffers was large. This commit changes the logic so that shared_buffers is scanned only one time for those three relation forks. Author: Kirk Jamison Reviewed-by: Masahiko Sawada, Thomas Munro, Alvaro Herrera, Takayuki Tsunakawa and Fujii Masao Discussion: https://postgr.es/m/D09B13F772D2274BB348A310EE3027C64E2067@g01jpexmbkw24	2019-09-24 17:31:26 +09:00
Peter Eisentraut	887248e97e	Message style fixes	2019-09-23 13:38:39 +02:00
Fujii Masao	33a94bae60	Remove unused smgrdounlinkfork() function. smgrdounlinkfork() became dead code as the result of commit `ece01aae47`, but it was left in place just in case we want it someday. However no users have appeared in 7 years, so it's time to remove this unused function. Author: Kirk Jamison Discussion: https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C64E2067@g01jpexmbkw24	2019-09-18 21:05:33 +09:00
Tom Lane	9a86f03b4e	Rearrange postmaster's startup sequence for better syslogger results. This is a second try at what commit `57431a911` tried to do, namely, launch the syslogger before we open postmaster sockets so that our messages about the sockets end up in the syslogger files. That commit fell foul of a bunch of subtle issues caused by trying to launch a postmaster child process before creating shared memory. Rather than messing with that interaction, let's postpone opening the sockets till after we launch the syslogger. This would not have been terribly safe before commit `7de19fbc0`, because we relied on socket opening to detect whether any competing postmasters were using the same port number. But now that we choose IPC keys without regard to the port number, there's no interaction to worry about. Also delay creation of the external PID file (if requested) till after the sockets are open, since external code could plausibly be relying on that ordering of events. And postpone most of the work of RemovePgTempFiles() so that that potentially-slow processing still happens after we make the external PID file. We have to be a bit careful about that last though: as noted in the discussion subsequent to bug #15804, EXEC_BACKEND builds still have to clear the parameter-file temp dir before launching the syslogger. Patch by me; thanks to Michael Paquier for review/testing. Discussion: https://postgr.es/m/15804-3721117bf40fb654@postgresql.org	2019-09-11 11:43:01 -04:00
Tom Lane	7de19fbc0b	Use data directory inode number, not port, to select SysV resource keys. This approach provides a much tighter binding between a data directory and the associated SysV shared memory block (and SysV or named-POSIX semaphores, if we're using those). Key collisions are still possible, but only between data directories stored on different filesystems, so the situation should be negligible in practice. More importantly, restarting the postmaster with a different port number no longer risks failing to identify a relevant shared memory block, even when postmaster.pid has been removed. A standalone backend is likewise much more certain to detect conflicting leftover backends. (In the longer term, we might now think about deprecating the port as a cluster-wide value, so that one postmaster could support sockets with varying port numbers. But that's for another day.) The hazards fixed here apply only on Unix systems; our Windows code paths already use identifiers derived from the data directory path name rather than the port. src/test/recovery/t/017_shm.pl, which intends to test key-collision cases, has been substantially rewritten since it can no longer use two postmasters with identical port numbers to trigger the case. Instead, use Perl's IPC::SharedMem module to create a conflicting shmem segment directly. The test script will be skipped if that module is not available. (This means that some older buildfarm members won't run it, but I don't think that that results in any meaningful coverage loss.) Patch by me; thanks to Noah Misch and Peter Eisentraut for discussion and review. Discussion: https://postgr.es/m/16908.1557521200@sss.pgh.pa.us	2019-09-05 13:31:46 -04:00
Robert Haas	8b94dab066	Split tuptoaster.c into three separate files. detoast.c/h contain functions required to detoast a datum, partially or completely, plus a few other utility functions for examining the size of toasted datums. toast_internals.c/h contain functions that are used internally to the TOAST subsystem but which (mostly) do not need to be accessed from outside. heaptoast.c/h contains code that is intrinsically specific to the heap AM, either because it operates on HeapTuples or is based on the layout of a heap page. detoast.c and toast_internals.c are placed in src/backend/access/common rather than src/backend/access/heap. At present, both files still have dependencies on the heap, but that will be improved in a future commit. Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro, Andres Freund, and Álvaro Herrera. Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com	2019-09-05 13:15:10 -04:00
Michael Paquier	c96581abe4	Fix inconsistencies and typos in the tree, take 11 This fixes various typos in docs and comments, and removes some orphaned definitions. Author: Alexander Lakhin Discussion: https://postgr.es/m/5da8e325-c665-da95-21e0-c8a99ea61fbf@gmail.com	2019-08-19 16:21:39 +09:00
Michael Paquier	66bde49d96	Fix inconsistencies and typos in the tree, take 10 This addresses some issues with unnecessary code comments, fixes various typos in docs and comments, and removes some orphaned structures and definitions. Author: Alexander Lakhin Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com	2019-08-13 13:53:41 +09:00
Michael Paquier	8548ddc61b	Fix inconsistencies and typos in the tree, take 9 This addresses more issues with code comments, variable names and unreferenced variables. Author: Alexander Lakhin Discussion: https://postgr.es/m/7ab243e0-116d-3e44-d120-76b3df7abefd@gmail.com	2019-08-05 12:14:58 +09:00
Andres Freund	870b1d6800	Remove superfluous newlines in function prototypes. These were introduced by pgindent due to fixe to broken indentation (c.f. `8255c7a5ee`). Previously the mis-indentation of function prototypes was creatively used to reduce indentation in a few places. As that formatting only exists in master and REL_12_STABLE, it seems better to fix it in both, rather than having some odd indentation in v12 that somebody might copy for future patches or such. Author: Andres Freund Discussion: https://postgr.es/m/20190728013754.jwcbe5nfyt3533vx@alap3.anarazel.de Backpatch: 12-	2019-07-31 00:05:21 -07:00
Tom Lane	3420851a2c	Fix busted logic for parallel lock grouping in TopoSort(). A "break" statement erroneously left behind by commit `a1c1af2a1` caused TopoSort to do the wrong thing if a lock's wait list contained multiple members of the same locking group. Because parallel workers don't normally need any locks not already taken by their leader, this is very hard --- maybe impossible --- to hit in production. Still, if it did happen, the queries involved in an otherwise-resolvable deadlock would block until canceled. In addition to removing the bogus "break", add an Assert showing that the conflicting uses of the beforeConstraints[] array (for both counts and flags) don't overlap, and add some commentary explaining why not; because it's not obvious without explanation, IMHO. Original report and patch from Rui Hai Jiang; additional assert and commentary by me. Back-patch to 9.6 where the bug came in. Discussion: https://postgr.es/m/CAEri+mLd3bpHLyW+a9pSe1y=aEkeuJpwBSwvo-+m4n7-ceRmXw@mail.gmail.com	2019-07-29 18:49:04 -04:00
Michael Paquier	eb43f3d193	Fix inconsistencies and typos in the tree This is numbered take 8, and addresses again a set of issues with code comments, variable names and unreferenced variables. Author: Alexander Lakhin Discussion: https://postgr.es/m/b137b5eb-9c95-9c2f-586e-38aba7d59788@gmail.com	2019-07-29 12:28:30 +09:00
Michael Paquier	b7a82317b6	Fix typo in fd.c The frontend version of walkdir() is defined in file_utils.c, and not initdb.c. Author: Sehrope Sarkuni Discussion: https://postgr.es/m/CAH7T-artawnBt4=KODNCD8Mt2ZX4CCjJT8c=_=950xjutcRZ4Q@mail.gmail.com	2019-07-28 16:21:53 +09:00
David Rowley	1e6a759838	Use appendBinaryStringInfo in more places where the length is known When we already know the length that we're going to append, then it makes sense to use appendBinaryStringInfo instead of appendStringInfoString so that the append can be performed with a simple memcpy() using a known length rather than having to first perform a strlen() call to obtain the length. Discussion: https://postgr.es/m/CAKJS1f8+FRAM1s5+mAa3isajeEoAaicJ=4e0WzrH3tAusbbiMQ@mail.gmail.com	2019-07-23 00:14:11 +12:00
Michael Paquier	23bccc823d	Fix inconsistencies and typos in the tree This is numbered take 7, and addresses a set of issues with code comments, variable names and unreferenced variables. Author: Alexander Lakhin Discussion: https://postgr.es/m/dff75442-2468-f74f-568c-6006e141062f@gmail.com	2019-07-22 10:01:50 +09:00
Thomas Munro	dfd0121dc7	Move some md.c-specific logic from smgr.c to md.c. Potential future SMGR implementations may not want to create tablespace directories when creating an SMGR relation. Move that logic to mdcreate(). Move the initialization of md-specific data structures from smgropen() to a new callback mdopen(). Author: Thomas Munro Reviewed-by: Shawn Debnath (as part of an earlier patch set) Discussion: https://postgr.es/m/CA%2BhUKG%2BOZqOiOuDm5tC5DyQZtJ3FH4%2BFSVMqtdC4P1atpJ%2Bqhg%40mail.gmail.com	2019-07-17 15:00:22 +12:00
Michael Paquier	0896ae561b	Fix inconsistencies and typos in the tree This is numbered take 7, and addresses a set of issues around: - Fixes for typos and incorrect reference names. - Removal of unneeded comments. - Removal of unreferenced functions and structures. - Fixes regarding variable name consistency. Author: Alexander Lakhin Discussion: https://postgr.es/m/10bfd4ac-3e7c-40ab-2b2e-355ed15495e8@gmail.com	2019-07-16 13:23:53 +09:00
Tom Lane	1cff1b95ab	Represent Lists as expansible arrays, not chains of cons-cells. Originally, Postgres Lists were a more or less exact reimplementation of Lisp lists, which consist of chains of separately-allocated cons cells, each having a value and a next-cell link. We'd hacked that once before (commit `d0b4399d8`) to add a separate List header, but the data was still in cons cells. That makes some operations -- notably list_nth() -- O(N), and it's bulky because of the next-cell pointers and per-cell palloc overhead, and it's very cache-unfriendly if the cons cells end up scattered around rather than being adjacent. In this rewrite, we still have List headers, but the data is in a resizable array of values, with no next-cell links. Now we need at most two palloc's per List, and often only one, since we can allocate some values in the same palloc call as the List header. (Of course, extending an existing List may require repalloc's to enlarge the array. But this involves just O(log N) allocations not O(N).) Of course this is not without downsides. The key difficulty is that addition or deletion of a list entry may now cause other entries to move, which it did not before. For example, that breaks foreach() and sister macros, which historically used a pointer to the current cons-cell as loop state. We can repair those macros transparently by making their actual loop state be an integer list index; the exposed "ListCell " pointer is no longer state carried across loop iterations, but is just a derived value. (In practice, modern compilers can optimize things back to having just one loop state value, at least for simple cases with inline loop bodies.) In principle, this is a semantics change for cases where the loop body inserts or deletes list entries ahead of the current loop index; but I found no such cases in the Postgres code. The change is not at all transparent for code that doesn't use foreach() but chases lists "by hand" using lnext(). The largest share of such code in the backend is in loops that were maintaining "prev" and "next" variables in addition to the current-cell pointer, in order to delete list cells efficiently using list_delete_cell(). However, we no longer need a previous-cell pointer to delete a list cell efficiently. Keeping a next-cell pointer doesn't work, as explained above, but we can improve matters by changing such code to use a regular foreach() loop and then using the new macro foreach_delete_current() to delete the current cell. (This macro knows how to update the associated foreach loop's state so that no cells will be missed in the traversal.) There remains a nontrivial risk of code assuming that a ListCell pointer will remain good over an operation that could now move the list contents. To help catch such errors, list.c can be compiled with a new define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents whenever that could possibly happen. This makes list operations significantly more expensive so it's not normally turned on (though it is on by default if USE_VALGRIND is on). There are two notable API differences from the previous code: * lnext() now requires the List's header pointer in addition to the current cell's address. * list_delete_cell() no longer requires a previous-cell argument. These changes are somewhat unfortunate, but on the other hand code using either function needs inspection to see if it is assuming anything it shouldn't, so it's not all bad. Programmers should be aware of these significant performance changes: * list_nth() and related functions are now O(1); so there's no major access-speed difference between a list and an array. * Inserting or deleting a list element now takes time proportional to the distance to the end of the list, due to moving the array elements. (However, it typically doesn't require palloc or pfree, so except in long lists it's probably still faster than before.) Notably, lcons() used to be about the same cost as lappend(), but that's no longer true if the list is long. Code that uses lcons() and list_delete_first() to maintain a stack might usefully be rewritten to push and pop at the end of the list rather than the beginning. * There are now list_insert_nth...() and list_delete_nth...() functions that add or remove a list cell identified by index. These have the data-movement penalty explained above, but there's no search penalty. * list_concat() and variants now copy the second list's data into storage belonging to the first list, so there is no longer any sharing of cells between the input lists. The second argument is now declared "const List *" to reflect that it isn't changed. This patch just does the minimum needed to get the new implementation in place and fix bugs exposed by the regression tests. As suggested by the foregoing, there's a fair amount of followup work remaining to do. Also, the ENABLE_LIST_COMPAT macros are finally removed in this commit. Code using those should have been gone a dozen years ago. Patch by me; thanks to David Rowley, Jesper Pedersen, and others for review. Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us	2019-07-15 13:41:58 -04:00
Thomas Munro	b91dd9de5e	Forward received condition variable signals on cancel. After a process decides not to wait for a condition variable, it can still consume a signal before it reaches ConditionVariableCancelSleep(). In that case, pass the signal on to another waiter if possible, so that a signal doesn't go missing when there is another process ready to receive it. Author: Thomas Munro Reviewed-by: Shawn Debnath Discussion: https://postgr.es/m/CA%2BhUKGLQ_RW%2BXs8znDn36e-%2Bmq2--zrPemBqTQ8eKT-VO1OF4Q%40mail.gmail.com	2019-07-13 14:50:18 +12:00
Thomas Munro	1321509fa4	Introduce timed waits for condition variables. Provide ConditionVariableTimedSleep(), like ConditionVariableSleep() but with a timeout argument. Author: Shawn Debnath Reviewed-by: Kyotaro Horiguchi, Thomas Munro Discussion: https://postgr.es/m/eeb06007ccfe46e399df6af18bfcd15a@EX13D05UWC002.ant.amazon.com	2019-07-13 13:51:05 +12:00
Michael Paquier	6b8548964b	Fix inconsistencies in the code This addresses a couple of issues in the code: - Typos and inconsistencies in comments and function declarations. - Removal of unreferenced function declarations. - Removal of unnecessary compile flags. - A cleanup error in regressplans.sh. Author: Alexander Lakhin Discussion: https://postgr.es/m/0c991fdf-2670-1997-c027-772a420c4604@gmail.com	2019-07-08 13:15:09 +09:00
Peter Eisentraut	7e9a4c5c3d	Use consistent style for checking return from system calls Use if (something() != 0) error ... instead of just if (something) error ... The latter is not incorrect, but it's a bit confusing and not the common style. Discussion: https://www.postgresql.org/message-id/flat/5de61b6b-8be9-7771-0048-860328efe027%402ndquadrant.com	2019-07-07 15:28:49 +02:00
Michael Paquier	c74d49d41c	Fix many typos and inconsistencies Author: Alexander Lakhin Discussion: https://postgr.es/m/af27d1b3-a128-9d62-46e0-88f424397f44@gmail.com	2019-07-01 10:00:23 +09:00
Thomas Munro	25b93a2967	Remove obsolete comments about sempahores from proc.c. Commit `6753333f` switched from a semaphore-based wait to a latch-based wait for ProcSleep()/ProcWakeup(), but left behind some stray references to semaphores. Back-patch to 9.5. Reviewed-by: Daniel Gustafsson, Michael Paquier Discussion: https://postgr.es/m/CA+hUKGLs5H6zhmgTijZ1OaJvC1sG0=AFXc1aHuce32tKiQrdEA@mail.gmail.com	2019-06-21 10:57:07 +12:00
Michael Paquier	3412030205	Fix more typos and inconsistencies in the tree Author: Alexander Lakhin Discussion: https://postgr.es/m/0a5419ea-1452-a4e6-72ff-545b1a5a8076@gmail.com	2019-06-17 16:13:16 +09:00
Amit Kapila	92c4abc736	Fix assorted inconsistencies. There were a number of issues in the recent commits which include typos, code and comments mismatch, leftover function declarations. Fix them. Reported-by: Alexander Lakhin Author: Alexander Lakhin, Amit Kapila and Amit Langote Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com	2019-06-08 08:16:38 +05:30
Michael Paquier	1fb6f62a84	Fix typos in various places Author: Andrea Gelmini Reviewed-by: Michael Paquier, Justin Pryzby Discussion: https://postgr.es/m/20190528181718.GA39034@glet	2019-06-03 13:44:03 +09:00
Alvaro Herrera	d890fa812d	Make one message just like all its siblings.	2019-05-28 23:44:22 -04:00
Thomas Munro	7988cb446d	Fix typos. Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/CA%2BhUKGJFWXmtYo6Frd77RR8YXCHz7hJ2mRy5aHV%3D7fJOqDnBHA%40mail.gmail.com	2019-05-24 12:00:59 +12:00
Tom Lane	8255c7a5ee	Phase 2 pgindent run for v12. Switch to 2.1 version of pg_bsd_indent. This formats multiline function declarations "correctly", that is with additional lines of parameter declarations indented to match where the first line's left parenthesis is. Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com	2019-05-22 13:04:48 -04:00
Tom Lane	be76af171c	Initial pgindent run for v12. This is still using the 2.0 version of pg_bsd_indent. I thought it would be good to commit this separately, so as to document the differences between 2.0 and 2.1 behavior. Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us	2019-05-22 12:55:34 -04:00
Tom Lane	93f03dad82	Make BufFileCreateTemp() ensure that temp tablespaces are set up. If PrepareTempTablespaces() has never been called in the current transaction, OpenTemporaryFile() will fall back to using the default tablespace, which is a bug if the user wanted temp files placed elsewhere. gistInitBuildBuffers() appears to have this disease already, and it seems like an easy trap for future coders to fall into. We discussed other ways to close this gap, but none of them are prettier or more reliable than just having BufFileCreateTemp do it. In particular, having fd.c do this creates layering issues that we could do without. Per suggestion from Melanie Plageman. Arguably this is a bug fix, but nobody seems very excited about back-patching, so change in HEAD only. Discussion: https://postgr.es/m/CAAKRu_YwzjuGAmmaw4-8XO=OVFGR1QhY_Pq-t3wjb9ribBJb_Q@mail.gmail.com	2019-05-18 13:51:16 -04:00
Andres Freund	7f44ede594	tableam: Don't assume that every AM uses md.c style storage. Previously various parts of the code routed size requests through RelationGetNumberOfBlocks[InFork]. That works if md.c is used by the AM, but not otherwise. Add a tableam callback to return the size of the table. As not every AM will use postgres' BLCKSZ, have it return bytes, and have RelationGetNumberOfBlocksInFork() round the byte size up into blocks. To allow code outside of the AM to determine the actual relation size map InvalidForkNumber the total size of a relation, as not every AM might just need the postgres defined forks. A few users of RelationGetNumberOfBlocks() ought to be converted away from that. One case, the use of it to determine whether a tid is valid, will be fixed in a follow up commit. Others will have to wait for v13. Author: Andres Freund Discussion: https://postgr.es/m/20190423225201.3bbv6tbqzkb5w7cw@alap3.anarazel.de	2019-05-17 18:56:47 -07:00
Peter Geoghegan	ae7291acbc	Standardize ItemIdData terminology. The term "item pointer" should not be used to refer to ItemIdData variables, since that is needlessly ambiguous. Only ItemPointerData/ItemPointer variables should be called item pointers. To fix, establish the convention that ItemIdData variables should always be referred to either as "item identifiers" or "line pointers". The term "item identifier" already predominates in docs and translatable messages, and so should be the preferred alternative there. Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com	2019-05-13 15:53:39 -07:00
Thomas Munro	47a338cfcd	Fix SxactGlobalXmin tracking. Commit `bb16aba50` broke the code that maintains SxactGlobalXmin. It could get stuck when a well-timed READ ONLY transaction runs. If SxactGlobalXmin stops advancing, transactions on the FinishedSerializableTransactions queue are never cleaned up, so resources are effectively leaked. Revert that hunk of the commit. Also revert another similar hunk that was probably harmless, but unnecessary and unjustified, relating to the DOOMED flag in case of RO_SAFE early release. Author: Thomas Munro Reported-by: Tom Lane Discussion: https://postgr.es/m/16170.1557251214%40sss.pgh.pa.us	2019-05-09 20:32:26 +12:00
Amit Kapila	7db0cde6b5	Revert "Avoid the creation of the free space map for small heap relations". This feature was using a process local map to track the first few blocks in the relation. The map was reset each time we get the block with enough freespace. It was discussed that it would be better to track this map on a per-relation basis in relcache and then invalidate the same whenever vacuum frees up some space in the page or when FSM is created. The new design would be better both in terms of API design and performance. List of commits reverted, in reverse chronological order: `06c8a5090e` Improve code comments in `b0eaa4c51b`. `13e8643bfc` During pg_upgrade, conditionally skip transfer of FSMs. `6f918159a9` Add more tests for FSM. `9c32e4c350` Clear the local map when not used. `29d108cdec` Update the documentation for FSM behavior.. `08ecdfe7e5` Make FSM test portable. `b0eaa4c51b` Avoid creation of the free space map for small heap relations. Discussion: https://postgr.es/m/20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de	2019-05-07 09:30:24 +05:30
Fujii Masao	978b032d1f	Fix function names in comments. Commit `3eb77eba5a` renamed some functions, but forgot to update some comments referencing to those functions. This commit fixes those function names in the comments. Kyotaro Horiguchi	2019-04-25 23:43:48 +09:00
Alvaro Herrera	0a999e1290	Unify error messages ... for translatability purposes.	2019-04-24 09:26:13 -04:00
Michael Paquier	47ac2033d4	Simplify some ERROR paths clearing wait events and transient files Transient files and wait events get normally cleaned up when seeing an exception (be it in the context of a transaction for a backend or another process like the checkpointer), hence there is little point in complicating error code paths to do this work. This shaves a bit of code, and removes some extra handling with errno which needed to be preserved during the cleanup steps done. Reported-by: Masahiko Sawada Author: Michael Paquier Reviewed-by: Tom Lane, Masahiko Sawada Discussion: https://postgr.es/m/CAD21AoDhHYVq5KkXfkaHhmjA-zJYj-e4teiRAJefvXuKJz1tKQ@mail.gmail.com	2019-04-17 09:51:45 +09:00
Noah Misch	c098509927	Consistently test for in-use shared memory. postmaster startup scrutinizes any shared memory segment recorded in postmaster.pid, exiting if that segment matches the current data directory and has an attached process. When the postmaster.pid file was missing, a starting postmaster used weaker checks. Change to use the same checks in both scenarios. This increases the chance of a startup failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1 postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster will no longer stop if shmat() of an old segment fails with EACCES. A postmaster will no longer recycle segments pertaining to other data directories. That's good for production, but it's bad for integration tests that crash a postmaster and immediately delete its data directory. Such a test now leaks a segment indefinitely. No "make check-world" test does that. win32_shmem.c already avoided all these problems. In 9.6 and later, enhance PostgresNode to facilitate testing. Back-patch to 9.4 (all supported versions). Reviewed (in earlier versions) by Daniel Gustafsson and Kyotaro HORIGUCHI. Discussion: https://postgr.es/m/20190408064141.GA2016666@rfd.leadboat.com	2019-04-12 22:36:38 -07:00
Noah Misch	82150a05be	Revert "Consistently test for in-use shared memory." This reverts commits `2f932f71d9`, `16ee6eaf80` and `6f0e190056`. The buildfarm has revealed several bugs. Back-patch like the original commits. Discussion: https://postgr.es/m/20190404145319.GA1720877@rfd.leadboat.com	2019-04-05 00:00:52 -07:00
Thomas Munro	794c543b17	Fix bugs in mdsyncfiletag(). Commit `3eb77eba` moved a _mdfd_getseg() call from mdsync() into a new callback function mdsyncfiletag(), but didn't get the arguments quite right. Without the EXTENSION_DONT_CHECK_SIZE flag we fail to open a segment if lower-numbered segments have been truncated, and it wants a block number rather than a segment number. While comparing with the older coding, also remove an unnecessary clobbering of errno, and adjust the code in mdunlinkfiletag() to ressemble the original code from mdpostckpt() more closely instead of using an unnecessary call to smgropen(). Author: Thomas Munro Discussion: https://postgr.es/m/CA%2BhUKGL%2BYLUOA0eYiBXBfwW%2BbH5kFgh94%3DgQH0jHEJ-t5Y91wQ%40mail.gmail.com	2019-04-05 17:41:58 +13:00
Thomas Munro	3eb77eba5a	Refactor the fsync queue for wider use. Previously, md.c and checkpointer.c were tightly integrated so that fsync calls could be handed off and processed in the background. Introduce a system of callbacks and file tags, so that other modules can hand off fsync work in the same way. For now only md.c uses the new interface, but other users are being proposed. Since there may be use cases that are not strictly SMGR implementations, use a new function table for sync handlers rather than extending the traditional SMGR one. Instead of using a bitmapset of segment numbers for each RelFileNode in the checkpointer's hash table, make the segment number part of the key. This requires sending explicit "forget" requests for every segment individually when relations are dropped, but suits the file layout schemes of proposed future users better (ie sparse or high segment numbers). Author: Shawn Debnath and Thomas Munro Reviewed-by: Thomas Munro, Andres Freund Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com	2019-04-04 23:38:38 +13:00
Noah Misch	2f932f71d9	Consistently test for in-use shared memory. postmaster startup scrutinizes any shared memory segment recorded in postmaster.pid, exiting if that segment matches the current data directory and has an attached process. When the postmaster.pid file was missing, a starting postmaster used weaker checks. Change to use the same checks in both scenarios. This increases the chance of a startup failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1 postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster will no longer recycle segments pertaining to other data directories. That's good for production, but it's bad for integration tests that crash a postmaster and immediately delete its data directory. Such a test now leaks a segment indefinitely. No "make check-world" test does that. win32_shmem.c already avoided all these problems. In 9.6 and later, enhance PostgresNode to facilitate testing. Back-patch to 9.4 (all supported versions). Reviewed by Daniel Gustafsson and Kyotaro HORIGUCHI. Discussion: https://postgr.es/m/20130911033341.GD225735@tornado.leadboat.com	2019-04-03 17:03:46 -07:00
Alvaro Herrera	e8abf97af7	Prevent use of uninitialized variable Per buildfarm member longfin.	2019-04-02 16:03:26 -03:00
Alvaro Herrera	ab0dfc961b	Report progress of CREATE INDEX operations This uses the progress reporting infrastructure added by `c16dc1aca5`, adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY. There are two pieces to this: one is index-AM-agnostic, and the other is AM-specific. The latter is fairly elaborate for btrees, including reportage for parallel index builds and the separate phases that btree index creation uses; other index AMs, which are much simpler in their building procedures, have simplistic reporting only, but that seems sufficient, at least for non-concurrent builds. The index-AM-agnostic part is fairly complete, providing insight into the CONCURRENTLY wait phases as well as block-based progress during the index validation table scan. (The index validation index scan requires patching each AM, which has not been included here.) Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql	2019-04-02 15:18:08 -03:00
Thomas Munro	2fc7af5e96	Add basic infrastructure for 64 bit transaction IDs. Instead of inferring epoch progress from xids and checkpoints, introduce a 64 bit FullTransactionId type and use it to track xid generation. This fixes an unlikely bug where the epoch is reported incorrectly if the range of active xids wraps around more than once between checkpoints. The only user-visible effect of this commit is to correct the epoch used by txid_current() and txid_status(), also visible with pg_controldata, in those rare circumstances. It also creates some basic infrastructure so that later patches can use 64 bit transaction IDs in more places. The new type is a struct that we pass by value, as a form of strong typedef. This prevents the sort of accidental confusion between TransactionId and FullTransactionId that would be possible if we were to use a plain old uint64. Author: Thomas Munro Reported-by: Amit Kapila Reviewed-by: Andres Freund, Tom Lane, Heikki Linnakangas Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com	2019-03-28 18:12:20 +13:00
Tomas Vondra	6ca015f9f0	Track unowned relations in doubly-linked list Relations dropped in a single transaction are tracked in a list of unowned relations. With large number of dropped relations this resulted in poor performance at the end of a transaction, when the relations are removed from the singly linked list one by one. Commit `b4166911` attempted to address this issue (particularly when it happens during recovery) by removing the relations in a reverse order, resulting in O(1) lookups in the list of unowned relations. This did not work reliably, though, and it was possible to trigger the O(N^2) behavior in various ways. Instead of trying to remove the relations in a specific order with respect to the linked list, which seems rather fragile, switch to a regular doubly linked. That allows us to remove relations cheaply no matter where in the list they are. As `b4166911` was a bugfix, backpatched to all supported versions, do the same thing here. Reviewed-by: Alvaro Herrera Discussion: https://www.postgresql.org/message-id/flat/80c27103-99e4-1d0c-642c-d9f3b94aaa0a%402ndquadrant.com Backpatch-through: 9.4	2019-03-27 02:39:39 +01:00
Peter Eisentraut	481018f280	Add macro to cast away volatile without allowing changes to underlying type This adds unvolatize(), which works just like unconstify() but for volatile. Discussion: https://www.postgresql.org/message-id/flat/7a5cbea7-b8df-e910-0f10-04014bcad701%402ndquadrant.com	2019-03-25 09:37:03 +01:00
Amit Kapila	06c8a5090e	Improve code comments in `b0eaa4c51b`. Author: John Naylor Discussion: https://postgr.es/m/CACPNZCswjyGJxTT=mxHgK=Z=mJ9uJ4WEx_UO=bNwpR_i0EaHHg@mail.gmail.com	2019-03-16 06:55:56 +05:30
Thomas Munro	bb16aba50c	Enable parallel query with SERIALIZABLE isolation. Previously, the SERIALIZABLE isolation level prevented parallel query from being used. Allow the two features to be used together by sharing the leader's SERIALIZABLEXACT with parallel workers. An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to share, and new logic is introduced to coordinate the early release of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE optimization, as follows: The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by some other transaction) will 'partially release' the SERIALIZABLEXACT, meaning that the conflicts and locks it holds are released, but the SERIALIZABLEXACT itself will remain active because other backends might still have a pointer to it. Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears its own MySerializableXact variable and frees local resources so that it can skip SSI checks for the rest of the transaction. In the special case of the leader process, it transfers the SERIALIZABLEXACT to a new variable SavedSerializableXact, so that it can be completely released at the end of the transaction after all workers have exited. Remove the serializable_okay flag added to CreateParallelContext() by commit `9da0cc35`, because it's now redundant. Author: Thomas Munro Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com	2019-03-15 17:47:04 +13:00
Alvaro Herrera	af38498d4c	Move hash_any prototype from access/hash.h to utils/hashutils.h ... as well as its implementation from backend/access/hash/hashfunc.c to backend/utils/hash/hashfn.c. access/hash is the place for the hash index AM, not really appropriate for generic facilities, which is what hash_any is; having things the old way meant that anything using hash_any had to include the AM's include file, pointlessly polluting its namespace with unrelated, unnecessary cruft. Also move the HTEqual strategy number to access/stratnum.h from access/hash.h. To avoid breaking third-party extension code, add an #include "utils/hashutils.h" to access/hash.h. (An easily removed line by committers who enjoy their asbestos suits to protect them from angry extension authors.) Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql	2019-03-11 13:17:50 -03:00
Magnus Hagander	6b9e875f72	Track block level checksum failures in pg_stat_database This adds a column that counts how many checksum failures have occurred on files belonging to a specific database. Both checksum failures during normal backend processing and those created when a base backup detects a checksum failure are counted. Author: Magnus Hagander Reviewed by: Julien Rouhaud	2019-03-09 10:47:30 -08:00
Michael Paquier	82a5649fb9	Tighten use of OpenTransientFile and CloseTransientFile This fixes two sets of issues related to the use of transient files in the backend: 1) OpenTransientFile() has been used in some code paths with read-write flags while read-only is sufficient, so switch those calls to be read-only where necessary. These have been reported by Joe Conway. 2) When opening transient files, it is up to the caller to close the file descriptors opened. In error code paths, CloseTransientFile() gets called to clean up things before issuing an error. However in normal exit paths, a lot of callers of CloseTransientFile() never actually reported errors, which could leave a file descriptor open without knowing about it. This is an issue I complained about a couple of times, but never had the courage to write and submit a patch, so here we go. Note that one frontend code path is impacted by this commit so as an error is issued when fetching control file data, making backend and frontend to be treated consistently. Reported-by: Joe Conway, Michael Paquier Author: Michael Paquier Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com	2019-03-09 08:50:55 +09:00
Thomas Munro	91595f9d49	Drop the vestigial "smgr" type. Before commit `3fa2bb31` this type appeared in the catalogs to select which of several block storage mechanisms each relation used. New features under development propose to revive the concept of different block storage managers for new kinds of data accessed via bufmgr.c, but don't need to put references to them in the catalogs. So, avoid useless maintenance work on this type by dropping it. Update some regression tests that were referencing it where any type would do. Discussion: https://postgr.es/m/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com	2019-03-07 15:44:04 +13:00
Peter Eisentraut	278584b526	Remove volatile from latch API This was no longer useful since the latch functions use memory barriers already, which are also compiler barriers, and volatile does not help with cross-process access. Discussion: https://www.postgresql.org/message-id/flat/20190218202511.qsfpuj5sy4dbezcw%40alap3.anarazel.de#18783c27d73e9e40009c82f6e0df0974	2019-03-04 11:30:41 +01:00
Amit Kapila	9c32e4c350	Clear the local map when not used. After commit `b0eaa4c51b`, we use a local map of pages to find the required space for small relations. We do clear this map when we have found a block with enough free space, when we extend the relation, or on transaction abort so that it can be used next time. However, we miss to clear it when we didn't find any pages to try from the map which leads to an assertion failure when we later tried to use it after relation extension. In the passing, I have improved some comments in this area. Reported-by: Tom Lane based on buildfarm results Author: Amit Kapila Reviewed-by: John Naylor Tested-by: Kuntal Ghosh Discussion: https://postgr.es/m/32368.1551114120@sss.pgh.pa.us	2019-03-01 07:38:47 +05:30
Michael Paquier	effe7d9552	Make release of 2PC identifier and locks consistent in COMMIT PREPARED When preparing a transaction in two-phase commit, a dummy PGPROC entry holding the GID used for the transaction is registered, which gets released once COMMIT PREPARED is run. Prior releasing its shared memory state, all the locks taken in the prepared transaction are released using a dedicated set of callbacks (pgstat and multixact having similar callbacks), which may cause the locks to be released before the GID is set free. Hence, there is a small window where lock conflicts could happen, for example: - Transaction A releases its locks, still holding its GID in shared memory. - Transaction B held a lock which conflicted with locks of transaction A. - Transaction B continues its processing, reusing the same GID as transaction A. - Transaction B fails because of a conflicting GID, already in use by transaction A. This commit changes the shared memory state release so as post-commit callbacks and predicate lock cleanup happen consistently with the shared memory state cleanup for the dummy PGPROC entry. The race window is small and 2PC had this issue from the start, so no backpatch is done. On top if that fixes discussed involved ABI breakages, which are not welcome in stable branches. Reported-by: Oleksii Kliukin, Ildar Musin Diagnosed-by: Oleksii Kliukin, Ildar Musin Author: Michael Paquier Reviewed-by: Masahiko Sawada, Oleksii Kliukin Discussion: https://postgr.es/m/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6@hintbits.com	2019-02-25 14:19:34 +09:00
Thomas Munro	f16735d80d	Tolerate EINVAL when calling fsync() on a directory. Previously, we tolerated EBADF as a way for the operating system to indicate that it doesn't support fsync() on a directory. Tolerate EINVAL too, for older versions of Linux CIFS. Bug #15636. Back-patch all the way. Reported-by: John Klann Discussion: https://postgr.es/m/15636-d380890dafd78fc6@postgresql.org	2019-02-24 23:50:20 +13:00
Thomas Munro	483520eca4	Tolerate ENOSYS failure from sync_file_range(). One unintended consequence of commit `9ccdd7f6` was that Windows WSL users started getting a panic whenever we tried to initiate data flushing with sync_file_range(), because WSL does not implement that system call. Previously, they got a stream of periodic warnings, which was also undesirable but at least ignorable. Prevent the panic by handling ENOSYS specially and skipping the panic promotion with data_sync_elevel(). Also suppress future attempts after the first such failure so that the pre-existing problem of noisy warnings is improved. Back-patch to 9.6 (older branches were not affected in this way by `9ccdd7f6`). Author: Thomas Munro and James Sewell Tested-by: James Sewell Reported-by: Bruce Klein Discussion: https://postgr.es/m/CA+mCpegfOUph2U4ZADtQT16dfbkjjYNJL1bSTWErsazaFjQW9A@mail.gmail.com	2019-02-24 22:37:20 +13:00
Thomas Munro	0b55aaacec	Fix race in dsm_unpin_segment() when handles are reused. Teach dsm_unpin_segment() to skip segments that are in the process of being destroyed by another backend, when searching for a handle. Such a segment cannot possibly be the one we are looking for, even if its handle matches. Another slot might hold a recently created segment that has the same handle value by coincidence, and we need to keep searching for that one. The bug caused rare "cannot unpin a segment that is not pinned" errors on 10 and 11. Similar to commit `6c0fb941` for dsm_attach(). Back-patch to 10, where dsm_unpin_segment() landed. Author: Thomas Munro Reported-by: Justin Pryzby Tested-by: Justin Pryzby (along with other recent DSA/DSM fixes) Discussion: https://postgr.es/m/20190216023854.GF30291@telsasoft.com	2019-02-18 09:58:29 +13:00
Thomas Munro	6c0fb94189	Fix race in dsm_attach() when handles are reused. DSM handle values can be reused as soon as the underlying shared memory object has been destroyed. That means that for a brief moment we might have two DSM slots with the same handle. While trying to attach, if we encounter a slot with refcnt == 1, meaning that it is currently being destroyed, we should continue our search in case the same handle exists in another slot. The race manifested as a rare "dsa_area could not attach to segment" error, and was more likely in 10 and 11 due to the lack of distinct seed for random() in parallel workers. It was made very unlikely in in master by commit `197e4af9`, and older releases don't usually create new DSM segments in background workers so it was also unlikely there. This fixes the root cause of bug report #15585, in which the error could also sometimes result in a self-deadlock in the error path. It's not yet clear if further changes are needed to avoid that failure mode. Back-patch to 9.4, where dsm.c arrived. Author: Thomas Munro Reported-by: Justin Pryzby, Sergei Kornilov Discussion: https://postgr.es/m/20190207014719.GJ29720@telsasoft.com Discussion: https://postgr.es/m/15585-324ff6a93a18da46@postgresql.org	2019-02-15 14:05:09 +13:00
Michael Paquier	ea92368cd1	Move max_wal_senders out of max_connections for connection slot handling Since its introduction, max_wal_senders is counted as part of max_connections when it comes to define how many connection slots can be used for replication connections with a WAL sender context. This can lead to confusion for some users, as it could be possible to block a base backup or replication from happening because other backend sessions are already taken for other purposes by an application, and superuser-only connection slots are not a correct solution to handle that case. This commit makes max_wal_senders independent of max_connections for its handling of PGPROC entries in ProcGlobal, meaning that connection slots for WAL senders are handled using their own free queue, like autovacuum workers and bgworkers. One compatibility issue that this change creates is that a standby now requires to have a value of max_wal_senders at least equal to its primary. So, if a standby created enforces the value of max_wal_senders to be lower than that, then this could break failovers. Normally this should not be an issue though, as any settings of a standby are inherited from its primary as postgresql.conf gets normally copied as part of a base backup, so parameters would be consistent. Author: Alexander Kukushkin Reviewed-by: Kyotaro Horiguchi, Petr Jelínek, Masahiko Sawada, Oleksii Kliukin Discussion: https://postgr.es/m/CAFh8B=nBzHQeYAu0b8fjK-AF1X4+_p6GRtwG+cCgs6Vci2uRuQ@mail.gmail.com	2019-02-12 10:07:56 +09:00
Amit Kapila	b0eaa4c51b	Avoid creation of the free space map for small heap relations, take 2. Previously, all heaps had FSMs. For very small tables, this means that the FSM took up more space than the heap did. This is wasteful, so now we refrain from creating the FSM for heaps with 4 pages or fewer. If the last known target block has insufficient space, we still try to insert into some other page before giving up and extending the relation, since doing otherwise leads to table bloat. Testing showed that trying every page penalized performance slightly, so we compromise and try every other page. This way, we visit at most two pages. Any pages with wasted free space become visible at next relation extension, so we still control table bloat. As a bonus, directly attempting one or two pages can even be faster than consulting the FSM would have been. Once the FSM is created for a heap we don't remove it even if somebody deletes all the rows from the corresponding relation. We don't think it is a useful optimization as it is quite likely that relation will again grow to the same size. Author: John Naylor, Amit Kapila Reviewed-by: Amit Kapila Tested-by: Mithun C Y Discussion: https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ@mail.gmail.com	2019-02-04 07:49:15 +05:30
Thomas Munro	f1bebef60e	Add shared_memory_type GUC. Since 9.3 we have used anonymous shared mmap for our main shared memory region, except in EXEC_BACKEND builds. Provide a GUC so that users can opt for System V shared memory once again, like in 9.2 and earlier. A later patch proposes to add huge/large page support for AIX, which requires System V shared memory and provided the motivation to revive this possibility. It may also be useful on some BSDs. Author: Andres Freund (revived and documented by Thomas Munro) Discussion: https://postgr.es/m/HE1PR0202MB28126DB4E0B6621CC6A1A91286D90%40HE1PR0202MB2812.eurprd02.prod.outlook.com Discussion: https://postgr.es/m/2AE143D2-87D3-4AD1-AC78-CE2258230C05%40FreeBSD.org	2019-02-03 12:47:26 +01:00
Amit Kapila	a23676503b	Revert "Avoid creation of the free space map for small heap relations." This reverts commit `ac88d2962a`.	2019-01-28 11:31:44 +05:30
Amit Kapila	ac88d2962a	Avoid creation of the free space map for small heap relations. Previously, all heaps had FSMs. For very small tables, this means that the FSM took up more space than the heap did. This is wasteful, so now we refrain from creating the FSM for heaps with 4 pages or fewer. If the last known target block has insufficient space, we still try to insert into some other page before giving up and extending the relation, since doing otherwise leads to table bloat. Testing showed that trying every page penalized performance slightly, so we compromise and try every other page. This way, we visit at most two pages. Any pages with wasted free space become visible at next relation extension, so we still control table bloat. As a bonus, directly attempting one or two pages can even be faster than consulting the FSM would have been. Once the FSM is created for a heap we don't remove it even if somebody deletes all the rows from the corresponding relation. We don't think it is a useful optimization as it is quite likely that relation will again grow to the same size. Author: John Naylor with design inputs and some code contribution by Amit Kapila Reviewed-by: Amit Kapila Tested-by: Mithun C Y Discussion: https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ@mail.gmail.com	2019-01-28 08:14:06 +05:30
Amit Kapila	d66e3664b8	In bootstrap mode, don't allow the creation of files if they don't already exist. In commit's `b9d01fe288` and `3908473c80`, we have added some code where we allowed the creation of files during mdopen even if they didn't exist during the bootstrap mode. The later commit obviates the need for same. This was harmless code till now but with an upcoming feature where we don't allow to create FSM for small tables, this will needlessly create FSM files. Author: John Naylor Reviewed-by: Amit Kapila Discussion: https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ@mail.gmail.com https://www.postgresql.org/message-id/CAA4eK1KsET6sotf+rzOTQfb83pzVEzVhbQi1nxGFYVstVWXUGw@mail.gmail.com	2019-01-28 07:52:51 +05:30
Andres Freund	c91560defc	Move remaining code from tqual.[ch] to heapam.h / heapam_visibility.c. Given these routines are heap specific, and that there will be more generic visibility support in via table AM, it makes sense to move the prototypes to heapam.h (routines like HeapTupleSatisfiesVacuum will not be exposed in a generic fashion, because they are too storage specific). Similarly, the code in tqual.c is specific to heap, so moving it into access/heap/ makes sense. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-01-21 17:07:10 -08:00
Andres Freund	e7cc78ad43	Remove superfluous tqual.h includes. Most of these had been obsoleted by `568d4138c` / the SnapshotNow removal. This is is preparation for moving most of tqual.[ch] into either snapmgr.h or heapam.h, which in turn is in preparation for pluggable table AMs. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-01-21 12:15:02 -08:00
Andres Freund	e0c4ec0728	Replace uses of heap_open et al with the corresponding table_* function. Author: Andres Freund Discussion: https://postgr.es/m/20190111000539.xbv7s6w7ilcvm7dp@alap3.anarazel.de	2019-01-21 10:51:37 -08:00
Andres Freund	111944c5ee	Replace heapam.h includes with {table, relation}.h where applicable. A lot of files only included heapam.h for relation_open, heap_open etc - replace the heapam.h include in those files with the narrower header. Author: Andres Freund Discussion: https://postgr.es/m/20190111000539.xbv7s6w7ilcvm7dp@alap3.anarazel.de	2019-01-21 10:51:37 -08:00
Michael Paquier	5d59a6c5ea	Fix grammar mistakes in md.c Author: Kirk Jamison Discussion: https://postgr.es/m/D09B13F772D2274BB348A310EE3027C640AC54@g01jpexmbkw24	2019-01-10 09:36:25 +09:00
Bruce Momjian	97c39498e5	Update copyright for 2019 Backpatch-through: certain files through 9.4	2019-01-02 12:44:25 -05:00
Michael Paquier	1707a0d2aa	Remove configure switch --disable-strong-random This removes a portion of infrastructure introduced by `fe0a0b5` to allow compilation of Postgres in environments where no strong random source is available, meaning that there is no linking to OpenSSL and no /dev/urandom (Windows having its own CryptoAPI). No systems shipped this century lack /dev/urandom, and the buildfarm is actually not testing this switch at all, so just remove it. This simplifies particularly some backend code which included a fallback implementation using shared memory, and removes a set of alternate regression output files from pgcrypto. Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/20181230063219.GG608@paquier.xyz	2019-01-01 20:05:51 +09:00
Peter Geoghegan	1a990b207b	Have BufFileSize() ereport() on FileSize() failure. Move the responsibility for checking for and reporting a failure from the only current BufFileSize() caller, logtape.c, to BufFileSize() itself. Code within buffile.c is generally responsible for interfacing with fd.c to report irrecoverable failures. This seems like a convention that's worth sticking to. Reorganizing things this way makes it easy to make the error message raised in the event of BufFileSize() failure descriptive of the underlying problem. We're now clear on the distinction between temporary file name and BufFile name, and can show errno, confident that its value actually relates to the error being reported. In passing, an existing, similar buffile.c ereport() + errcode_for_file_access() site is changed to follow the same conventions. The API of the function BufFileSize() is changed by this commit, despite already being in a stable release (Postgres 11). This seems acceptable, since the BufFileSize() ABI was changed by commit `aa55183042`, which hasn't made it into a point release yet. Besides, it's difficult to imagine a third party BufFileSize() caller not just raising an error anyway, since BufFile state should be considered corrupt when BufFileSize() fails. Per complaint from Tom Lane. Discussion: https://postgr.es/m/26974.1540826748@sss.pgh.pa.us Backpatch: 11-, where shared BufFiles were introduced.	2018-11-28 14:42:54 -08:00
Thomas Munro	d67dae036b	Don't count zero-filled buffers as 'read' in EXPLAIN. If you extend a relation, it should count as a block written, not read (we write a zero-filled block). If you ask for a zero-filled buffer, it shouldn't be counted as read or written. Later we might consider counting zero-filled buffers with a separate counter, if they become more common due to future work. Author: Thomas Munro Reviewed-by: Haribabu Kommi, Kyotaro Horiguchi, David Rowley Discussion: https://postgr.es/m/CAEepm%3D3JytB3KPpvSwXzkY%2Bdwc5zC8P8Lk7Nedkoci81_0E9rA%40mail.gmail.com	2018-11-28 11:58:10 +13:00
Thomas Munro	cfdf4dc4fc	Add WL_EXIT_ON_PM_DEATH pseudo-event. Users of the WaitEventSet and WaitLatch() APIs can now choose between asking for WL_POSTMASTER_DEATH and then handling it explicitly, or asking for WL_EXIT_ON_PM_DEATH to trigger immediate exit on postmaster death. This reduces code duplication, since almost all callers want the latter. Repair all code that was previously ignoring postmaster death completely, or requesting the event but ignoring it, or requesting the event but then doing an unconditional PostmasterIsAlive() call every time through its event loop (which is an expensive syscall on platforms for which we don't have USE_POSTMASTER_DEATH_SIGNAL support). Assert that callers of WaitLatchXXX() under the postmaster remember to ask for either WL_POSTMASTER_DEATH or WL_EXIT_ON_PM_DEATH, to prevent future bugs. The only process that doesn't handle postmaster death is syslogger. It waits until all backends holding the write end of the syslog pipe (including the postmaster) have closed it by exiting, to be sure to capture any parting messages. By using the WaitEventSet API directly it avoids the new assertion, and as a by-product it may be slightly more efficient on platforms that have epoll(). Author: Thomas Munro Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas, Tom Lane Discussion: https://postgr.es/m/CAEepm%3D1TCviRykkUb69ppWLr_V697rzd1j3eZsRMmbXvETfqbQ%40mail.gmail.com, https://postgr.es/m/CAEepm=2LqHzizbe7muD7-2yHUbTOoF7Q+qkSD5Q41kuhttRTwA@mail.gmail.com	2018-11-23 20:46:34 +13:00
Andres Freund	578b229718	Remove WITH OIDS support, change oid catalog column visibility. Previously tables declared WITH OIDS, including a significant fraction of the catalog tables, stored the oid column not as a normal column, but as part of the tuple header. This special column was not shown by default, which was somewhat odd, as it's often (consider e.g. pg_class.oid) one of the more important parts of a row. Neither pg_dump nor COPY included the contents of the oid column by default. The fact that the oid column was not an ordinary column necessitated a significant amount of special case code to support oid columns. That already was painful for the existing, but upcoming work aiming to make table storage pluggable, would have required expanding and duplicating that "specialness" significantly. WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0). Remove it. Removing includes: - CREATE TABLE and ALTER TABLE syntax for declaring the table to be WITH OIDS has been removed (WITH (oids[ = true]) will error out) - pg_dump does not support dumping tables declared WITH OIDS and will issue a warning when dumping one (and ignore the oid column). - restoring an pg_dump archive with pg_restore will warn when restoring a table with oid contents (and ignore the oid column) - COPY will refuse to load binary dump that includes oids. - pg_upgrade will error out when encountering tables declared WITH OIDS, they have to be altered to remove the oid column first. - Functionality to access the oid of the last inserted row (like plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed. The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false) for CREATE TABLE) is still supported. While that requires a bit of support code, it seems unnecessary to break applications / dumps that do not use oids, and are explicit about not using them. The biggest user of WITH OID columns was postgres' catalog. This commit changes all 'magic' oid columns to be columns that are normally declared and stored. To reduce unnecessary query breakage all the newly added columns are still named 'oid', even if a table's column naming scheme would indicate 'reloid' or such. This obviously requires adapting a lot code, mostly replacing oid access via HeapTupleGetOid() with access to the underlying Form_pg_->oid column. The bootstrap process now assigns oids for all oid columns in genbki.pl that do not have an explicit value (starting at the largest oid previously used), only oids assigned later by oids will be above FirstBootstrapObjectId. As the oid column now is a normal column the special bootstrap syntax for oids has been removed. Oids are not automatically assigned during insertion anymore, all backend code explicitly assigns oids with GetNewOidWithIndex(). For the rare case that insertions into the catalog via SQL are called for the new pg_nextoid() function can be used (which only works on catalog tables). The fact that oid columns on system tables are now normal columns means that they will be included in the set of columns expanded by (i.e. SELECT * FROM pg_class will now include the table's oid, previously it did not). It'd not technically be hard to hide oid column by default, but that'd mean confusing behavior would either have to be carried forward forever, or it'd cause breakage down the line. While it's not unlikely that further adjustments are needed, the scope/invasiveness of the patch makes it worthwhile to get merge this now. It's painful to maintain externally, too complicated to commit after the code code freeze, and a dependency of a number of other patches. Catversion bump, for obvious reasons. Author: Andres Freund, with contributions by John Naylor Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de	2018-11-20 16:00:17 -08:00
Tom Lane	ffa4cbd623	Handle EPIPE more sanely when we close a pipe reading from a program. Previously, any program launched by COPY TO/FROM PROGRAM inherited the server's setting of SIGPIPE handling, i.e. SIG_IGN. Hence, if we were doing COPY FROM PROGRAM and closed the pipe early, the child process would see EPIPE on its output file and typically would treat that as a fatal error, in turn causing the COPY to report error. Similarly, one could get a failure report from a query that didn't read all of the output from a contrib/file_fdw foreign table that uses file_fdw's PROGRAM option. To fix, ensure that child programs inherit SIG_DFL not SIG_IGN processing of SIGPIPE. This seems like an all-around better situation since if the called program wants some non-default treatment of SIGPIPE, it would expect to have to set that up for itself. Then in COPY, if it's COPY FROM PROGRAM and we stop reading short of detecting EOF, treat a SIGPIPE exit from the called program as a non-error condition. This still allows us to report an error for any case where the called program gets SIGPIPE on some other file descriptor. As coded, we won't report a SIGPIPE if we stop reading as a result of seeing an in-band EOF marker (e.g. COPY BINARY EOF marker). It's somewhat debatable whether we should complain if the called program continues to transmit data after an EOF marker. However, it seems like we should avoid throwing error in any questionable cases, especially in a back-patched fix, and anyway it would take additional code to make such an error get reported consistently. Back-patch to v10. We could go further back, since COPY FROM PROGRAM has been around awhile, but AFAICS the only way to reach this situation using core or contrib is via file_fdw, which has only supported PROGRAM sources since v10. The COPY statement per se has no feature whereby it'd stop reading without having hit EOF or an error already. Therefore, I don't see any upside to back-patching further that'd outweigh the risk of complaints about behavioral change. Per bug #15449 from Eric Cyr. Patch by me, review by Etsuro Fujita and Kyotaro Horiguchi Discussion: https://postgr.es/m/15449-1cf737dd5929450e@postgresql.org	2018-11-19 17:02:39 -05:00
Thomas Munro	9ccdd7f66e	PANIC on fsync() failure. On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data. Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files. Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic. Back-patch to all supported releases. There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch. Author: Craig Ringer, with some adjustments by Thomas Munro Reported-by: Craig Ringer Reviewed-by: Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de	2018-11-19 17:41:26 +13:00
Thomas Munro	1556cb2fc5	Don't forget about failed fsync() requests. If fsync() fails, md.c must keep the request in its bitmap, so that future attempts will try again. Back-patch to all supported releases. Author: Thomas Munro Reviewed-by: Amit Kapila Reported-by: Andrew Gierth Discussion: https://postgr.es/m/87y3i1ia4w.fsf%40news-spur.riddles.org.uk	2018-11-19 17:41:26 +13:00
Thomas Munro	aa55183042	Use 64 bit type for BufFileSize(). BufFileSize() can't use off_t, because it's only 32 bits wide on some systems. BufFile objects can have many 1GB segments so the total size can exceed 2^31. The only known client of the function is parallel CREATE INDEX, which was reported to fail when building large indexes on Windows. Though this is technically an ABI break on platforms with a 32 bit off_t and we might normally avoid back-patching it, the function is brand new and thus unlikely to have been discovered by extension authors yet, and it's fairly thoroughly broken on those platforms anyway, so just fix it. Defect in `9da0cc35`. Bug #15460. Back-patch to 11, where this function landed. Author: Thomas Munro Reported-by: Paul van der Linden, Pavel Oskin Reviewed-by: Peter Geoghegan Discussion: https://postgr.es/m/15460-b6db80de822fa0ad%40postgresql.org Discussion: https://postgr.es/m/CAHDGBJP_GsESbTt4P3FZA8kMUKuYxjg57XHF7NRBoKnR%3DCAR-g%40mail.gmail.com	2018-11-15 13:13:57 +13:00
Amit Kapila	a53bc135fb	Fix the initialization of atomic variables introduced by the group clearing mechanism. Commits `0e141c0fbb` and `baaf272ac9` introduced initialization of atomic variables in InitProcess which means that it's not safe to look at those for backends that aren't currently in use. Fix that by initializing them during postmaster startup. Reported-by: Andres Freund Author: Amit Kapila Backpatch-through: 9.6 Discussion: https://postgr.es/m/20181027104138.qmbbelopvy7cw2qv@alap3.anarazel.de	2018-11-13 10:52:40 +05:30
Andres Freund	450c7defa6	Remove volatiles from {procarray,volatile}.c and fix memory ordering issue. The use of volatiles in procarray.c largely originated from the time when postgres did not have reliable compiler and memory barriers. That's not the case anymore, so we can do better. Several of the functions in procarray.c can be bottlenecks, and removal of volatile yields mildly better code. The new state, with explicit memory barriers, is also more correct. The previous use of volatile did not actually deliver sufficient guarantees on weakly ordered machines, in particular the logic in GetNewTransactionId() does not look safe. It seems unlikely to be a problem in practice, but worth fixing. Thomas and I independently wrote a patch for this. Reported-By: Andres Freund and Thomas Munro Author: Andres Freund, with cherrypicked changes from a patch by Thomas Munro Discussion: https://postgr.es/m/20181005172955.wyjb4fzcdzqtaxjq@alap3.anarazel.de https://postgr.es/m/CAEepm=1nff0x=7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g@mail.gmail.com	2018-11-10 16:11:57 -08:00
Andres Freund	5fde047f2b	Combine two flag tests in GetSnapshotData(). Previously the code checked PROC_IN_LOGICAL_DECODING and PROC_IN_VACUUM separately. As the relevant variable is marked as volatile, the compiler cannot combine the two tests. As GetSnapshotData() is pretty hot in a number of workloads, it's worthwhile to fix that. It'd also be a good idea to get rid of the volatiles altogether. But for one that's a larger patch, and for another, the code after this change still seems at least as easy to read as before. Author: Andres Freund Discussion: https://postgr.es/m/20181005172955.wyjb4fzcdzqtaxjq@alap3.anarazel.de	2018-11-09 20:43:56 -08:00
Thomas Munro	c4f0876fb8	Remove set-but-unused variable. Clean-up for commit `c24dcd0c`. Reported-by: Andrew Dunstan Discussion: https://postgr.es/m/2d52ff4a-5440-f6f1-7806-423b0e6370cb%402ndQuadrant.com	2018-11-07 12:06:43 +13:00
Thomas Munro	c24dcd0cfd	Use pg_pread() and pg_pwrite() for data files and WAL. Cut down on system calls by doing random I/O using offset-based OS routines where available. Remove the code for tracking the 'virtual' seek position. The only reason left to call FileSeek() was to get the file's size, so provide a new function FileSize() instead. Author: Oskari Saarenmaa, Thomas Munro Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi	2018-11-07 09:51:50 +13:00
Thomas Munro	9e12fb02b7	Remove some remaining traces of dsm_resize(). A couple of obsolete comments and unreachable blocks remained after commit `3c60d0fa`. Discussion: https://postgr.es/m/CAA4eK1%2B%3DyAFUvpFoHXFi_gm8YqmXN-TtkFH%2BVYjvDLS6-SFq-Q%40mail.gmail.com	2018-11-06 21:40:08 +13:00
Thomas Munro	3c60d0fa23	Remove dsm_resize() and dsm_remap(). These interfaces were never used in core, didn't handle failure of posix_fallocate() correctly and weren't supported on all platforms. We agreed to remove them in 12. Author: Thomas Munro Reported-by: Andres Freund Discussion: https://postgr.es/m/CAA4eK1%2B%3DyAFUvpFoHXFi_gm8YqmXN-TtkFH%2BVYjvDLS6-SFq-Q%40mail.gmail.com	2018-11-06 16:11:12 +13:00
Magnus Hagander	fbec7459aa	Fix spelling errors and typos in comments Author: Daniel Gustafsson <daniel@yesql.se>	2018-11-02 13:56:52 +01:00
Andres Freund	62649bad83	Correct constness of a few variables. This allows the compiler / linker to mark affected pages as read-only. There's other cases, but they're a bit more invasive, and should go through some review. These are easy. They were found with objdump -j .data -t src/backend/postgres\|awk '{print $4, $5, $6}'\|sort -r\|less Discussion: https://postgr.es/m/20181015200754.7y7zfuzsoux2c4ya@alap3.anarazel.de	2018-10-15 21:01:14 -07:00
Michael Paquier	1df21ddb19	Avoid duplicate XIDs at recovery when building initial snapshot On a primary, sets of XLOG_RUNNING_XACTS records are generated on a periodic basis to allow recovery to build the initial state of transactions for a hot standby. The set of transaction IDs is created by scanning all the entries in ProcArray. However it happens that its logic never counted on the fact that two-phase transactions finishing to prepare can put ProcArray in a state where there are two entries with the same transaction ID, one for the initial transaction which gets cleared when prepare finishes, and a second, dummy, entry to track that the transaction is still running after prepare finishes. This way ensures a continuous presence of the transaction so as callers of for example TransactionIdIsInProgress() are always able to see it as alive. So, if a XLOG_RUNNING_XACTS takes a standby snapshot while a two-phase transaction finishes to prepare, the record can finish with duplicated XIDs, which is a state expected by design. If this record gets applied on a standby to initial its recovery state, then it would simply fail, so the odds of facing this failure are very low in practice. It would be tempting to change the generation of XLOG_RUNNING_XACTS so as duplicates are removed on the source, but this requires to hold on ProcArrayLock for longer and this would impact all workloads, particularly those using heavily two-phase transactions. XLOG_RUNNING_XACTS is also actually used only to initialize the standby state at recovery, so instead the solution is taken to discard duplicates when applying the initial snapshot. Diagnosed-by: Konstantin Knizhnik Author: Michael Paquier Discussion: https://postgr.es/m/0c96b653-4696-d4b4-6b5d-78143175d113@postgrespro.ru Backpatch-through: 9.3	2018-10-14 22:23:21 +09:00
Michael Paquier	09921f397b	Refactor user-facing SQL functions signalling backends This moves the system administration functions for signalling backends from backend/utils/adt/misc.c into a separate file dedicated to backend signalling. No new functionality is introduced in this commit. Author: Daniel Gustafsson Reviewed-by: Michael Paquier, Álvaro Herrera Discussion: https://postgr.es/m/C2C7C3EC-CC5F-44B6-9C78-637C88BD7D14@yesql.se	2018-10-04 18:27:25 +09:00
Tom Lane	b04aeb0a05	Add assertions that we hold some relevant lock during relation open. Opening a relation with no lock at all is unsafe; there's no guarantee that we'll see a consistent state of the relevant catalog entries. While use of MVCC scans to read the catalogs partially addresses that complaint, it's still possible to switch to a new catalog snapshot partway through loading the relcache entry. Moreover, whether or not you trust the reasoning behind sometimes using less than AccessExclusiveLock for ALTER TABLE, that reasoning is certainly not valid if concurrent users of the table don't hold a lock corresponding to the operation they want to perform. Hence, add some assertion-build-only checks that require any caller of relation_open(x, NoLock) to hold at least AccessShareLock. This isn't a full solution, since we can't verify that the lock level is semantically appropriate for the action --- but it's definitely of some use, because it's already caught two bugs. We can also assert that callers of addRangeTableEntryForRelation() hold at least the lock level specified for the new RTE. Amit Langote and Tom Lane Discussion: https://postgr.es/m/16565.1538327894@sss.pgh.pa.us	2018-10-01 12:43:21 -04:00
Alexander Korotkov	2f39106a20	Replace CAS loop with single TAS in ProcArrayGroupClearXid() Single pg_atomic_exchange_u32() is expected to be faster than loop of pg_atomic_compare_exchange_u32(). Also, it would be consistent with clog group update code. Discussion: https://postgr.es/m/CAPpHfdtxLsC-bqfxFcHswZ91OxXcZVNDBBVfg9tAWU0jvn1tQA%40mail.gmail.com Reviewed-by: Amit Kapila	2018-09-22 16:22:30 +03:00
Tom Lane	8f0de712c3	Don't ignore locktable-full failures in StandbyAcquireAccessExclusiveLock. Commit `37c54863c` removed the code in StandbyAcquireAccessExclusiveLock that checked the return value of LockAcquireExtended. That created a bug, because it's still passing reportMemoryError = false to LockAcquireExtended, meaning that LOCKACQUIRE_NOT_AVAIL will be returned if we're out of shared memory for the lock table. In such a situation, the startup process would believe it had acquired an exclusive lock even though it hadn't, with potentially dire consequences. To fix, just drop the use of reportMemoryError = false, which allows us to simplify the call into a plain LockAcquire(). It's unclear that the locktable-full situation arises often enough that it's worth having a better recovery method than crash-and-restart. (I strongly suspect that the only reason the code path existed at all was that it was relatively simple to do in the pre-37c54863c implementation. But now it's not.) LockAcquireExtended's reportMemoryError parameter is now dead code and could be removed. I refrained from doing so, however, because there was some interest in resurrecting the behavior if we do get reports of locktable-full failures in the field. Also, it seems unwise to remove the parameter concurrently with shipping commit `f868a8143`, which added a parameter; if there are any third-party callers of LockAcquireExtended, we want them to get a wrong-number-of-parameters compile error rather than a possibly-silent misinterpretation of its last parameter. Back-patch to 9.6 where the bug was introduced. Discussion: https://postgr.es/m/6202.1536359835@sss.pgh.pa.us	2018-09-19 12:43:51 -04:00
Thomas Munro	422952ee78	Allow DSM allocation to be interrupted. Chris Travers reported that the startup process can repeatedly try to cancel a backend that is in a posix_fallocate()/EINTR loop and cause it to loop forever. Teach the retry loop to give up if an interrupt is pending. Don't actually check for interrupts in that loop though, because a non-local exit would skip some clean-up code in the caller. Back-patch to 9.4 where DSM was added (and posix_fallocate() was later back-patched). Author: Chris Travers Reviewed-by: Ildar Musin, Murat Kabilov, Oleksii Kliukin Tested-by: Oleksii Kliukin Discussion: https://postgr.es/m/CAN-RpxB-oeZve_J3SM_6%3DHXPmvEG%3DHX%2B9V9pi8g2YR7YW0rBBg%40mail.gmail.com	2018-09-18 22:56:36 +12:00
Michael Paquier	9226a3b89b	Remove duplicated words split across lines in comments This has been detected using some interesting tricks with sed, and the method used is mentioned in details in the discussion below. Author: Justin Pryzby Discussion: https://postgr.es/m/20180908013109.GB15350@telsasoft.com	2018-09-08 12:24:19 -07:00
Tom Lane	f868a8143a	Fix longstanding recursion hazard in sinval message processing. LockRelationOid and sibling routines supposed that, if our session already holds the lock they were asked to acquire, they could skip calling AcceptInvalidationMessages on the grounds that we must have already read any remote sinval messages issued against the relation being locked. This is normally true, but there's a critical special case where it's not: processing inside AcceptInvalidationMessages might attempt to access system relations, resulting in a recursive call to acquire a relation lock. Hence, if the outer call had acquired that same system catalog lock, we'd fall through, despite the possibility that there's an as-yet-unread sinval message for that system catalog. This could, for example, result in failure to access a system catalog or index that had just been processed by VACUUM FULL. This is the explanation for buildfarm failures we've been seeing intermittently for the past three months. The bug is far older than that, but commits `a54e1f158` et al added a new recursion case within AcceptInvalidationMessages that is apparently easier to hit than any previous case. To fix this, we must not skip calling AcceptInvalidationMessages until we have finished a call to it since acquiring a relation lock, not merely acquired the lock. (There's already adequate logic inside AcceptInvalidationMessages to deal with being called recursively.) Fortunately, we can implement that at trivial cost, by adding a flag to LOCALLOCK hashtable entries that tracks whether we know we have completed such a call. There is an API hazard added by this patch for external callers of LockAcquire: if anything is testing for LOCKACQUIRE_ALREADY_HELD, it might be fooled by the new return code LOCKACQUIRE_ALREADY_CLEAR into thinking the lock wasn't already held. This should be a fail-soft condition, though, unless something very bizarre is being done in response to the test. Also, I added an additional output argument to LockAcquireExtended, assuming that that probably isn't called by any outside code given the very limited usefulness of its additional functionality. Back-patch to all supported branches. Discussion: https://postgr.es/m/12259.1532117714@sss.pgh.pa.us	2018-09-07 18:04:54 -04:00
Tom Lane	44cac93464	Avoid using potentially-under-aligned page buffers. There's a project policy against using plain "char buf[BLCKSZ]" local or static variables as page buffers; preferred style is to palloc or malloc each buffer to ensure it is MAXALIGN'd. However, that policy's been ignored in an increasing number of places. We've apparently got away with it so far, probably because (a) relatively few people use platforms on which misalignment causes core dumps and/or (b) the variables chance to be sufficiently aligned anyway. But this is not something to rely on. Moreover, even if we don't get a core dump, we might be paying a lot of cycles for misaligned accesses. To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock that the compiler must allocate with sufficient alignment, and use those in place of plain char arrays. I used these types even for variables where there's no risk of a misaligned access, since ensuring proper alignment should make kernel data transfers faster. I also changed some places where we had been palloc'ing short-lived buffers, for coding style uniformity and to save palloc/pfree overhead. Since this seems to be a live portability hazard (despite the lack of field reports), back-patch to all supported versions. Patch by me; thanks to Michael Paquier for review. Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de	2018-09-01 15:27:17 -04:00
Andres Freund	143290efd0	Introduce minimal C99 usage to verify compiler support. This just converts a few for loops in postgres.c to declare variables in the loop initializer, and uses designated initializers in smgr.c's definition of smgr callbacks. Author: Andres Freund Discussion: https://postgr.es/m/97d4b165-192d-3605-749c-f614a0c4e783@2ndquadrant.com	2018-08-23 18:36:07 -07:00
Michael Paquier	246a6c8f7b	Make autovacuum more aggressive to remove orphaned temp tables Commit `dafa084`, added in 10, made the removal of temporary orphaned tables more aggressive. This commit makes an extra step into the aggressiveness by adding a flag in each backend's MyProc which tracks down any temporary namespace currently in use. The flag is set when the namespace gets created and can be reset if the temporary namespace has been created in a transaction or sub-transaction which is aborted. The flag value assignment is assumed to be atomic, so this can be done in a lock-less fashion like other flags already present in PGPROC like databaseId or backendId, still the fact that the temporary namespace and table created are still locked until the transaction creating those commits acts as a barrier for other backends. This new flag gets used by autovacuum to discard more aggressively orphaned tables by additionally checking for the database a backend is connected to as well as its temporary namespace in-use, removing orphaned temporary relations even if a backend reuses the same slot as one which created temporary relations in a past session. The base idea of this patch comes from Robert Haas, has been written in its first version by Tsunakawa Takayuki, then heavily reviewed by me. Author: Tsunakawa Takayuki Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Andres Freund Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05 Backpatch: 11-, as PGPROC gains a new flag and we don't want silent ABI breakages on already released versions.	2018-08-13 11:49:04 +02:00
Tom Lane	130beba36d	Fix inadequate buffer locking in FSM and VM page re-initialization. When reading an existing FSM or VM page that was found to be corrupt by the buffer manager, the code applied PageInit() to reinitialize the page, but did so without any locking. There is thus a hazard that two backends might concurrently do PageInit, which in itself would still be OK, but the slower one might then zero over subsequent data changes applied by the faster one. Even that is unlikely to be fatal; but it's not desirable, so add locking to prevent it. This does not add any locking overhead in the normal code path where the page is OK. It's not immediately obvious that that's safe, but I believe it is, for reasons explained in the added comments. Problem noted by R P Asim. It's been like this for a long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/CANXE4Te4G0TGq6cr0-TvwP0H4BNiK_-hB5gHe8mF+nz0mcYfMQ@mail.gmail.com	2018-07-13 11:53:10 -04:00
Peter Eisentraut	5e6e2c8773	Reset shmem_exit_inprogress after shmem_exit() In `ad9a274778`, shmem_exit_inprogress was introduced. But we need to reset it after shmem_exit(), because unlike the similar proc_exit(), shmem_exit() can also be called for cleanup when the process will not exit. Reported-by: Andrew Gierth <andrew@tao11.riddles.org.uk>	2018-07-12 20:22:17 +02:00
Alexander Korotkov	a01d0fa1d8	Fix wrong file path in header comment Header comment of shm_mq.c was mistakenly specifying path to shm_mq.h. It was introduced in `ec9037df`. So, theoretically it could be backpatched to 9.4, but it doesn't seem to worth it.	2018-07-11 13:16:46 +03:00
Thomas Munro	f98b8476cd	Use signals for postmaster death on FreeBSD. Use FreeBSD 11.2's new support for detecting parent process death to make PostmasterIsAlive() very cheap, as was done for Linux in an earlier commit. Author: Thomas Munro Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083b5e@iki.fi	2018-07-11 13:14:07 +12:00
Thomas Munro	9f09529952	Use signals for postmaster death on Linux. Linux provides a way to ask for a signal when your parent process dies. Use that to make PostmasterIsAlive() very cheap. Based on a suggestion from Andres Freund. Author: Thomas Munro, Heikki Linnakangas Reviewed-By: Michael Paquier Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083b5e%40iki.fi Discussion: https://postgr.es/m/20180411002643.6buofht4ranhei7k%40alap3.anarazel.de	2018-07-11 12:47:06 +12:00
Peter Eisentraut	bcbd940806	Remove dynamic_shared_memory_type=none PostgreSQL nowadays offers some kind of dynamic shared memory feature on all supported platforms. Having the choice of "none" prevents us from relying on DSM in core features. So this patch removes the choice of "none". Author: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>	2018-07-10 18:35:24 +02:00
Fujii Masao	b41669118c	Improve the performance of relation deletes during recovery. When multiple relations are deleted at the same transaction, the files of those relations are deleted by one call to smgrdounlinkall(), which leads to scan whole shared_buffers only one time. OTOH, previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was called for each file to delete, which led to scan shared_buffers multiple times. Obviously this could cause to increase the WAL replay time very much especially when shared_buffers was huge. To alleviate this situation, this commit changes the recovery so that it also calls smgrdounlinkall() only one time to delete multiple relation files. This is just fix for oversight of commit `279628a0a7`, not new feature. So, per discussion on pgsql-hackers, we concluded to backpatch this to all supported versions. Author: Fujii Masao Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com	2018-07-05 02:23:46 +09:00
Andrew Dunstan	1e9c858090	pgindent run prior to branching	2018-06-30 12:25:49 -04:00
Thomas Munro	a40cff8956	Move RecoveryLockList into a hash table. Standbys frequently need to release all locks held by a given xid. Instead of searching one big list linearly, let's create one list per xid and put them in a hash table, so we can find what we need in O(1) time. Earlier analysis and a prototype were done by David Rowley, though this isn't his patch. Back-patch all the way. Author: Thomas Munro Diagnosed-by: David Rowley, Andres Freund Reviewed-by: Andres Freund, Tom Lane, Robert Haas Discussion: https://postgr.es/m/CAEepm%3D1mL0KiQ2KJ4yuPpLGX94a4Ns_W6TL4EGRouxWibu56pA%40mail.gmail.com Discussion: https://postgr.es/m/CAKJS1f9vJ841HY%3DwonnLVbfkTWGYWdPN72VMxnArcGCjF3SywA%40mail.gmail.com	2018-06-26 18:45:45 +12:00
Simon Riggs	15378c1a15	Remove AELs from subxids correctly on standby Issues relate only to subtransactions that hold AccessExclusiveLocks when replayed on standby. Prior to PG10, aborting subtransactions that held an AccessExclusiveLock failed to release the lock until top level commit or abort. `49bff5300d` fixed that. However, `49bff5300d` also introduced a similar bug where subtransaction commit would fail to release an AccessExclusiveLock, leaving the lock to be removed sometimes early and sometimes late. This commit fixes that bug also. Backpatch to PG10 needed. Tested by observation. Note need for multi-node isolationtester to improve test coverage for this and other HS cases. Reported-by: Simon Riggs Author: Simon Riggs	2018-06-16 14:03:29 +01:00
Tatsuo Ishii	1cfdb1cb0e	Fix memory leak in BufFileCreateShared(). Also this commit unifies some duplicated code in makeBufFile() and BufFileOpenShared() into new function makeBufFileCommon(). Author: Antonin Houska Reviewed-By: Thomas Munro, Tatsuo Ishii Discussion: https://postgr.es/m/16139.1529049566%40localhost	2018-06-16 14:21:08 +09:00
Tatsuo Ishii	969274d813	Fix memory leak. Memory is allocated twice for "file" and "files" variables in BufFileOpenShared(). Author: Antonin Houska Discussion: https://postgr.es/m/11329.1529045692%40localhost	2018-06-15 16:32:59 +09:00
Simon Riggs	dc878ffedf	Remove spurious code comments in standby related code GetRunningTransactionData() suggested that subxids were not worth optimizing away if overflowed, yet they have already been removed for that case. Changes to LogAccessExclusiveLock() API forgot to remove the prior comment when it was copied to LockAcquire().	2018-06-14 12:17:51 +01:00
Simon Riggs	802bde87ba	Remove cut-off bug from RunningTransactionData `32ac7a118f` tried to fix a Hot Standby issue reported by Greg Stark, but in doing so caused a different bug to appear, noted by Andres Freund. Revoke the core changes from `32ac7a118f`, leaving in its place a minor change in code ordering and comments to explain for the future.	2018-06-14 12:02:41 +01:00
Simon Riggs	32ac7a118f	Exclude VACUUMs from RunningXactData GetRunningTransactionData() should ignore VACUUM procs because in some cases they are assigned xids. This could lead to holding back xmin via the route of passing the xid to standby and then having that hold back xmin on master via feedback. Backpatch to 9.1 needed, but will only do so on supported versions. Backpatch once proven on the buildfarm. Reported-by: Greg Stark Author: Simon Riggs Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/CANP8+jJBYt=4PpTfiPb0UrH1_iPhzsxKH5Op_Wec634F0ohnAw@mail.gmail.com	2018-06-07 20:38:12 +01:00
Tom Lane	1d96c1b91a	Fix incorrect ordering of operations in pg_resetwal and pg_rewind. Commit `c37b3d08c` dropped its added GetDataDirectoryCreatePerm call into the wrong place in pg_resetwal.c, namely after the chdir to DataDir. That broke invocations using a relative path, as reported by Tushar Ahuja. We could have left it where it was and changed the argument to be ".", but that'd result in a rather confusing error message in event of a failure, so re-ordering seems like a better solution. Similarly reorder operations in pg_rewind.c. The issue there is that it doesn't seem like a good idea to do any actual operations before the not-root check (on Unix) or the restricted token acquisition (on Windows). I don't know that this is an actual bug, but I'm definitely not convinced that it isn't, either. Assorted other code review for `c37b3d08c` and `da9b580d8`: fix some misspelled or otherwise badly worded comments, put the #include for <sys/stat.h> where it actually belongs, etc. Discussion: https://postgr.es/m/aeb9c3a7-3c3f-a57f-1a18-c8d4fcdc2a1f@enterprisedb.com	2018-05-23 10:59:55 -04:00
Teodor Sigaev	0bef1c0678	Re-think predicate locking on GIN indexes. The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. (With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. Author: Heikki Linnakangas with some editorization by me Review: Andrey Borodin, Alexander Korotkov Discussion: https://www.postgresql.org/message-id/0b3ad2c2-2692-62a9-3a04-5724f2af9114@iki.fi	2018-05-04 11:27:50 +03:00
Tom Lane	fbb2e9a030	Fix assorted compiler warnings seen in the buildfarm. Failure to use DatumGetFoo/FooGetDatum macros correctly, or at all, causes some warnings about sign conversion. This is just cosmetic at the moment but in principle it's a type violation, so clean up the instances I could find. autoprewarm.c and sharedfileset.c contained code that unportably assumed that pid_t is the same size as int. We've variously dealt with this by casting pid_t to int or to unsigned long for printing purposes; I went with the latter. Fix uninitialized-variable warning in RestoreGUCState. This is a live bug in some sense, but of no great significance given that nobody is very likely to care what "line number" is associated with a GUC that hasn't got a source file recorded.	2018-05-02 15:52:54 -04:00
Heikki Linnakangas	445e31bdc7	Fix some sloppiness in the new BufFileSize() and BufFileAppend() functions. There were three related issues: * BufFileAppend() incorrectly reset the seek position on the 'source' file. As a result, if you had called BufFileRead() on the file before calling BufFileAppend(), it got confused, and subsequent calls would read/write at wrong position. * BufFileSize() did not work with files opened with BufFileOpenShared(). * FileGetSize() only worked on temporary files. To fix, change the way BufFileSize() works so that it works on shared files. Remove FileGetSize() altogether, as it's no longer needed. Remove buffilesize from TapeShare struct, as the leader process can simply call BufFileSize() to get the tape's size, there's no need to pass it through shared memory anymore. Discussion: https://www.postgresql.org/message-id/CAH2-WznEDYe_NZXxmnOfsoV54oFkTdMy7YLE2NPBLuttO96vTQ@mail.gmail.com	2018-05-02 17:23:13 +03:00
Tom Lane	9cb7db3f0c	In AtEOXact_Files, complain if any files remain unclosed at commit. This change makes this module act more like most of our other low-level resource management modules. It's a caller error if something is not explicitly closed by the end of a successful transaction, so issue a WARNING about it. This would not actually have caught the file leak bug fixed in commit `231bcd080`, because that was in a transaction-abort path; but it still seems like a good, and pretty cheap, cross-check. Discussion: https://postgr.es/m/152056616579.4966.583293218357089052@wrigleys.postgresql.org	2018-04-28 17:45:02 -04:00
Peter Eisentraut	d4f16d5071	perltidy: Add option --nooutdent-long-quotes	2018-04-27 11:37:43 -04:00
Tom Lane	bdf46af748	Post-feature-freeze pgindent run. Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us	2018-04-26 14:47:16 -04:00
Tom Lane	231bcd0803	Fix incorrect close() call in dsm_impl_mmap(). One improbable error-exit path in this function used close() where it should have used CloseTransientFile(). This is unlikely to be hit in the field, and I think the consequences wouldn't be awful (just an elog(LOG) bleat later). But a bug is a bug, so back-patch to 9.4 where this code came in. Pan Bian Discussion: https://postgr.es/m/152056616579.4966.583293218357089052@wrigleys.postgresql.org	2018-04-10 18:34:54 -04:00
Tom Lane	af1a949109	Further cleanup of client dependencies on src/include/catalog headers. In commit `9c0a0de4c`, I'd failed to notice that catalog/catalog.h should also be considered a frontend-unsafe header, because it includes (and needs) the full form of pg_class.h, not to mention relcache.h. However, various frontend code was depending on it to get TABLESPACE_VERSION_DIRECTORY, so refactoring of some sort is called for. The cleanest answer seems to be to move TABLESPACE_VERSION_DIRECTORY, as well as the OIDCHARS symbol, to common/relpath.h. Do that, and mop up inclusions as necessary. (I found that quite a few current users of catalog/catalog.h don't seem to need it at all anymore, apparently as a result of the refactorings that created common/relpath.[hc]. And initdb.c needed it only as a route to pg_class_d.h.) Discussion: https://postgr.es/m/6629.1523294509@sss.pgh.pa.us	2018-04-09 14:39:58 -04:00
Magnus Hagander	a228cc13ae	Revert "Allow on-line enabling and disabling of data checksums" This reverts the backend sides of commit `1fde38beaa`. I have, at least for now, left the pg_verify_checksums tool in place, as this tool can be very valuable without the rest of the patch as well, and since it's a read-only tool that only runs when the cluster is down it should be a lot safer.	2018-04-09 19:03:42 +02:00
Stephen Frost	da9b580d89	Refactor dir/file permissions Consolidate directory and file create permissions for tools which work with the PG data directory by adding a new module (common/file_perm.c) that contains variables (pg_file_create_mode, pg_dir_create_mode) and constants to initialize them (0600 for files and 0700 for directories). Convert mkdir() calls in the backend to MakePGDirectory() if the original call used default permissions (always the case for regular PG directories). Add tests to make sure permissions in PGDATA are set correctly by the tools which modify the PG data directory. Authors: David Steele <david@pgmasters.net>, Adam Brightwell <adam.brightwell@crunchydata.com> Reviewed-By: Michael Paquier, with discussion amongst many others. Discussion: https://postgr.es/m/ad346fe6-b23e-59f1-ecb7-0e08390ad629%40pgmasters.net	2018-04-07 17:45:39 -04:00
Teodor Sigaev	b508a56f2f	Predicate locking in hash indexes. Hash index searches acquire predicate locks on the primary page of a bucket. It acquires a lock on both the old and new buckets for scans that happen concurrently with page splits. During a bucket split, a predicate lock is copied from the primary page of an old bucket to the primary page of a new bucket. Author: Shubham Barai, Amit Kapila Reviewed by: Amit Kapila, Alexander Korotkov, Thomas Munro Discussion: https://www.postgresql.org/message-id/flat/CALxAEPvNsM2GTiXdRgaaZ1Pjd1bs+sxfFsf7Ytr+iq+5JJoYXA@mail.gmail.com	2018-04-07 16:59:14 +03:00
Magnus Hagander	1fde38beaa	Allow on-line enabling and disabling of data checksums This makes it possible to turn checksums on in a live cluster, without the previous need for dump/reload or logical replication (and to turn it off). Enabling checkusm starts a background process in the form of a launcher/worker combination that goes through the entire database and recalculates checksums on each and every page. Only when all pages have been checksummed are they fully enabled in the cluster. Any failure of the process will revert to checksums off and the process has to be started. This adds a new WAL record that indicates the state of checksums, so the process works across replicated clusters. Authors: Magnus Hagander and Daniel Gustafsson Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin	2018-04-05 22:04:48 +02:00
Tom Lane	0b11a674fb	Fix a boatload of typos in C comments. Justin Pryzby Discussion: https://postgr.es/m/20180331105640.GK28454@telsasoft.com	2018-04-01 15:01:28 -04:00
Tom Lane	e5eb4fa873	Remove obsolete SLRU wrapping and warnings from predicate.c. When SSI was developed, slru.c was limited to segment files with names in the range 0000-FFFF. This didn't allow enough space for predicate.c to store every possible XID when spilling old transactions to disk, so it would wrap around sooner and print warnings. Since commits `638cf09e` and `73c986ad` increased the number of segment files slru.c could manage, that behavior is unnecessary. Therefore remove that code. Also remove the macro OldSerXidSegment, which has been unused since `4cd3fb6e`. Thomas Munro, reviewed by Anastasia Lubennikova Discussion: https://postgr.es/m/CAEepm=3XfsTSxgEbEOmxu0QDiXy0o18NUg2nC89JZcCGE+XFPA@mail.gmail.com	2018-03-30 15:11:39 -04:00
Teodor Sigaev	43d1ed60fd	Predicate locking in GIN index Predicate locks are used on per page basis only if fastupdate = off, in opposite case predicate lock on pending list will effectively lock whole index, to reduce locking overhead, just lock a relation. Entry and posting trees are essentially B-tree, so locks are acquired on leaf pages only. Author: Shubham Barai with some editorization by me and Dmitry Ivanov Review by: Alexander Korotkov, Dmitry Ivanov, Fedor Sigaev Discussion: https://www.postgresql.org/message-id/flat/CALxAEPt5sWW+EwTaKUGFL5_XFcZ0MuGBcyJ70oqbWqr42YKR8Q@mail.gmail.com	2018-03-30 14:23:17 +03:00
Tom Lane	2b1759e267	Remove unnecessary BufferGetPage() calls in fsm_vacuum_page(). Just noticed that these were quite redundant, since we're holding the page address in a local variable anyway, and we have pin on the buffer throughout. Also improve a comment.	2018-03-29 12:44:19 -04:00
Tom Lane	a063baaced	Remove UpdateFreeSpaceMap(), use FreeSpaceMapVacuumRange() instead. FreeSpaceMapVacuumRange has the same effect, is more efficient if many pages are involved, and makes fewer assumptions about how it's used. Notably, Claudio Freire pointed out that UpdateFreeSpaceMap could fail if the specified freespace value isn't the maximum possible. This isn't a problem for the single existing user, but the function represents an attractive nuisance IMO, because it's named as though it were a general-purpose update function and its limitations are undocumented. In any case we don't need multiple ways to get the same result. In passing, do some code review and cleanup in RelationAddExtraBlocks. In particular, I see no excuse for it to omit the PageIsNew safety check that's done in the mainline extension path in RelationGetBufferForTuple. Discussion: https://postgr.es/m/CAGTBQpYR0uJCNTt3M5GOzBRHo+-GccNO1nCaQ8yEJmZKSW5q1A@mail.gmail.com	2018-03-29 12:22:44 -04:00
Bruce Momjian	bc0021ef09	C comment: fix wording about shared memory message queue Reported-by: Tels Discussion: https://postgr.es/m/e66e05bc55f5ce904e361ad17a3395ae.squirrel@sm.webmail.pair.com	2018-03-29 12:18:42 -04:00
Tom Lane	851a26e266	While vacuuming a large table, update upper-level FSM data every so often. VACUUM updates leaf-level FSM entries immediately after cleaning the corresponding heap blocks. fsmpage.c updates the intra-page search trees on the leaf-level FSM pages when this happens, but it does not touch the upper-level FSM pages, so that the released space might not actually be findable by searchers. Previously, updating the upper-level pages happened only at the conclusion of the VACUUM run, in a single FreeSpaceMapVacuum() call. This is bad because the VACUUM might get canceled before ever reaching that point, so that from the point of view of searchers no space has been freed at all, leading to table bloat. We can improve matters by updating the upper pages immediately after each cycle of index-cleaning and heap-cleaning, processing just the FSM pages corresponding to the range of heap blocks we have now fully cleaned. This adds a small amount of extra work, since the FSM pages leading down to each range boundary will be touched twice, but it's pretty negligible compared to everything else going on in a large VACUUM. If there are no indexes, VACUUM doesn't work in cycles but just cleans each heap page on first visit. In that case we just arbitrarily update upper FSM pages after each 8GB of heap. That maintains the goal of not letting all this work slide until the very end, and it doesn't seem worth expending extra complexity on a case that so seldom occurs in practice. In either case, the FSM is fully up to date before any attempt is made to truncate the relation, so that the most likely scenario for VACUUM cancellation no longer results in out-of-date upper FSM pages. When we do successfully truncate, adjusting the FSM to reflect that is now fully handled within FreeSpaceMapTruncateRel. Claudio Freire, reviewed by Masahiko Sawada and Jing Wang, some additional tweaks by me Discussion: https://postgr.es/m/CAGTBQpYR0uJCNTt3M5GOzBRHo+-GccNO1nCaQ8yEJmZKSW5q1A@mail.gmail.com	2018-03-29 11:29:54 -04:00
Teodor Sigaev	920a5e500a	Skip temp tables from basebackup. Do not store temp tables in basebackup, they will not be visible anyway, so, there are not reasons to store them. Author: David Steel Reviewed by: me Discussion: https://www.postgresql.org/message-id/flat/5ea4d26a-a453-c1b7-eff9-5a3ef8f8aceb@pgmasters.net	2018-03-27 16:14:40 +03:00
Teodor Sigaev	3ad55863e9	Add predicate locking for GiST Add page-level predicate locking, due to gist's code organization, patch seems close to trivial: add check before page changing, add predicate lock before page scanning. Although choosing right place to check is not simple: it should not be called during index build, it should support insertion of new downlink and so on. Author: Shubham Barai with editorization by me and Alexander Korotkov Reviewed by: Alexander Korotkov, Andrey Borodin, me Discussion: https://www.postgresql.org/message-id/flat/CALxAEPtdcANpw5ePU3LvnTP8HCENFw6wygupQAyNBgD-sG3h0g@mail.gmail.com	2018-03-27 15:43:19 +03:00
Tom Lane	4b538727e2	Fix make rules that generate multiple output files. For years, our makefiles have correctly observed that "there is no correct way to write a rule that generates two files". However, what we did is to provide empty rules that "generate" the secondary output files from the primary one, and that's not right either. Depending on the details of the creating process, the primary file might end up timestamped later than one or more secondary files, causing subsequent make runs to consider the secondary file(s) out of date. That's harmless in a plain build, since make will just re-execute the empty rule and nothing happens. But it's fatal in a VPATH build, since make will expect the secondary file to be rebuilt in the build directory. This would manifest as "file not found" failures during VPATH builds from tarballs, if we were ever unlucky enough to ship a tarball with apparently out-of-date secondary files. (It's not clear whether that has ever actually happened, but it definitely could.) To ensure that secondary output files have timestamps >= their primary's, change our makefile convention to be that we provide a "touch $@" action not an empty rule. Also, make sure that this rule actually gets invoked during a distprep run, else the hazard remains. It's been like this a long time, so back-patch to all supported branches. In HEAD, I skipped the changes in src/backend/catalog/Makefile, because those rules are due to get replaced soon in the bootstrap data format patch, and there seems no need to create a merge issue for that patch. If for some reason we fail to land that patch in v11, we'll need to back-fill the changes in that one makefile from v10. Discussion: https://postgr.es/m/18556.1521668179@sss.pgh.pa.us	2018-03-23 13:46:00 -04:00
Teodor Sigaev	8694cc96b5	Exclude unlogged tables from base backups Exclude unlogged tables from base backup entirely except init fork which marks created unlogged table. The next question is do not backup temp table but it's a story for separate patch. Author: David Steele Review by: Adam Brightwell, Masahiko Sawada Discussion: https://www.postgresql.org/message-id/flat/04791bab-cb04-ba43-e9c0-664a4c1ffb2c@pgmasters.net	2018-03-23 19:14:12 +03:00
Robert Haas	42d7074ebb	shm_mq: Fix detach race condition. Commit `34db06ef9a` adopted a lock-free design for shm_mq.c, but it introduced a race condition that could lose messages. When shm_mq_receive_bytes() detects that the other end has detached, it must make sure that it has seen the final version of mq_bytes_written, or it might miss a message sent before detaching. Thomas Munro Discussion: https://postgr.es/m/CAEepm%3D2myZ4qxpt1a%3DC%2BwEv3o188K13K3UvD-44FK0SdAzHy%2Bw%40mail.gmail.com	2018-03-05 15:12:49 -05:00
Robert Haas	497171d3e2	shm_mq: Have the receiver set the sender's less frequently. Instead of marking data from the ringer buffer consumed and setting the sender's latch for every message, do it only when the amount of data we can consume is at least 1/4 of the size of the ring buffer, or when no data remains in the ring buffer. This is dramatically faster in my testing; apparently, the savings from sending signals less frequently outweighs the benefit of letting the sender know about available buffer space sooner. Patch by me, reviewed by Andres Freund and tested by Rafia Sabih. Discussion: http://postgr.es/m/CA+TgmoYK7RFj6r7KLEfSGtYZCi3zqTRhAz8mcsDbUAjEmLOZ3Q@mail.gmail.com	2018-03-02 12:20:30 -05:00
Robert Haas	34db06ef9a	shm_mq: Reduce spinlock usage. Previously, mq_bytes_read and mq_bytes_written were protected by the spinlock, but that turns out to cause pretty serious spinlock contention on queries which send many tuples through a Gather or Gather Merge node. This patches changes things so that we instead read and write those values using 8-byte atomics. Since mq_bytes_read can only be changed by the receiver and mq_bytes_written can only be changed by the sender, the only purpose of the spinlock is to prevent reads and writes of these values from being torn on platforms where 8-byte memory access is not atomic, making the conversion fairly straightforward. Testing shows that this produces some slowdown if we're using emulated 64-bit atomics, but since they should be available on any platform where performance is a primary concern, that seems OK. It's faster, sometimes a lot faster, on platforms where such atomics are available. Patch by me, reviewed by Andres Freund, who also suggested the design. Also tested by Rafia Sabih. Discussion: http://postgr.es/m/CA+TgmoYuK0XXxmUNTFT9TSNiBtWnRwasBcHHRCOK9iYmDLQVPg@mail.gmail.com	2018-03-02 12:16:59 -05:00
Andres Freund	07c6e5163e	Remove volatile qualifiers from shm_mq.c. Since commit `0709b7ee`, spinlock primitives include a compiler barrier so it is no longer necessary to access either spinlocks or the memory they protect through pointer-to-volatile. Like earlier commits `e93b6298`, `d53e3d5f`, `430008b5`, `8f6bb851`, `df4077cd`. Author: Thomas Munro Discussion: https://postgr.es/m/CAEepm=204T37SxcHo4=xw5btho9jQ-=ZYYrVdcKyz82XYzMoqg@mail.gmail.com	2018-03-01 16:21:52 -08:00
Robert Haas	73797b7884	Document LWTRANCHE_PARALLEL_HASH_JOIN. Thomas Munro Discussion: http://postgr.es/m/CAEepm=3g1hhbFzYkR_QT9RmBvsGX4UaeCtX-4Js8OOEMmFeaSQ@mail.gmail.com	2018-02-28 11:46:26 -05:00
Robert Haas	a6a80134e3	Remove extra words. Thomas Munro Discussion: http://postgr.es/m/CAEepm=2x3NUSPed6=-wDYs39KtUU5Dw3mK_NAMWps+18FmkApQ@mail.gmail.com	2018-02-22 18:06:30 -05:00
Magnus Hagander	9a44a26b65	Fix typo Author: Masahiko Sawada	2018-02-20 12:03:18 +01:00
Tom Lane	524d64ea8e	Remove bogus "extern" annotations on function definitions. While this is not illegal C, project style is to put "extern" only on declarations not definitions. David Rowley Discussion: https://postgr.es/m/CAKJS1f9RKLWXcMBQhvDYhmsMEo+ALuNgA-NE+AX5Uoke9DJ2Xg@mail.gmail.com	2018-02-19 12:07:44 -05:00
Peter Eisentraut	ad9a274778	Fix crash when canceling parallel query elog(FATAL) would end up calling PortalCleanup(), which would call executor shutdown code, which could fail and crash, especially under parallel query. This was introduced by `8561e4840c`, which did not want to mark an active portal as failed by a normal transaction abort anymore. But we do need to do that for an elog(FATAL) exit. Introduce a variable shmem_exit_inprogress similar to the existing proc_exit_inprogress, so we can tell whether we are in the FATAL exit scenario. Reported-by: Andres Freund <andres@anarazel.de>	2018-02-16 16:21:24 -05:00
Tom Lane	51940f9760	Cast to void in StaticAssertExpr, not its callers. Seems a bit silly that many (in fact all, as of today) uses of StaticAssertExpr would need to cast it to void to avoid warnings from pickier compilers. Let's just do the cast right in the macro, instead. In passing, change StaticAssertExpr to StaticAssertStmt in one place where that seems more apropos. Discussion: https://postgr.es/m/16161.1518715186@sss.pgh.pa.us	2018-02-15 13:41:30 -05:00
Robert Haas	9da0cc3528	Support parallel btree index builds. To make this work, tuplesort.c and logtape.c must also support parallelism, so this patch adds that infrastructure and then applies it to the particular case of parallel btree index builds. Testing to date shows that this can often be 2-3x faster than a serial index build. The model for deciding how many workers to use is fairly primitive at present, but it's better than not having the feature. We can refine it as we get more experience. Peter Geoghegan with some help from Rushabh Lathia. While Heikki Linnakangas is not an author of this patch, he wrote other patches without which this feature would not have been possible, and therefore the release notes should possibly credit him as an author of this feature. Reviewed by Claudio Freire, Heikki Linnakangas, Thomas Munro, Tels, Amit Kapila, me. Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com	2018-02-02 13:32:44 -05:00
Peter Eisentraut	9e945f8626	Fix Latin spelling "c.f." should be "cf.".	2018-01-11 08:32:01 -05:00
Tom Lane	3afd75eaac	Remove dubious micro-optimization in ckpt_buforder_comparator(). It seems incorrect to assume that the list of CkptSortItems can never contain duplicate page numbers: concurrent activity could result in some page getting dropped from a low-numbered buffer and later loaded into a high-numbered buffer while BufferSync is scanning the buffer pool. If that happened, the comparator would give self-inconsistent results, potentially confusing qsort(). Saving one comparison step is not worth possibly getting the sort wrong. So far as I can tell, nothing would actually go wrong given our current implementation of qsort(). It might get a bit slower than expected if there were a large number of duplicates of one value, but that's surely a probability-epsilon case. Still, the comment is wrong, and if we ever switched to another sort implementation it might be less forgiving. In passing, avoid casting away const-ness of the argument pointers; I've not seen any compiler complaints from that, but it seems likely that some compilers would not like it. Back-patch to 9.6 where this code came in, just in case I've underestimated the possible consequences. Discussion: https://postgr.es/m/18437.1515607610@sss.pgh.pa.us	2018-01-10 15:50:54 -05:00
Tom Lane	80259d4dbf	While waiting for a condition variable, detect postmaster death. The general assumption for postmaster child processes is that they should just exit(1), reasonably promptly, if the postmaster disappears. condition_variable.c neglected this consideration and could be left waiting forever, if the counterpart process it is waiting for has done the right thing and exited. We had some discussion of adjusting the WaitEventSet API to make it harder to make this type of mistake in future; but for the moment, and for v10, let's make this narrow fix. Discussion: https://postgr.es/m/20412.1515456143@sss.pgh.pa.us	2018-01-09 12:34:57 -05:00

... 4 5 6 7 8 ...

2447 Commits