Commit Graph

305 Commits

Alvaro Herrera bf902c1393
Revert "Fix replay of create database records on standby"
This reverts commit 49d9cfc68b.  The approach taken by this patch has
problems, so we'll come up with a radically different fix.

Discussion: https://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
2022-03-29 15:36:21 +02:00
Alvaro Herrera 7103ebb7aa
Add support for MERGE SQL command
MERGE performs actions that modify rows in the target table using a
source table or query. MERGE provides a single SQL statement that can
conditionally INSERT/UPDATE/DELETE rows -- a task that would otherwise
require multiple PL statements.  For example,

MERGE INTO target AS t
USING source AS s
ON t.tid = s.sid
WHEN MATCHED AND t.balance > s.delta THEN
  UPDATE SET balance = t.balance - s.delta
WHEN MATCHED THEN
  DELETE
WHEN NOT MATCHED AND s.delta > 0 THEN
  INSERT VALUES (s.sid, s.delta)
WHEN NOT MATCHED THEN
  DO NOTHING;

MERGE works with regular tables, partitioned tables and inheritance
hierarchies, including column and row security enforcement, as well as
support for row and statement triggers and transition tables therein.

MERGE is optimized for OLTP and is parameterizable, though also useful
for large scale ETL/ELT. MERGE is not intended to be used in preference
to existing single SQL commands for INSERT, UPDATE or DELETE since there
is some overhead.  MERGE can be used from PL/pgSQL.

MERGE does not support targeting updatable views or foreign tables, and
RETURNING clauses are not allowed either.  These limitations are likely
fixable with sufficient effort.  Rewrite rules are also not supported,
but it's not clear that we'd want to support them.

Author: Pavan Deolasee <pavan.deolasee@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Reviewed-by: Peter Geoghegan <pg@bowt.ie> (earlier versions)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions)
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Discussion: https://postgr.es/m/CANP8+jKitBSrB7oTgT9CY2i1ObfOt36z0XMraQc+Xrz8QB0nXA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkJdBuxj9PO=2QaO9-3h3xGbQPZ34kJH=HukRekwM-GZg@mail.gmail.com
Discussion: https://postgr.es/m/20201231134736.GA25392@alvherre.pgsql
2022-03-28 16:47:48 +02:00
Alvaro Herrera 49d9cfc68b
Fix replay of create database records on standby
Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the standby
would fail to recover in such a case.  However, the directories could be
legitimately missing.  Consider a sequence of WAL records as follows:

    CREATE DATABASE
    DROP DATABASE
    DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism similar to invalid-page tracking, to keep a
tally of missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at the
end of crash recovery, the standby can safely continue following the
primary.

Backpatch to 13, at least for now.  The bug is older, but fixing it in
older branches requires more careful study of the interactions with
commit e6d8069522, which appeared in 13.

A new TAP test file is added to verify the condition.  However, because
it depends on commit d6d317dbf6, it can only be added to branch
master.  I (Álvaro) manually verified that the code behaves as expected
in branch 14.  It's a bit nervous-making to leave the code uncovered by
tests in older branches, but leaving the bug unfixed is even worse.
Also, the main reason this fix took so long is precisely that we
couldn't agree on a good strategy to approach testing for the bug, so
perhaps this is the best we can do.

Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
2022-03-25 13:16:21 +01:00
Michael Paquier d4781d8873 Refactor code related to pg_hba_file_rules() into new file
hba.c is growing big, and more contents are planned for it.  In order to
prepare for this future work, this commit moves all the code related to
pg_hba_file_rules(), the system function processing the contents of
pg_hba.conf, to a new file called hbafuncs.c, which will be used as the
location for the SQL portion of the authentication file parsing.
While at it, HbaToken, the structure holding a string token lexed from a
configuration file related to authentication, is renamed to the more
generic AuthToken, as it is used not only for pg_hba.conf but also
for pg_ident.conf.  TokenizedLine is now named TokenizedAuthLine.

The size of hba.c is reduced by ~12%.
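
As context, the system function moved here can be queried like this (a
minimal sketch; the column list is abbreviated):

    SELECT line_number, type, database, user_name, auth_method
      FROM pg_hba_file_rules;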

Author: Julien Rouhaud
Reviewed-by: Aleksander Alekseev, Michael Paquier
Discussion: https://postgr.es/m/20220223045959.35ipdsvbxcstrhya@jrouhaud
2022-03-24 12:42:30 +09:00
Andres Freund a3a75b982b pgstat: run pgindent on pgstat.c/h.
Upcoming commits will touch a lot of the pgstats code. Reindenting separately
makes it easier to keep the code in a well-formatted shape at each step.

Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de
2022-03-19 11:42:22 -07:00
Thomas Munro 3f1ce97346 Add circular WAL decoding buffer, take II.
Teach xlogreader.c to decode the WAL into a circular buffer.  This will
support optimizations based on looking ahead, to follow in a later
commit.

 * XLogReadRecord() works as before, decoding records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros and certain traditional members like xlogreader->ReadRecPtr.

 * An alternative new interface XLogReadAhead()/XLogNextRecord() is
   added that returns pointers to DecodedXLogRecord objects so that it's
   now possible to look ahead in the WAL stream while replaying.

 * In order to be able to use the new interface effectively while
   streaming data, support is added for the page_read() callback to
   respond to a new nonblocking mode with XLREAD_WOULDBLOCK instead of
   waiting for more data to arrive.

No direct user of the new interface is included in this commit, though
XLogReadRecord() uses it internally.  Existing code doesn't need to
change, except in a few places where it was accessing reader internals
directly and now needs to go through accessor macros.

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
2022-03-18 18:45:47 +13:00
Alvaro Herrera 25e777cf8e
Split ExecUpdate and ExecDelete into reusable pieces
Create subroutines ExecUpdatePrologue / ExecUpdateAct /
ExecUpdateEpilogue, and similar for ExecDelete.

Introduce a new struct to be used internally in nodeModifyTable.c,
dubbed ModifyTableContext, which contains all context information needed
to perform these operations, as well as ExecInsert and others.

This allows using a different schedule and a different way of evaluating
the results of these operations, which can be exploited by a later
commit introducing support for MERGE.  It also makes ExecUpdate and
ExecDelete proper shorter and (hopefully) simpler.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Discussion: https://postgr.es/m/202202271724.4z7xv3cf46kv@alvherre.pgsql
2022-03-17 11:47:04 +01:00
Andres Freund 352d297dc7 dshash: Add sequential scan support.
Add the ability to scan all entries of a dshash table sequentially.  The
interface is similar to, but a bit different from, both dynahash and the
simple dshash search functions.  The most significant difference is that
dshash's interface always requires a call to dshash_seq_term() when a
scan ends.  Another is locking: dshash holds a partition lock when
returning an entry; dshash_seq_next() also holds the lock when returning
an entry, but callers shouldn't release it, since the lock is essential
to continue the scan.  The seqscan interface allows entry deletion while
a scan is in progress using dshash_delete_current().

Reviewed-By: Andres Freund <andres@anarazel.de>
Author: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
2022-03-10 12:57:05 -08:00
Amit Kapila 7a85073290 Reconsider pg_stat_subscription_workers view.
It was decided (refer to the Discussion link below) that the stats
collector is not an appropriate place to store the error information of
subscription workers.

This patch changes the pg_stat_subscription_workers view (introduced by
commit 8d74fc96db) so that it stores only statistics counters:
apply_error_count and sync_error_count, and has one entry for
each subscription. The removed error information, such as the error XID
and the error message, will be stored in another way in the future, one
that is more reliable and persistent.

After removing these error details, there is no longer any relation
information, so the subscription statistics are now cluster-wide
statistics.

The patch also changes the view name to pg_stat_subscription_stats, since
the word "worker" exposed the implementation detail that we use one
worker for each tablesync and one for applying changes.
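
A minimal query sketch against the renamed view, using the counters
introduced above (subname is assumed to identify the subscription):

    SELECT subname, apply_error_count, sync_error_count
      FROM pg_stat_subscription_stats;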

Author: Masahiko Sawada, based on suggestions by Andres Freund
Reviewed-by: Peter Smith, Haiying Tang, Takamichi Osumi, Amit Kapila
Discussion: https://postgr.es/m/20220125063131.4cmvsxbz2tdg6g65@alap3.anarazel.de
2022-03-01 06:17:52 +05:30
Amit Kapila 52e4f0cd47 Allow specifying row filters for logical replication of tables.
This feature adds row filtering for publication tables. When a publication
is defined or modified, an optional WHERE clause can be specified. Rows
that don't satisfy this WHERE clause will be filtered out. This allows a
set of tables to be partially replicated. The row filter is per table. A
new row filter can be added simply by specifying a WHERE clause after the
table name. The WHERE clause must be enclosed by parentheses.

The row filter WHERE clause for a table added to a publication that
publishes UPDATE and/or DELETE operations must contain only columns that
are covered by REPLICA IDENTITY. The row filter WHERE clause for a table
added to a publication that publishes INSERT can use any column. If the
row filter evaluates to NULL, it is regarded as "false". The WHERE clause
only allows simple expressions that don't have user-defined functions,
user-defined operators, user-defined types, user-defined collations,
non-immutable built-in functions, or references to system columns. These
restrictions could be addressed in the future.

If you choose to do the initial table synchronization, only data that
satisfies the row filters is copied to the subscriber. If the subscription
has several publications in which a table has been published with
different WHERE clauses, rows that satisfy ANY of the expressions will be
copied. If a subscriber is a pre-15 version, the initial table
synchronization won't use row filters even if they are defined on the
publisher.

The row filters are applied before publishing the changes. If the
subscription has several publications in which the same table has been
published with different filters (for the same publish operation), those
expressions get OR'ed together so that rows satisfying any of the
expressions will be replicated.

This means all the other filters become redundant if (a) one of the
publications has no filter at all, (b) one of the publications was
created using FOR ALL TABLES, or (c) one of the publications was created
using FOR ALL TABLES IN SCHEMA and the table belongs to that same schema.

If your publication contains a partitioned table, the publication
parameter publish_via_partition_root determines if it uses the partition's
row filter (if the parameter is false, the default) or the root
partitioned table's row filter.

Psql commands \dRp+ and \d <table-name> will display any row filters.
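
For example (table, column and publication names hypothetical), a
publication that replicates only the rows satisfying a per-table filter:

    CREATE PUBLICATION pub_active
        FOR TABLE employees WHERE (active IS TRUE);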

Author: Hou Zhijie, Euler Taveira, Peter Smith, Ajin Cherian
Reviewed-by: Greg Nancarrow, Haiying Tang, Amit Kapila, Tomas Vondra, Dilip Kumar, Vignesh C, Alvaro Herrera, Andres Freund, Wei Wang
Discussion: https://www.postgresql.org/message-id/flat/CAHE3wggb715X%2BmK_DitLXF25B%3DjE6xyNCH4YOwM860JR7HarGQ%40mail.gmail.com
2022-02-22 08:11:50 +05:30
Heikki Linnakangas 70e81861fa Split xlog.c into xlog.c and xlogrecovery.c.
This moves the functions related to performing WAL recovery into the new
xlogrecovery.c source file, leaving xlog.c responsible for maintaining
the WAL buffers, coordinating the startup and switch from recovery to
normal operations, and other miscellaneous stuff that have always been in
xlog.c.

Reviewed-by: Andres Freund, Kyotaro Horiguchi, Robert Haas
Discussion: https://www.postgresql.org/message-id/a31f27b4-a31d-f976-6217-2b03be646ffa%40iki.fi
2022-02-16 09:30:38 +02:00
Alexander Korotkov f1ea98a797 Reduce non-leaf keys overlap in GiST indexes produced by a sorted build
The GiST sorted build currently chooses split points based only on page
space utilization.  That may lead to greater overlap among non-leaf keys
and, in turn, slower answers to search queries.

This commit makes the sorted build use the opclass's picksplit method.  Once
four pages at a level are accumulated, the picksplit method is applied until
each split partition fits a page.  Some of our split algorithms could show
significant performance degradation when processing four times more data at
once, but those opclasses haven't received sorted build support and
shouldn't receive it before their split algorithms are improved.

Discussion: https://postgr.es/m/CAHqSB9jqtS94e9%3D0vxqQX5dxQA89N95UKyz-%3DA7Y%2B_YJt%2BVW5A%40mail.gmail.com
Author: Aliaksandr Kalenik, Sergei Shoulbakov, Andrey Borodin
Reviewed-by: Björn Harrtell, Darafei Praliaskouski, Andres Freund
Reviewed-by: Alexander Korotkov
2022-02-07 23:20:42 +03:00
Michael Paquier 5513dc6a30 Improve error handling of HMAC computations
This is similar to b69aba7, except that it completes the work for HMAC
with a new routine called pg_hmac_error() that provides more context
about the type of error that happened during an HMAC computation:
- The fallback HMAC implementation in hmac.c relies on cryptohashes, so
in some code paths it is necessary to return back the error generated by
cryptohashes.
- For the OpenSSL implementation (hmac_openssl.c), the logic is very
similar to cryptohash_openssl.c, where the error context comes from
OpenSSL if one of its internal routines failed, with different error
codes if something internal to hmac_openssl.c failed or was incorrect.

Any in-core code paths that use the centralized HMAC interface are
related to SCRAM, with only SHA-256, for errors that are unlikely to
happen.  It would be possible to see errors when computing HMACs with
MD5, for example with OpenSSL FIPS mode enabled, and this commit would
help in reporting the correct errors, but nothing in core uses that.
So, in the end, no backpatch to v14 is done, at least for now.

Errors in SCRAM related to the computation of the server key, stored
key, etc. need to pass down the potential error context string across
more layers of their respective call stacks for the frontend and the
backend, so each surrounding routine is adapted for this purpose.

Reviewed-by: Sergey Shinderuk
Discussion: https://postgr.es/m/Yd0N9tSAIIkFd+qi@paquier.xyz
2022-01-13 16:17:21 +09:00
Michael Paquier b69aba7457 Improve error handling of cryptohash computations
The existing cryptohash facility was causing problems in some code paths
related to MD5 (frontend and backend) that relied on the fact that the
only type of error that could happen was an OOM, as the MD5
implementation used in PostgreSQL 13 and older (the in-core
implementation is used when compiling with or without OpenSSL in those
older versions) could fail only under that circumstance.

The new cryptohash facilities can fail for reasons other than OOMs, like
attempting MD5 when FIPS is enabled (upstream OpenSSL allows that up to
1.0.2; Fedora and Photon patch OpenSSL 1.1.1 to allow it), which caused
incorrect error reports to show up.

This commit extends the cryptohash APIs so that callers of those routines
can fetch more context when an error happens, by using a new routine
called pg_cryptohash_error().  The error states are stored within each
implementation's internal context data, so that the logic can be extended
depending on what suits each implementation.  The default implementation
requires few error states, but OpenSSL could report various issues
depending on its internal state, so more is needed in
cryptohash_openssl.c, and the code is shaped so that we are always able
to grab the necessary information.

The core code is changed to adapt to the new error routine, painting
more "const" across the call stack where the static errors are stored,
particularly in authentication code paths on variables that provide
log details.  This way, any future changes would warn if attempting to
free these strings.  The MD5 authentication code was also a bit blurry
about the handling of "logdetail" (the LOG message sent to the
postmaster), so the related comments are improved while at it.

The origin of the problem is commit 87ae969, which introduced the centralized
cryptohash facility.  Extra changes are done for pgcrypto in v14 for the
non-OpenSSL code path to cope with the improvements done by this
commit.

Reported-by: Michael Mühlbeyer
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/89B7F072-5BBE-4C92-903E-D83E865D9367@trivadis.com
Backpatch-through: 14
2022-01-11 09:55:16 +09:00
Amit Kapila 8e1fae1938 Move parallel vacuum code to vacuumparallel.c.
This commit moves parallel vacuum related code to a new file
commands/vacuumparallel.c so that any table AM supporting indexes can
utilize parallel vacuum in order to call index AM callbacks (ambulkdelete
and amvacuumcleanup) with parallel workers.

Another reason for this refactoring is that parallel vacuum isn't
specific to heap, so it doesn't make sense to keep this code in
heap/vacuumlazy.c.

Author: Masahiko Sawada, based on suggestion from Andres Freund
Reviewed-by: Hou Zhijie, Amit Kapila, Haiying Tang
Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de
2021-12-23 11:42:52 +05:30
Amit Kapila cc8b25712b Move index vacuum routines to vacuum.c.
An upcoming patch moves parallel vacuum code out of vacuumlazy.c. This
code restructuring will allow both lazy vacuum and parallel vacuum to use
index vacuum functions.

Author: Masahiko Sawada
Reviewed-by: Hou Zhijie, Amit Kapila
Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de
2021-12-22 07:55:14 +05:30
Amit Kapila 22bd3cbe0c Improve parallel vacuum implementation.
Previously, in parallel vacuum, we allocated a shmem area for
IndexBulkDeleteResult only for indexes where parallel index vacuuming is
safe, and used a null-bitmap in the shmem area to access them. This logic
was too complicated for the small benefit of saving only a few bits per
index.

In this commit, we allocate a dedicated shmem area for the array of
LVParallelIndStats, which includes a parallel-safety flag, the index
vacuum status, and IndexBulkDeleteResult. There is one array element for
every index, even those indexes where parallel index vacuuming is unsafe
or not worthwhile. This commit makes the code clearer by removing all
bitmap-related code.

Also, check each index's vacuum status after a parallel index vacuum
to make sure that all indexes have been processed.

Finally, rename parallel vacuum functions to parallel_vacuum_* for
consistency.

Author: Masahiko Sawada, based on suggestions by Andres Freund
Reviewed-by: Hou Zhijie, Amit Kapila
Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de
2021-12-15 07:58:19 +05:30
Amit Kapila 8d74fc96db Add a view to show the stats of subscription workers.
This commit adds a new system view pg_stat_subscription_workers, that
shows information about any errors which occur during the application of
logical replication changes as well as during performing initial table
synchronization. The subscription statistics entries are removed when the
corresponding subscription is removed.

It also adds an SQL function pg_stat_reset_subscription_worker() to reset
single subscription errors.

The contents of this view can be used by an upcoming patch that skips the
particular transaction that conflicts with the existing data on the
subscriber.

This view can be extended in the future to track other xact related
statistics like the number of xacts committed/aborted for subscription
workers.

Author: Masahiko Sawada
Reviewed-by: Greg Nancarrow, Hou Zhijie, Tang Haiying, Vignesh C, Dilip Kumar, Takamichi Osumi, Amit Kapila
Discussion: https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com
2021-11-30 08:54:30 +05:30
Michael Paquier d62bcc8b07 Rework compression options of pg_receivewal
Since cada1af, pg_receivewal has included the option --compress to allow
compression of WAL segments using gzip, with a value of 0 (the default)
meaning that no compression is used.

This commit introduces a new option, called --compression-method, which
accepts the values "none" (the default) and "gzip", to make things more
extensible.  The case of --compress=0 becomes fuzzy with this option
layer, so we chose to have pg_receivewal return an error when using
"none" with a non-zero compression level, meaning that the authorized
values of --compress are now [1,9] instead of [0,9].  Not specifying
--compress with "gzip" as the compression method makes pg_receivewal use
zlib's default instead (Z_DEFAULT_COMPRESSION).

The code in charge of finding the streaming start LSN when scanning the
existing archives is refactored and made more extensible.  While at it,
"compression" is renamed to "compression_level" in walmethods.c, to
reduce confusion now that a compression method exists, even if the tar
method used by pg_basebackup does not rely on the compression method
(yet, at least), but just on the compression level (this area could
actually be improved more).

This is in preparation for an upcoming patch that adds LZ4 support to
pg_receivewal.

Author: Georgios Kokolatos
Reviewed-by: Michael Paquier, Jian Guo, Magnus Hagander, Dilip Kumar,
Robert Haas
Discussion: https://postgr.es/m/ZCm1J5vfyQ2E6dYvXz8si39HQ2gwxSZ3IpYaVgYa3lUwY88SLapx9EEnOf5uEwrddhx2twG7zYKjVeuP5MwZXCNPybtsGouDsAD1o2L_I5E=@pm.me
2021-11-04 11:10:31 +09:00
Amit Kapila 5a2832465f Allow publishing the tables of a schema.
A new option "FOR ALL TABLES IN SCHEMA" in Create/Alter Publication allows
one or more schemas to be specified, whose tables are selected by the
publisher for sending the data to the subscriber.

The new syntax allows specifying both the tables and schemas. For example:
CREATE PUBLICATION pub1 FOR TABLE t1,t2,t3, ALL TABLES IN SCHEMA s1,s2;
or
ALTER PUBLICATION pub1 ADD TABLE t1,t2,t3, ALL TABLES IN SCHEMA s1,s2;

A new system table "pg_publication_namespace" has been added, to maintain
the schemas that the user wants to publish through the publication.
The output plugin (pgoutput) is modified to publish the changes if the
relation is part of a schema publication.

pg_dump is updated to identify and dump schema publications, and the \d
family of commands to display schema publications; the \dRp+ variant
will now display associated schemas, if any.

Author: Vignesh C, Hou Zhijie, Amit Kapila
Syntax-Suggested-by: Tom Lane, Alvaro Herrera
Reviewed-by: Greg Nancarrow, Masahiko Sawada, Hou Zhijie, Amit Kapila, Haiying Tang, Ajin Cherian, Rahila Syed, Bharath Rupireddy, Mark Dilger
Tested-by: Haiying Tang
Discussion: https://www.postgresql.org/message-id/CALDaNm0OANxuJ6RXqwZsM1MSY4s19nuH3734j4a72etDwvBETQ@mail.gmail.com
2021-10-27 07:44:52 +05:30
Michael Paquier b4ada4e19f Add replication command READ_REPLICATION_SLOT
The command is supported for physical slots for now, and returns the
type of slot, its restart_lsn and its restart_tli.

This will be useful for an upcoming patch related to pg_receivewal, to
allow the tool to stream from the position of a slot, rather than from
the last WAL position flushed by the backend (as reported by
IDENTIFY_SYSTEM), if the archive directory is found to be empty.  That
would be an advantage when switching to a different archive location
with the same slot, as it avoids holes in WAL segment archives.
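
A minimal usage sketch, on a replication connection (slot name
hypothetical); the command returns the slot's type, restart_lsn and
restart_tli as described above:

    READ_REPLICATION_SLOT my_physical_slot;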

Author: Ronan Dunklau
Reviewed-by: Kyotaro Horiguchi, Michael Paquier, Bharath Rupireddy
Discussion: https://postgr.es/m/18708360.4lzOvYHigE@aivenronan
2021-10-25 07:40:42 +09:00
Noah Misch fdd965d074 Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURRENTLY.
CIC and REINDEX CONCURRENTLY assume backends see their catalog changes
no later than each backend's next transaction start.  That failed to
hold when a backend absorbed a relevant invalidation in the middle of
running RelationBuildDesc() on the CIC index.  Queries that use the
resulting index can silently fail to find rows.  Fix this for future
index builds by making RelationBuildDesc() loop until it finishes
without accepting a relevant invalidation.  It may be necessary to
reindex to recover from past occurrences; REINDEX CONCURRENTLY suffices.
Back-patch to 9.6 (all supported versions).

Noah Misch and Andrey Borodin, reviewed (in earlier versions) by Andres
Freund.

Discussion: https://postgr.es/m/20210730022548.GA1940096@gust.leadboat.com
2021-10-23 18:36:38 -07:00
Alvaro Herrera ff9f111bce
Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once
they are complete.  This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
  LOG:  invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be.  A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary.  But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.

A fix for this problem was already attempted in commit 515e3d84a0, but
it only addressed the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.

This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts.  With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.

Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.

A new TAP test that exercises this is added, but the portability of it
is yet to be seen.

This has been wrong since the introduction of physical replication, so
backpatch all the way back.  In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
2021-09-29 11:21:51 -03:00
Andres Freund 37a9aa6591 Fix performance regression from session statistics.
Session statistics, as introduced by 960869da08, had several shortcomings:

- an additional GetCurrentTimestamp() call that also impaired the accuracy of
  the data collected

  This can be avoided by passing the current timestamp we already have in
  pgstat_report_stat().

- an additional statistics UDP packet sent every 500ms

  This is solved by adding the new statistics to PgStat_MsgTabstat.
  This is conceptually ugly, because session statistics are not
  table statistics.  But the struct already contains data unrelated
  to tables, so there is not much damage done.

  Connection and disconnection are reported in separate messages, which
  reduces the number of additional messages to two per session, at the
  cost of a slight increase in PgStat_MsgTabstat size (but the same
  number of table stats fit).

- Session time computation could overflow on systems where long is 32 bit.

Reported-By: Andres Freund <andres@anarazel.de>
Author: Andres Freund <andres@anarazel.de>
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Discussion: https://postgr.es/m/20210801205501.nyxzxoelqoo4x2qc%40alap3.anarazel.de
Backpatch: 14-, where the feature was introduced.
2021-09-16 02:05:50 -07:00
Alvaro Herrera 0c6828fa98
Add PublicationTable and PublicationRelInfo structs
These encapsulate a relation when referred from replication DDL.
Currently they don't do anything useful (they're just wrappers around
RangeVar and Relation respectively) but in the future they'll be used to
carry column lists.

Extracted from a larger patch by Rahila Syed.

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAH2L28vddB_NFdRVpuyRBJEBWjz4BSyTB=_ektNRH8NJ1jf95g@mail.gmail.com
2021-09-06 14:24:50 -03:00
Amit Kapila dcac5e7ac1 Refactor sharedfileset.c to separate out fileset implementation.
Move the fileset-related implementation out of sharedfileset.c to allow
its use by backends that don't want to share filesets among different
processes. After this split, the fileset infrastructure is used by both
sharedfileset.c and worker.c for the named temporary files that survive
across transactions.

Author: Dilip Kumar, based on suggestion by Andres Freund
Reviewed-by: Hou Zhijie, Masahiko Sawada, Amit Kapila
Discussion: https://postgr.es/m/E1mCC6U-0004Ik-Fs@gemulon.postgresql.org
2021-08-30 08:48:15 +05:30
Amit Kapila abc0910e2e Add logical change details to logical replication worker errcontext.
Previously, on the subscriber, we set the error context callback for the
tuple data conversion failures. This commit replaces the existing error
context callback with a comprehensive one so that it shows not only the
details of data conversion failures but also the details of logical change
being applied by the apply worker or table sync worker. The additional
information displayed will be the command, transaction id, and timestamp.

The error context is added to an error only when applying a change, not
while doing other work like receiving data.

This will help users in diagnosing the problems that occur during logical
replication. It also can be used for future work that allows skipping a
particular transaction on the subscriber.

Author: Masahiko Sawada
Reviewed-by: Hou Zhijie, Greg Nancarrow, Haiying Tang, Amit Kapila
Tested-by: Haiying Tang
Discussion: https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com
2021-08-27 08:30:23 +05:30
Amit Kapila a8fd13cab0 Add support for prepared transactions to built-in logical replication.
To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data sync
is over.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

The streaming option is not allowed with this new two_phase option.  That
combination can be supported in a separate patch.

We don't allow toggling the two_phase option of a subscription because it
can lead to an inconsistent replica. For the same reason, we don't allow
refreshing the publication once two_phase is enabled for a subscription,
unless the copy_data option is false.
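
A minimal sketch of enabling the option at subscription creation
(connection string, subscription and publication names hypothetical):

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (two_phase = true);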

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi, Greg Nancarrow
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
2021-07-14 07:33:50 +05:30
David Rowley 83f4fcc655 Change the name of the Result Cache node to Memoize
"Result Cache" was never a great name for this node, but nobody managed
to come up with another name that anyone liked enough.  That was until
David Johnston mentioned "Node Memoization", which Tom Lane revised to
just "Memoize".  People seem to like "Memoize", so let's do the rename.

Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/20210708165145.GG1176@momjian.us
Backpatch-through: 14, where Result Cache was introduced
2021-07-14 12:43:58 +12:00
Michael Paquier 9fd85570d1 Refactor SASL code with a generic interface for its mechanisms
The SCRAM and SASL code has been tightly linked together since SCRAM was
added to the core code, making it hard to add new SASL mechanisms, even
though these are by design different facilities, with SCRAM being one
option for SASL.  This refactors the code related to both so that the
backend and the frontend use a set of callbacks for SASL mechanisms,
documenting along the way what is expected of anybody adding a new SASL
mechanism.

The separation between both layers is clean, using two sets of callbacks
for the frontend and the backend to mark the boundary between both
facilities.  The shape of the callbacks is directly inspired by the
routines used by SCRAM, so the code change is straightforward, and the
SASL code is moved into its own set of files.  These will likely change
depending on how and whether new SASL mechanisms get added in the
future.

Author: Jacob Champion
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/3d2a6f5d50e741117d6baf83eb67ebf1a8a35a11.camel@vmware.com
2021-07-07 10:55:15 +09:00
Amit Kapila 8aafb02616 Refactor function parse_subscription_options.
Instead of using multiple parameters in the parse_subscription_options
function signature, use the struct SubOpts, which encapsulates all the
subscription options and their values. It will be useful for future work
where we need to add other subscription options. Also, use bitmaps to
pass the supported options and retrieve the specified ones, much like
the way it is done in commit a3dc926009.

Author: Bharath Rupireddy
Reviewed-By: Peter Smith, Amit Kapila, Alvaro Herrera
Discussion: https://postgr.es/m/CALj2ACXtoQczfNsDQWobypVvHbX2DtgEHn8DawS0eGFwuo72kw@mail.gmail.com
2021-07-06 07:46:50 +05:30
Michael Paquier 4035cd5d4e Add support for LZ4 with compression of full-page writes in WAL
The logic is implemented so that there can be a choice in the compression
used when building a WAL record, and an extra per-record bit is used to
track whether a block is compressed with PGLZ, LZ4 or nothing.

wal_compression, the existing parameter, is changed to an enum with
support for the following backward-compatible values:
- "off", the default, to not use compression.
- "pglz" or "on", to compress FPWs with PGLZ.
- "lz4", the new mode, to compress FPWs with LZ4.

Benchmarking has shown that LZ4 easily outclasses PGLZ.  ZSTD would also
be an interesting choice, but going just with LZ4 for now keeps the
patch minimalistic, as TOAST compression is already able to use LZ4, so
there is no need to worry about any build-related needs for this
implementation.
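
Assuming a server built with LZ4 support, a minimal sketch of switching
full-page-write compression at runtime:

    ALTER SYSTEM SET wal_compression = 'lz4';
    SELECT pg_reload_conf();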

Author: Andrey Borodin, Justin Pryzby
Reviewed-by: Dilip Kumar, Michael Paquier
Discussion: https://postgr.es/m/3037310D-ECB7-4BF1-AF20-01C10BB33A33@yandex-team.ru
2021-06-29 11:17:55 +09:00
Andrew Dunstan e1c1c30f63
Pre branch pgindent / pgperltidy run
Along the way make a slight adjustment to
src/include/utils/queryjumble.h to avoid an unused typedef.
2021-06-28 11:05:54 -04:00
Tom Lane def5b065ff Initial pgindent and pgperltidy run for v14.
Also "make reformat-dat-files".

The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
2021-05-12 13:14:10 -04:00
Thomas Munro c2dc19342e Revert recovery prefetching feature.
This set of commits has some bugs with known fixes, but at this late
stage in the release cycle it seems best to revert and resubmit next
time, along with some new automated test coverage for this whole area.

Commits reverted:

dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery."
1d257577: Optionally prefetch referenced data in recovery.
f003d9f8: Add circular WAL decoding buffer.
323cbe7c: Remove read_page callback from XLogReader.

Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the
corresponding section of config.sgml is now reverted.

Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
2021-05-10 16:06:09 +12:00
Thomas Munro ec48314708 Revert per-index collation version tracking feature.
Design problems were discovered in the handling of composite types and
record types that would cause some relevant versions not to be recorded.
Misgivings were also expressed about the use of the pg_depend catalog
for this purpose.  We're out of time for this release so we'll revert
and try again.

Commits reverted:

1bf946bd: Doc: Document known problem with Windows collation versions.
cf002008: Remove no-longer-relevant test case.
ef387bed: Fix bogus collation-version-recording logic.
0fb0a050: Hide internal error for pg_collation_actual_version(<bad OID>).
ff942057: Suppress "warning: variable 'collcollate' set but not used".
d50e3b1f: Fix assertion in collation version lookup.
f24b1569: Rethink extraction of collation dependencies.
257836a7: Track collation versions for indexes.
cd6f479e: Add pg_depend.refobjversion.
7d1297df: Remove pg_collation.collversion.

Discussion: https://postgr.es/m/CA%2BhUKGLhj5t1fcjqAu8iD9B3ixJtsTNqyCCD4V0aTO9kAKAjjA%40mail.gmail.com
2021-05-07 21:10:11 +12:00
Amit Kapila 3fa17d3771 Use HTAB for replication slot statistics.
Previously, we used an array of size max_replication_slots to store stats
for replication slots. But that had two problems in the cases where a
message for dropping a slot gets lost: 1) the stats for a new slot are
not recorded if the array is full, and 2) we could write beyond the end
of the array if the user reduces max_replication_slots.

This commit uses an HTAB for replication slot statistics, resolving both
problems. Now, pgstat_vacuum_stat() searches for all the dead replication
slots in the stats hashtable and tells the collector to remove them. To
avoid showing the stats for already-dropped slots, the
pg_stat_replication_slots view searches slot stats by the slot name
taken from pg_replication_slots.

Also, we send a message for creating a slot at slot creation, initializing
the stats. This reduces the possibility that the stats are accumulated
into the old slot stats when a message for dropping a slot gets lost.

Reported-by: Andres Freund
Author: Sawada Masahiko, test case by Vignesh C
Reviewed-by: Amit Kapila, Vignesh C, Dilip Kumar
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
2021-04-27 09:09:11 +05:30
Fujii Masao 8ff1c94649 Allow TRUNCATE command to truncate foreign tables.
This commit introduces a new foreign data wrapper API for TRUNCATE.
It extends the TRUNCATE command so that it accepts foreign tables as
targets to truncate and invokes that API. It also extends postgres_fdw
so that it can issue a TRUNCATE command to foreign servers, by adding a
new routine for that TRUNCATE API.

The information about options specified in the TRUNCATE command, e.g.,
ONLY, CASCADE, etc., is passed to the FDW via the API, as is the list of
foreign tables to truncate. The FDW truncates the foreign data sources
that the passed foreign tables specify, based on that information.
For example, postgres_fdw constructs a TRUNCATE command using them
and issues it to the foreign server.

For performance, the TRUNCATE command invokes the FDW routine for
TRUNCATE once per foreign server that the foreign tables to truncate
belong to.
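
With postgres_fdw, this makes a statement like the following work
(remote_logs is a hypothetical postgres_fdw foreign table), with the
CASCADE option forwarded to the remote server:

    TRUNCATE remote_logs CASCADE;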

Author: Kazutaka Onishi, Kohei KaiGai, slightly modified by Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier, Zhihong Yu, Alvaro Herrera, Stephen Frost, Ashutosh Bapat, Amit Langote, Daniel Gustafsson, Ibrar Ahmed, Fujii Masao
Discussion: https://postgr.es/m/CAOP8fzb_gkReLput7OvOK+8NHgw-RKqNv59vem7=524krQTcWA@mail.gmail.com
Discussion: https://postgr.es/m/CAJuF6cMWDDqU-vn_knZgma+2GMaout68YUgn1uyDnexRhqqM5Q@mail.gmail.com
2021-04-08 20:56:08 +09:00
Thomas Munro 1d257577e0 Optionally prefetch referenced data in recovery.
Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.
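
A minimal sketch of enabling the feature as committed (the feature was
later reverted; see above; values illustrative, and whether a reload or
restart is needed depends on the GUC):

    ALTER SYSTEM SET recovery_prefetch = on;
    ALTER SYSTEM SET maintenance_io_concurrency = 20;
    SELECT pg_reload_conf();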

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts)
Reviewed-by: Andres Freund <andres@anarazel.de> (parts)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
2021-04-08 23:20:42 +12:00
Robert Haas ec7ffb8096 amcheck: fix multiple problems with TOAST pointer validation
First, don't perform database access while holding a buffer lock.
When checking a heap, we can validate that TOAST pointers are sane by
performing a scan on the TOAST index and looking up the chunks that
correspond to each value ID that appears in a TOAST pointer in the main
table. But, to do that while holding a buffer lock at least risks
causing other backends to wait uninterruptibly, and probably can cause
undetected and uninterruptible deadlocks.  So, instead, make a list of
checks to perform while holding the lock, and then perform the checks
after releasing it.

Second, adjust things so that we don't try to follow TOAST pointers
for tuples that are already eligible to be pruned. The TOAST tuples
become eligible for pruning at the same time that the main tuple does,
so trying to check them may lead to spurious reports of corruption,
as observed in the buildfarm. The necessary infrastructure to decide
whether or not the tuple being checked is prunable was added by
commit 3b6c1259f9, but it wasn't
actually used for its intended purpose prior to this patch.

Mark Dilger, adjusted by me to avoid a memory leak.

Discussion: http://postgr.es/m/AC5479E4-6321-473D-AC92-5EC36299FBC2@enterprisedb.com
2021-04-07 13:39:12 -04:00
Peter Geoghegan 8523492d4e Remove tupgone special case from vacuumlazy.c.
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows.  Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step.  This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria.  The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself.  Now there are no special cases.  And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts.  There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed.  This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases.  The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains).  Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 08:49:22 -07:00
Michael Paquier e6bdfd9700 Refactor HMAC implementations
Similarly to the cryptohash implementations, this refactors the existing
HMAC code into a single set of APIs that can be plugged with any crypto
libraries PostgreSQL is built with (only OpenSSL currently).  If there
is no such library, a fallback implementation is available.  Those new
APIs are designed similarly to the existing cryptohash layer, so there
is no real new design here, with the same logic around buffer bound
checks and memory handling.

HMAC has a dependency on cryptohashes, so all the cryptohash types
supported by cryptohash{_openssl}.c can be used with HMAC.  This
refactoring is an advantage mainly for SCRAM, which included its own
implementation of HMAC with SHA-256 without relying on the existing
crypto libraries even if PostgreSQL was built with their support.

This code has been tested on Windows and Linux, with and without
OpenSSL, across all the versions supported on HEAD from 1.1.1 down to
1.0.1.  I have also checked that the implementations are working fine
using some sample results, a custom extension of my own, and doing
cross-checks across different major versions with SCRAM with the client
and the backend.

Author: Michael Paquier
Reviewed-by: Bruce Momjian
Discussion: https://postgr.es/m/X9m0nkEJEzIPXjeZ@paquier.xyz
2021-04-03 17:30:49 +09:00
Amit Kapila 26acb54a13 Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."
To allow inserts in parallel mode, this feature has to ensure that all the
constraints, triggers, etc. are parallel-safe for the partition hierarchy,
which is costly, and we need to find a better way to do that. Additionally,
we could have used existing cached information in some cases, like indexes,
domains, etc., to determine parallel-safety.

List of commits reverted, in reverse chronological order:

ed62d3737c Doc: Update description for parallel insert reloption.
c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode.
c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f.
e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f.
e4e87a32cc Fix valgrind issue in commit 05c8482f7f.
05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...".

Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
2021-03-24 11:29:15 +05:30
Amit Kapila c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode.
Commit 05c8482f7f added the implementation of parallel SELECT for
"INSERT INTO ... SELECT ..." which may incur non-negligible overhead in
the additional parallel-safety checks that it performs, even when, in the
end, those checks determine that parallelism can't be used. This is
normally only a problem when the target table has a large number of
partitions.

A new GUC option "enable_parallel_insert" is added, to allow inserts in
parallel mode. The default is on.

In addition to the GUC option, the user may want a mechanism to allow
inserts in parallel mode with finer granularity at the table level. The
new table option "parallel_insert_enabled" allows this. The default is
true.
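
As committed (and later reverted; see above), usage would have looked
like this (table name hypothetical):

    SET enable_parallel_insert = off;              -- session-level GUC
    ALTER TABLE measurements
        SET (parallel_insert_enabled = false);     -- per-table reloption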

Author: "Hou, Zhijie"
Reviewed-by: Greg Nancarrow, Amit Langote, Takayuki Tsunakawa, Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1K-cW7svLC2D7DHoGHxdAdg3P37BLgebqBOC2ZLc9a6QQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com
2021-03-18 07:25:27 +05:30
Alvaro Herrera acb7e4eb6b
Implement pipeline mode in libpq
Pipeline mode in libpq lets an application avoid the Sync messages in
the FE/BE protocol that are implicit in the old libpq API after each
query.  The application can then insert Sync at its leisure with a new
libpq function PQpipelineSync.  This can lead to substantial reductions
in query latency.

Co-authored-by: Craig Ringer <craig.ringer@enterprisedb.com>
Co-authored-by: Matthieu Garrigues <matthieu.garrigues@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aya Iwata <iwata.aya@jp.fujitsu.com>
Reviewed-by: Daniel Vérité <daniel@manitou-mail.org>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kirk Jamison <k.jamison@fujitsu.com>
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Nikhil Sontakke <nikhils@2ndquadrant.com>
Reviewed-by: Vaishnavi Prabakaran <VaishnaviP@fast.au.fujitsu.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>

Discussion: https://postgr.es/m/CAMsr+YFUjJytRyV4J-16bEoiZyH=4nj+sQ7JP9ajwz=B4dMMZw@mail.gmail.com
Discussion: https://postgr.es/m/CAJkzx4T5E-2cQe3dtv2R78dYFvz+in8PY7A8MArvLhs_pg75gg@mail.gmail.com
2021-03-15 18:13:42 -03:00
Fujii Masao d75288fb27 Make archiver process an auxiliary process.
This commit changes the WAL archiver process so that it's treated as
an auxiliary process and can use shared memory. This is an infrastructure
patch required for the upcoming shared-memory based stats collector patch
series, which basically needs every process that reports statistics,
including the archiver, to have access to shared memory. Since this patch
is useful on its own, both to simplify the code and to let users monitor
the status of the archiver, it's committed separately in advance.

This commit simplifies the code for WAL archiving. For example, previously
backends needed to signal the archiver via the postmaster to tell it that
there are WAL files to archive. This commit removes that signal to the
postmaster and enables backends to notify the archiver directly using a
shared latch.

Also, as a side effect of this change, information about the archiver
process becomes visible in the pg_stat_activity view.
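
A minimal sketch of checking on the archiver (the backend_type value
'archiver' is assumed here):

    SELECT pid, backend_type, wait_event
      FROM pg_stat_activity
     WHERE backend_type = 'archiver';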

Author: Kyotaro Horiguchi
Reviewed-by: Andres Freund, Álvaro Herrera, Julien Rouhaud, Tomas Vondra, Arthur Zakirov, Fujii Masao
Discussion: https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp
2021-03-15 13:13:14 +09:00
Robert Haas f71519e545 Refactor and generalize the ParallelSlot machinery.
Create a wrapper object, ParallelSlotArray, to encapsulate the
number of slots and the slot array itself, plus some other relevant
bits of information. This reduces the number of parameters we have
to pass around all over the place.

Allow for a ParallelSlotArray to contain slots connected to
different databases within a single cluster. The current clients
of this mechanism don't need this, but it is expected to be used
by future patches.

Defer connecting to databases until we actually need the connection
for something. This is a slight behavior change for vacuumdb and
reindexdb. If you specify a number of jobs that is larger than the
number of objects, the extra connections will now not be used.
But, on the other hand, if you specify a number of jobs that is
so large that it's going to fail, the failure would previously have
happened before any operations were actually started, and now it
won't.

Mark Dilger, reviewed by me.

Discussion: http://postgr.es/m/12ED3DA8-25F0-4B68-937D-D907CFBF08E7@enterprisedb.com
Discussion: http://postgr.es/m/BA592F2D-F928-46FF-9516-2B827F067F57@enterprisedb.com
2021-03-11 13:17:46 -05:00
Thomas Munro d87251048a Replace buffer I/O locks with condition variables.
1.  Backends waiting for buffer I/O are now interruptible.

2.  If something goes wrong in a backend that is currently performing
I/O, waiting backends no longer wake up until that backend reaches
AbortBufferIO() and broadcasts on the CV.  Previously, any waiters would
wake up (because the I/O lock was automatically released) and then
busy-loop until AbortBufferIO() cleared BM_IO_IN_PROGRESS.

3.  LWLockMinimallyPadded is removed, as it would now be unused.

Author: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier version, 2016)
Discussion: https://postgr.es/m/CA%2BhUKGJ8nBFrjLuCTuqKN0pd2PQOwj9b_jnsiGFFMDvUxahj_A%40mail.gmail.com
Discussion: https://postgr.es/m/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com
2021-03-11 10:36:17 +13:00
Thomas Munro 44bf3d5083 Add missing pthread_barrier_t.
Supply a simple implementation of the missing pthread_barrier_t type and
functions, for macOS.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
2021-03-10 17:44:04 +13:00
Thomas Munro 547f04e734 pgbench: Improve time logic.
Instead of instr_time (struct timespec) and the INSTR_XXX macros,
introduce pg_time_usec_t and use integer arithmetic.  Don't include the
connection time in TPS unless using -C mode, but report it separately.

Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
2021-03-10 17:44:04 +13:00
Thomas Munro 0a1f1d3cac Add sort_template.h for making sort functions.
Move our qsort implementation into a header that can be used to define
specialized functions for better performance and reduced duplication.

Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CA%2BhUKGJ2-eaDqAum5bxhpMNhvuJmRDZxB_Tow0n-gse%2BHG0Yig%40mail.gmail.com
2021-03-03 17:02:22 +13:00
Amit Kapila ce0fdbfe97 Allow multiple xacts during table sync in logical replication.
For the initial table data synchronization in logical replication, we use
a single transaction to copy the entire table and then synchronize the
position in the stream with the main apply worker.

There are multiple downsides of this approach: (a) We have to perform the
entire copy operation again if there is any error (network breakdown,
error in the database operation, etc.) while we synchronize the WAL
position between tablesync worker and apply worker; this will be onerous
especially for large copies, (b) Using a single transaction in the
synchronization-phase (where we can receive WAL from multiple
transactions) will have the risk of exceeding the CID limit, (c) The slot
will hold the WAL till the entire sync is complete because we never commit
till the end.

This patch solves all the above downsides by allowing multiple
transactions during the tablesync phase. The initial copy is done in a
single transaction and after that, we commit each transaction as we
receive it. To allow recovery after any error or crash, we use a permanent
slot and origin to track the progress. The slot and origin will be removed
once we finish the synchronization of the table. We also remove the slot
and origin of tablesync workers if the user performs DROP SUBSCRIPTION ..
or ALTER SUBSCRIPTION .. REFRESH and some of the table syncs are still not
finished.

The commands ALTER SUBSCRIPTION ... REFRESH PUBLICATION and
ALTER SUBSCRIPTION ... SET PUBLICATION ... with the refresh option set to
true cannot be executed inside a transaction block, because they can now
drop slots, and we have no provision to roll that back.
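
For example, this pattern is now rejected (illustrative subscription name):

  BEGIN;
  -- fails: may drop tablesync slots, which cannot be rolled back
  ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;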

This will also open up the path for logical replication of 2PC
transactions on the subscriber side. Previously, we could not do that
because of the requirement of maintaining a single transaction in
tablesync workers.

Bump catalog version due to change of state in the catalog
(pg_subscription_rel).

Author: Peter Smith, Amit Kapila, and Takamichi Osumi
Reviewed-by: Ajin Cherian, Petr Jelinek, Hou Zhijie and Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com
2021-02-12 07:41:51 +05:30
Alexander Korotkov 676887a3b0 Implementation of subscripting for jsonb
Subscripting for jsonb does not support slices, does not have a limit for
the number of subscripts, and an assignment expects the replacement value
to be of jsonb type.  There is also one functional difference between
assignment via subscripting and assignment via jsonb_set().  When the
original jsonb container is NULL, subscripting replaces it with an empty
jsonb and proceeds with the assignment.
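
A minimal sketch of the new syntax (table and column names are
illustrative):

  -- fetch a value by key
  SELECT ('{"a": {"b": 1}}'::jsonb)['a']['b'];

  -- assignment; a NULL container is first replaced with an empty jsonb
  UPDATE test_jsonb SET payload['x'] = '42' WHERE id = 1;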

For the sake of code reuse, we rearrange some parts of jsonb functionality
so that the same functions can be used for jsonb_set and for the
subscripting assignment operation.

The original idea belongs to Oleg Bartunov.

Catversion is bumped.

Discussion: https://postgr.es/m/CA%2Bq6zcV8qvGcDXurwwgUbwACV86Th7G80pnubg42e-p9gsSf%3Dg%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2Bq6zcX3mdxGCgdThzuySwH-ApyHHM-G4oB1R0fn0j2hZqqkLQ%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2Bq6zcVDuGBv%3DM0FqBYX8DPebS3F_0KQ6OVFobGJPM507_SZ_w%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2Bq6zcVovR%2BXY4mfk-7oNk-rF91gH0PebnNfuUjuuDsyHjOcVA%40mail.gmail.com
Author: Dmitry Dolgov
Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule, Dian M Fay
Reviewed-by: Andrew Dunstan, Chapman Flack, Merlin Moncure, Peter Geoghegan
Reviewed-by: Alvaro Herrera, Jim Nasby, Josh Berkus, Victor Wagner
Reviewed-by: Aleksander Alekseev, Robert Haas, Oleg Bartunov
2021-01-31 23:50:40 +03:00
Michael Paquier a8ed6bb8f4 Introduce SHA1 implementations in the cryptohash infrastructure
With this commit, SHA1 goes through the implementation provided by
OpenSSL via EVP when building the backend with it, and uses as fallback
implementation KAME which was located in pgcrypto and already shaped for
an integration with a set of init, update and final routines.
Structures and routines have been renamed to make things consistent with
the fallback implementations of MD5 and SHA2.

For ages, uuid-ossp has used a shortcut with pgcrypto to fetch a copy of
SHA1 if needed.  This was built depending on the build options within
./configure, so this cleans up some code and removes the build
dependency between pgcrypto and uuid-ossp.

Note that this will help with the refactoring of HMAC, as pgcrypto
offers the option to use MD5, SHA1 or SHA2, so only the second option
was missing to make that possible.

Author: Michael Paquier
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/X9HXKTgrvJvYO7Oh@paquier.xyz
2021-01-23 11:33:04 +09:00
Michael Paquier a3dc926009 Refactor option handling of CLUSTER, REINDEX and VACUUM
This continues the work done in b5913f6.  All the options of those
commands are changed to use hex values rather than enums to reduce the
risk of compatibility bugs when introducing new options.  Each option
set is moved into a new structure that can be extended with more
non-boolean options (this was already the case for VACUUM).  The code of
REINDEX is restructured so that manual REINDEX commands go through a
single routine from utility.c, like VACUUM, to ease the allocation
handling of option parameters when a command needs to go through
multiple transactions.

This can be used as a base infrastructure for future patches related to
those commands, including reindex filtering and tablespace support.

Per discussion with people mentioned below, as well as Alvaro Herrera
and Peter Eisentraut.

Author: Michael Paquier, Justin Pryzby
Reviewed-by: Alexey Kondratov, Justin Pryzby
Discussion: https://postgr.es/m/X8riynBLwxAD9uKk@paquier.xyz
2021-01-18 14:03:10 +09:00
Alvaro Herrera c6c4b37395
Invent struct ReindexIndexInfo
This struct is used by ReindexRelationConcurrently to keep track of the
relations to process.  This saves having to obtain some data repeatedly,
and has future uses as well.

Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Hamid Akhtar <hamid.akhtar@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/20201130195439.GA24598@alvherre.pgsql
2021-01-12 17:05:06 -03:00
Michael Paquier 55fe26a4b5 Fix allocation logic of cryptohash context data with OpenSSL
The allocation of the cryptohash context data when building with OpenSSL
was happening in the memory context of the caller of
pg_cryptohash_create(), which could lead to issues with resowner cleanup
if cascading resources are cleaned up on an error.  Like other
facilities using resowners, move the base allocation to TopMemoryContext
to ensure a correct cleanup on failure.

The resulting code gets simpler with this commit, as the context data is
now held behind a single opaque pointer, and there is only one
allocation done in TopMemoryContext.

After discussion, also change the cryptohash subroutines to return an
error if the caller provides NULL for the context data to ease error
detection on OOM.

Author: Heikki Linnakangas
Discussion: https://postgr.es/m/X9xbuEoiU3dlImfa@paquier.xyz
2021-01-07 10:21:02 +09:00
Amit Kapila 0aa8a01d04 Extend the output plugin API to allow decoding of prepared xacts.
This adds six methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* begin_prepare
* filter_prepare
* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit simply adds these new APIs and the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use these APIs.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
2020-12-30 16:17:26 +05:30
Alexander Korotkov 6df7a9698b Multirange datatypes
Multiranges are basically sorted arrays of non-overlapping ranges with
set-theoretic operations defined over them.

Since v14, each range type automatically gets a corresponding multirange
datatype.  There are both manual and automatic mechanisms for naming multirange
types.  One can specify the multirange type name using the
multirange_type_name attribute in CREATE TYPE.  Otherwise, a multirange
type name is generated automatically: if the range type name contains
"range" then we change that to "multirange"; otherwise, we add
"_multirange" to the end.

The implementation of multiranges comes with a space-efficient internal
representation format, which avoids extra padding and duplicated storage
of oids.  Altogether this format allows fetching a particular range by
its index in O(n).
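
A minimal sketch of the automatically generated type and its literal
format (the explicit type name is illustrative):

  -- int4range gets a corresponding int4multirange
  SELECT '{[1,3), [5,7)}'::int4multirange;

  -- explicit naming via the multirange_type_name attribute
  CREATE TYPE textrange AS RANGE (
    subtype = text,
    multirange_type_name = textmultirange
  );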

Statistics gathering and selectivity estimation are implemented for
multiranges.  For this purpose, a stored multirange is approximated as a
union range without gaps.  This area will likely need improvement in the
future.

Catversion is bumped.

Discussion: https://postgr.es/m/CALNJ-vSUpQ_Y%3DjXvTxt1VYFztaBSsWVXeF1y6gTYQ4bOiWDLgQ%40mail.gmail.com
Discussion: https://postgr.es/m/a0b8026459d1e6167933be2104a6174e7d40d0ab.camel%40j-davis.com#fe7218c83b08068bfffb0c5293eceda0
Author: Paul Jungwirth, revised by me
Reviewed-by: David Fetter, Corey Huinker, Jeff Davis, Pavel Stehule
Reviewed-by: Alvaro Herrera, Tom Lane, Isaac Morland, David G. Johnston
Reviewed-by: Zhihong Yu, Alexander Korotkov
2020-12-20 07:20:33 +03:00
Michael Paquier b67b57a966 Refactor MD5 implementations according to new cryptohash infrastructure
This commit heavily reorganizes the MD5 implementations that exist in
the tree in various aspects.

First, MD5 is added to the list of options available in cryptohash.c and
cryptohash_openssl.c.  This means that if building with OpenSSL, EVP is
used for MD5 instead of the fallback implementation that Postgres had
for ages.  With the recent refactoring work for cryptohash functions,
this change is straight-forward.  If not building with OpenSSL, a
fallback implementation internal to src/common/ is used.

Second, this reduces the number of MD5 implementations present in the
tree from two to one, by moving the KAME implementation from pgcrypto to
src/common/, and by removing the implementation that existed in
src/common/.  KAME was already structured with an init/update/final set
of routines by pgcrypto (see original pgcrypto/md5.h) for compatibility
with OpenSSL, so moving it to src/common/ has proved to be a
straight-forward move, requiring no actual manipulation of the internals
of each routine.  Some benchmarking has not shown any performance gap
between both implementations.

Similarly to the fallback implementation used for SHA2, the fallback
implementation of MD5 is moved to src/common/md5.c with an internal
header called md5_int.h for the init, update and final routines.  This
gets then consumed by cryptohash.c.

The original routines used for MD5-hashed passwords are moved to a
separate file called md5_common.c, also in src/common/, aimed at being
shared between all MD5 implementations as utility routines to keep
compatibility with any code relying on them.

Like the SHA2 changes, this commit had its round of tests on both Linux
and Windows, across all versions of OpenSSL supported on HEAD, with and
even without OpenSSL.

Author: Michael Paquier
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20201106073434.GA4961@paquier.xyz
2020-12-10 11:59:10 +09:00
Michael Paquier 4f48a6fbe2 Change SHA2 implementation based on OpenSSL to use EVP digest routines
The use of low-level hash routines has not been recommended by upstream
OpenSSL since 2000, and pgcrypto already switched to EVP as of 5ff4a67.
This takes advantage of the refactoring done in 87ae969 that has
introduced the allocation and free routines for cryptographic hashes.

Since 1.1.0, OpenSSL does not publish the contents of the cryptohash
contexts, forcing any consumers to rely on OpenSSL for all allocations.
Hence, the resource owner callback mechanism gains a new set of routines
to track and free cryptohash contexts when using OpenSSL, preventing any
risks of leaks in the backend.  Nothing is needed in the frontend thanks
to the refactoring of 87ae969, and the resowner knowledge is isolated
into cryptohash_openssl.c.

Note that this also fixes a failure with SCRAM authentication when using
FIPS in OpenSSL, but as there have been few complaints about this
problem and as this causes an ABI breakage, no backpatch is done.

Author: Michael Paquier
Reviewed-by: Daniel Gustafsson, Heikki Linnakangas
Discussion: https://postgr.es/m/20200924025314.GE7405@paquier.xyz
Discussion: https://postgr.es/m/20180911030250.GA27115@paquier.xyz
2020-12-04 10:49:23 +09:00
Michael Paquier 87ae9691d2 Move SHA2 routines to a new generic API layer for crypto hashes
Two new routines, to allocate a hash context and to free it, are created,
as these become necessary for the goal behind this refactoring: switch
all the cryptohash implementations for OpenSSL to use EVP (for FIPS, and
also because upstream has not recommended the use of low-level cryptohash
functions for 20 years).  Note that OpenSSL hides the internals of
cryptohash contexts since 1.1.0, so it is necessary to leave the
allocation to OpenSSL itself, explaining the need for those two new
routines.  This part is going to require more work to properly track
hash contexts with resource owners, but that is not introduced here.
Still, this refactoring makes the move possible.

This reduces the number of routines for all SHA2 implementations from
twelve (SHA{224,256,384,512} with init, update and final calls) to five
(create, free, init, update and final calls) by incorporating the hash
type directly into the hash context data.

The new cryptohash routines are moved to a new file, called cryptohash.c
for the fallback implementations, with SHA2 specifics becoming a part
internal to src/common/.  OpenSSL specifics are part of
cryptohash_openssl.c.  This infrastructure is usable for more hash
types, like MD5 or HMAC.

Any code paths using the internal SHA2 routines are adapted to report
errors correctly, which accounts for most of the changes of this commit.
The areas mostly impacted are checksum manifests, libpq and SCRAM.

Note that e21cbb4 was a first attempt to switch SHA2 to EVP, but it
lacked the refactoring needed for libpq, as done here.

This patch has been tested on Linux and Windows, with and without
OpenSSL, and down to 1.0.1, the oldest version supported on HEAD.

Author: Michael Paquier
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20200924025314.GE7405@paquier.xyz
2020-12-02 10:37:20 +09:00
Fujii Masao 9fbc3f318d pg_stat_statements: Track number of times pgss entries were deallocated.
If more distinct statements than pg_stat_statements.max are observed,
pg_stat_statements entries about the least-executed statements are
deallocated. This commit enables us to track the total number of times
those entries were deallocated. That number can be viewed in the
pg_stat_statements_info view that this commit adds. It's useful when
tuning the pg_stat_statements.max parameter: if the deallocation count is
high, i.e., the entries are deallocated very frequently, that might cause
a performance regression, and pg_stat_statements.max can be increased to
avoid those frequent deallocations.

The pg_stat_statements_info view is intended to display the statistics
of pg_stat_statements module itself. Currently it has only one column
"dealloc" indicating the number of times entries were deallocated.
But an upcoming patch will add other columns (for example, the time
at which pg_stat_statements statistics were last reset) into the view.
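
For example:

  -- number of times entries were deallocated since the last reset
  SELECT dealloc FROM pg_stat_statements_info;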

Author: Katsuragi Yuta, Yuki Seino
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/0d9f1107772cf5c3f954e985464c7298@oss.nttdata.com
2020-11-26 21:18:05 +09:00
Heikki Linnakangas 0a2bc5d61e Move per-agg and per-trans duplicate finding to the planner.
This has the advantage that the cost estimates for aggregates can count
the number of calls to transition and final functions correctly.

Bump catalog version, because views can contain Aggrefs.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/b2e3536b-1dbc-8303-c97e-89cb0b4a9a48%40iki.fi
2020-11-24 10:45:00 +02:00
Amit Kapila 9653f24ad8 Fix 'skip-empty-xacts' option in test_decoding for streaming mode.
In streaming mode, the transaction can be decoded in multiple streams and
those streams can be interleaved with streams of other transactions. So,
we can't remember the transaction's write status in the logical decoding
context, because that might get changed by some other transaction and
lead to wrong answers for the 'skip-empty-xacts' option. We decided to keep
each transaction's write status in the ReorderBufferTxn to avoid
interleaved streams changing the status of some unrelated transactions.

Diagnosed-by: Amit Kapila
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1LR7=XNM_TLmpZMFuV8ZQpoxkem--NZJYf8YXmesbvwLA@mail.gmail.com
2020-11-17 12:14:53 +05:30
Alexander Korotkov 8afa3795a7 Fix alphabetic ordering in typedefs.list 2020-11-15 08:56:31 +03:00
Andres Freund 6c57f2ed16 jit: Add support for LLVM 12.
LLVM 12, to be released in a few months, made some breaking changes to
the Orc JIT interface. OrcV2 eventually will make it easier to support
features like concurrent JIT compilation, but this commit only makes it
possible to compile against LLVM 12.

This commit is a bit bigger than desirable. That is partly because
the V2 interface is more granular than the V1 interface, but also because
I chose to make some minor changes to < LLVM 12 code to keep the code
somewhat readable.

The LLVM 12 support will need to be backpatched. I plan to do so after
the patch stewed on the buildfarm for a few days.

Author: Andres Freund
Discussion: https://postgr.es/m/20201016011244.pmyvr3ee2gbzplq4@alap3.anarazel.de
2020-11-09 20:01:33 -08:00
Heikki Linnakangas 37d2ff3803 pg_rewind: Refactor the abstraction to fetch from local/libpq source.
This makes the abstraction of a "source" server more clear, by introducing
a common abstract class, borrowing the object-oriented programming term,
that represents all the operations that can be done on the source server.
There are two implementations of it, one for fetching via libpq, and
another to fetch from a local directory. This adds some code, but makes it
easier to understand what's going on.

The copy_executeFileMap() and libpq_executeFileMap() functions contained
basically the same logic, just calling different functions to fetch the
source files. Refactor so that the common logic is in one place, in a new
function called perform_rewind().

Reviewed-by: Kyotaro Horiguchi, Soumyadeep Chakraborty
Discussion: https://www.postgresql.org/message-id/0c5b3783-af52-3ee5-f8fa-6e794061f70d%40iki.fi
2020-11-04 11:21:18 +02:00
Michael Paquier e152506ade Revert pg_relation_check_pages()
This reverts the following set of commits, following complaints about
the lack of portability of the central part of the code in bufmgr.c as
well as the use of partition mapping locks during page reads:
c780a7a9
f2b88396
b787d4ce
ce7f772c
60a51c6b

Per discussion with Andres Freund, Robert Haas and myself.

Bump catalog version.

Discussion: https://postgr.es/m/20201029181729.2nrub47u7yqncsv7@alap3.anarazel.de
2020-11-04 10:21:46 +09:00
Thomas Munro 257836a755 Track collation versions for indexes.
Record the current version of dependent collations in pg_depend when
creating or rebuilding an index.  When accessing the index later, warn
that the index may be corrupted if the current version doesn't match.

Thanks to Douglas Doole, Peter Eisentraut, Christoph Berg, Laurenz Albe,
Michael Paquier, Robert Haas, Tom Lane and others for very helpful
discussion.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com> (earlier versions)
Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
2020-11-03 01:19:50 +13:00
Michael Paquier f2b8839695 Add pg_relation_check_pages() to check on-disk pages of a relation
This makes use of CheckBuffer() introduced in c780a7a, adding a SQL
wrapper able to do checks for all the pages of a relation.  By default,
all the fork types of a relation are checked, and it is possible to
check only a given relation fork.  Note that if the relation given as
input has no physical storage or is temporary, then no errors are
generated, allowing full-database checks when coupled with a simple scan
of pg_class for example.  This is not limited to clusters with data
checksums enabled, as clusters without data checksums can still apply
checks on pages using the page headers or for the case of a page full of
zeros.

This function returns a set of tuples consisting of:
- The physical file where a broken page has been detected (without the
segment number as that can be AM-dependent, which can be guessed from
the block number for heap).  A relative path from PGPATH is used.
- The block number of the broken page.
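
A minimal sketch of an invocation (the table name is illustrative; output
columns follow the description above):

  -- checks all forks by default; returns no rows if no page is broken
  SELECT * FROM pg_relation_check_pages('my_table'::regclass);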

By default, only superusers have an access to this function but
execution rights can be granted to other users.

The feature introduced here is still minimal, and more improvements
could be done, like:
- Addition of a start and end block number to run checks on a range
of blocks, which would apply only if one fork type is checked.
- Addition of some progress reporting.
- Throttling, with configuration parameters in function input or
potentially some cost-based GUCs.

Regression tests are added for positive cases in the main regression
test suite, and TAP tests are added for cases involving the emulation of
page corruptions.

Bump catalog version.

Author: Julien Rouhaud, Michael Paquier
Reviewed-by: Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
2020-10-28 12:15:00 +09:00
Robert Haas 866e24d47d Extend amcheck to check heap pages.
Mark Dilger, reviewed by Peter Geoghegan, Andres Freund, Álvaro Herrera,
Michael Paquier, Amul Sul, and by me. Some last-minute cosmetic
revisions by me.

Discussion: http://postgr.es/m/12ED3DA8-25F0-4B68-937D-D907CFBF08E7@enterprisedb.com
2020-10-22 08:44:18 -04:00
Michael Paquier 80f8eb79e2 Use perfect hash for NFC and NFKC Unicode Normalization quick check
This makes the normalization quick check about 30% faster for NFC and
50% faster for NFKC than the binary search used previously.  The hash
lookup reuses the existing array of bit fields used for the binary
search to get the quick check property and is generated as part of "make
update-unicode" in src/common/unicode/.

Author: John Naylor
Reviewed-by: Mark Dilger, Michael Paquier
Discussion: https://postgr.es/m/CACPNZCt4fbJ0_bGrN5QPt34N4whv=mszM0LMVQdoa2rC9UMRXA@mail.gmail.com
2020-10-11 19:09:01 +09:00
Amit Kapila 9868167500 Track statistics for spilling of changes from ReorderBuffer.
This adds the statistics about transactions spilled to disk from
ReorderBuffer. Users can query the pg_stat_replication_slots view to check
these stats and call pg_stat_reset_replication_slot to reset the stats of
a particular slot. Users can pass NULL in pg_stat_reset_replication_slot
to reset stats of all the slots.
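
For example:

  -- spill statistics, one row per replication slot
  SELECT * FROM pg_stat_replication_slots;

  -- reset the stats of one slot, or of all slots when passing NULL
  SELECT pg_stat_reset_replication_slot(NULL);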

This commit extends the statistics collector to track this information
about slots.

Author: Sawada Masahiko and Amit Kapila
Reviewed-by: Amit Kapila and Dilip Kumar
Discussion: https://postgr.es/m/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com
2020-10-08 09:09:08 +05:30
Fujii Masao 8d9a935965 Add pg_stat_wal statistics view.
This view shows the statistics about WAL activity. Currently it has only
two columns: wal_buffers_full and stats_reset. The wal_buffers_full column
indicates the number of times WAL data was written to disk because the
WAL buffers got full. This information is useful when tuning wal_buffers.
stats_reset column indicates the time at which these statistics were
last reset.
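
For example:

  SELECT wal_buffers_full, stats_reset FROM pg_stat_wal;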

The pg_stat_wal view is also the basic infrastructure to expose various
other statistics about WAL activity later.

Bump PGSTAT_FILE_FORMAT_ID due to the change in pgstat format.

Bump catalog version.

Author: Masahiro Ikeda
Reviewed-by: Takayuki Tsunakawa, Kyotaro Horiguchi, Amit Kapila, Fujii Masao
Discussion: https://postgr.es/m/188bd3f2d2233cf97753b5ced02bb050@oss.nttdata.com
2020-10-02 10:17:11 +09:00
Thomas Munro dee663f784 Defer flushing of SLRU files.
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This can cause a significant improvement to
recovery performance, especially when it's otherwise CPU-bound.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Rearrange things so that data collected in CheckpointStats
includes SLRU activity.

Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.  (I'm not sure if they were ever needed, but
they aren't now.)

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
2020-09-25 19:00:15 +12:00
Robert Haas 34a947ca13 New contrib module, pg_surgery, with heap surgery functions.
Sometimes it happens that the visibility information for a tuple
becomes corrupted, either due to bugs in the database software or
external factors. Provide a function heap_force_kill() that can
be used to truncate such dead tuples to dead line pointers, and
a function heap_force_freeze() that can be used to overwrite the
visibility information in such a way that the tuple becomes
all-visible.
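
A minimal sketch of their use (the regclass/tid[] signature and the table
name are assumptions for illustration):

  CREATE EXTENSION pg_surgery;
  -- truncate two damaged tuples to dead line pointers
  SELECT heap_force_kill('my_table'::regclass,
                         ARRAY['(0,1)', '(0,2)']::tid[]);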

These functions are unsafe, in that you can easily use them to
corrupt a database that was not previously corrupted, and you can
use them to further corrupt an already-corrupted database or to
destroy data. The documentation accordingly cautions against
casual use. However, in some cases they permit recovery of data
that would otherwise be very difficult to recover, or allow a
system to continue to function when it would otherwise be difficult
to do so.

Because we may want to add other functions for performing other
kinds of surgery in the future, the new contrib module is called
pg_surgery rather than something specific to these functions. I
proposed back-patching this so that it could be more easily used
by people running existing releases who are facing these kinds of
problems, but that proposal did not attract enough support, so
no back-patch for now.

Ashutosh Sharma, reviewed and tested by Andrey M. Borodin,
M. Beena Emerson, Masahiko Sawada, Rajkumar Raghuwanshi,
Asim Praveen, and Mark Dilger, and somewhat revised by me.

Discussion: http://postgr.es/m/CA+TgmoZW1fsU-QUNCRUQMGUygBDPVeOTLCqRdQZch=EYZnctSA@mail.gmail.com
2020-09-10 11:14:07 -04:00
Michael Paquier a6642b3ae0 Add support for partitioned tables and indexes in REINDEX
Until now, REINDEX was not able to work with partitioned tables and
indexes, forcing users to reindex partitions one by one.  This extends
REINDEX INDEX and REINDEX TABLE so that they can accept a partitioned
index or table as input, respectively, reindexing all the partitions
assigned to them that have physical storage (foreign tables, partitioned
tables and indexes are skipped).
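
For example (illustrative partitioned table name):

  -- now reindexes every partition with physical storage
  REINDEX TABLE measurement;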

This shares some logic with schema and database REINDEX, as each
partition gets processed in its own transaction after building a list of
relations to work on.  This choice has the advantage of minimizing the
number of invalid indexes to one partition with REINDEX CONCURRENTLY in
the event of a cancellation or failure in-flight, as the only indexes
handled at once in a single REINDEX CONCURRENTLY loop are the ones from
the partition being worked on.

Isolation tests are added to emulate some cases I bumped into while
developing this feature, particularly the concurrent drop of a leaf
partition being reindexed.  However, this is rather limited, as LOCK
would cause REINDEX to block in the first transaction building the list
of partitions.

Per its multi-transaction nature, this new flavor cannot run in a
transaction block, similarly to REINDEX SCHEMA, SYSTEM and DATABASE.

Author: Justin Pryzby, Michael Paquier
Reviewed-by: Anastasia Lubennikova
Discussion: https://postgr.es/m/db12e897-73ff-467e-94cb-4af03705435f.adger.lj@alibaba-inc.com
2020-09-08 10:09:22 +09:00
Thomas Munro 861c6e7c8e Skip unnecessary stat() calls in walkdir().
Some kernels can tell us the type of a "dirent", so we can avoid a call
to stat() or lstat() in many cases.  Define a new function
get_dirent_type() to contain that logic, for use by the backend and
frontend versions of walkdir(), and perhaps other callers in future.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BFzxupGGN4GpUdbzZN%2Btn6FQPHo8w0Q%2BAPH5Wz8RG%2Bww%40mail.gmail.com
2020-09-07 18:28:06 +12:00
Amit Kapila 464824323e Add support for streaming to built-in logical replication.
To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol so it can identify in-progress
transactions, and allow adding additional bits of information (e.g. the
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

We must, however, explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open, so we
have nowhere to send the data anyway.
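
A subscriber opts in to this behavior per subscription; a sketch, assuming
the option introduced here is named "streaming":

  CREATE SUBSCRIPTION mysub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = on);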

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-09-03 07:54:07 +05:30
Andres Freund 73487a60fc snapshot scalability: Move subxact info to ProcGlobal, remove PGXACT.
Similar to the previous changes this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that considerably cheaper.

As this removes the last member of PGXACT, there is no need to keep it
around anymore.

On a larger 2 socket machine this and the two preceding commits result
in a ~1.07x performance increase in read-only pgbench. For read-heavy
mixed r/w workloads without row level contention, I see about 1.1x.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-14 15:33:35 -07:00
Andres Freund dc7420c2c9 snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot look at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate
horizons until their use, and to use horizon boundaries to determine
whether accurate horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaced with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further tests can benefit from an earlier
computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12 16:03:49 -07:00
Amit Kapila a69e041d0c Improve vacuum error context handling.
Use separate functions to save and restore error context information as
that made code easier to understand.  Also, make it clear that the index
information required for error context is sane.

Author: Andres Freund, Justin Pryzby, Amit Kapila
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CAA4eK1LWo+v1OWu=Sky27GTGSCuOmr7iaURNbc5xz6jO+SaPeA@mail.gmail.com
2020-07-01 07:58:36 +05:30
Tom Lane b5d69b7c22 pgindent run prior to branching v13.
pgperltidy and reformat-dat-files too, though those didn't
find anything to change.
2020-06-07 16:57:08 -04:00
Tom Lane e02ad575d8 Final pgindent run with pg_bsd_indent version 2.1.
This is just to provide a clean basis for comparison of the results
of the new version.  I did fix a typo that crept into 242dfcbaf.

Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
2020-05-16 11:49:14 -04:00
Tom Lane 5cbfce562f Initial pgindent and pgperltidy run for v13.
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.

Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
2020-05-14 13:06:50 -04:00
Michael Paquier 4e87c4836a Fix handling of WAL segments ready to be archived during crash recovery
78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
2020-04-24 08:48:28 +09:00
Alexander Korotkov 1aac32df89 Revert 0f5ca02f53
0f5ca02f53 introduces 3 new keywords.  That appears to be too much for a
relatively small feature.  Given that we are now past feature freeze, it
is too late to discuss new syntax.  So, revert.

Discussion: https://postgr.es/m/28209.1586294824%40sss.pgh.pa.us
2020-04-08 11:37:27 +03:00
Alexander Korotkov 0f5ca02f53 Implement waiting for given lsn at transaction start
This commit adds following optional clause to BEGIN and START TRANSACTION
commands.

  WAIT FOR LSN lsn [ TIMEOUT timeout ]

The new clause postpones transaction start until the given lsn is applied
on the standby.  This allows the user to be sure that changes previously
made on the primary are visible on the standby.
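
For example, following the syntax above (lsn and timeout values are
illustrative):

  BEGIN WAIT FOR LSN '0/3F0A5B10' TIMEOUT 10000;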

A new shared memory struct is used to track the awaited lsn per backend.
The recovery process wakes up a backend once the required lsn is applied.

Author: Ivan Kartyshov, Anna Akenteva
Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi
Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs
Reviewed-by: Amit Kapila, Alexander Korotkov
Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
2020-04-07 23:51:10 +03:00
Thomas Munro 4c04be9b05 Introduce xid8-based functions to replace txid_XXX.
The txid_XXX family of fmgr functions exposes 64 bit transaction IDs to
users as int8.  Now that we have an SQL type xid8 for FullTransactionId,
define a new set of functions including pg_current_xact_id() and
pg_current_snapshot() based on that.  Keep the old functions around too,
for now.
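
For example:

  SELECT pg_current_xact_id();   -- xid8-based replacement for txid_current()
  SELECT pg_current_snapshot();  -- replacement for txid_current_snapshot()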

It's a bit sneaky to use the same C functions for both, but since the
binary representation is identical except for the signedness of the
type, and since older functions are the ones using the wrong signedness,
and since we'll presumably drop the older ones after a reasonable period
of time, it seems reasonable to switch to FullTransactionId internally
and share the code for both.

Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Takao Fujii <btfujiitkp@oss.nttdata.com>
Reviewed-by: Yoshikazu Imai <imai.yoshikazu@fujitsu.com>
Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/20190725000636.666m5mad25wfbrri%40alap3.anarazel.de
2020-04-07 12:04:32 +12:00
Amit Kapila df3b181499 Add infrastructure to track WAL usage.
This allows gathering the WAL generation statistics for each statement
execution.  The three statistics that we collect are the number of WAL
records, the number of full page writes and the amount of WAL bytes
generated.

This helps users with write-intensive workloads see the impact of I/O
due to WAL.  This further enables us to see approximately what
percentage of overall WAL is due to full page writes.

In the future, we can extend this functionality to allow us to compute
the exact amount of WAL data due to full page writes.

This patch in itself is just infrastructure to compute WAL usage data.
The upcoming patches will expose this data via explain, auto_explain,
pg_stat_statements and verbose (auto)vacuum output.
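
A sketch of the EXPLAIN exposure added by those follow-up patches (the
option name WAL is assumed here):

  EXPLAIN (ANALYZE, WAL) INSERT INTO t SELECT generate_series(1, 1000);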

Author: Kirill Bychik, Julien Rouhaud
Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
2020-04-04 10:02:08 +05:30
Fujii Masao 17e0328224 Allow pg_stat_statements to track planning statistics.
This commit makes pg_stat_statements support new GUC
pg_stat_statements.track_planning. If this option is enabled,
pg_stat_statements tracks the planning statistics of the statements,
e.g., the number of times the statement was planned, the total time
spent planning the statement, etc. This feature is useful for identifying
statements that take a long time to plan. Previously, since
pg_stat_statements tracked only the execution statistics, it could not be
used for that purpose.
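
For example (the plans and total_plan_time column names are assumed from
the 1.8 view definition):

  SET pg_stat_statements.track_planning = on;
  SELECT query, plans, total_plan_time
  FROM pg_stat_statements
  ORDER BY total_plan_time DESC LIMIT 5;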

The planning and execution statistics are each stored at the end of
their respective phase, so there is not always a one-to-one relationship
between them. For example, if a statement is successfully planned
but fails in the execution phase, only its planning statistics are stored.
This may cause users to see pg_stat_statements results that differ from
those of the previous version. To avoid this,
pg_stat_statements.track_planning can be disabled.

This commit bumps the version of pg_stat_statements to 1.8
since it changes the definition of pg_stat_statements function.

Author: Julien Rouhaud, Pascal Legrand, Thomas Munro, Fujii Masao
Reviewed-by: Sergei Kornilov, Tomas Vondra, Yoshikazu Imai, Haribabu Kommi, Tom Lane
Discussion: https://postgr.es/m/CAHGQGwFx_=DO-Gu-MfPW3VQ4qC7TfVdH2zHmvZfrGv6fQ3D-Tw@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=0e59Y_6Q_YXYCTHZkqOc6H2pJ54C_Xe=VFu50Aqqp_sA@mail.gmail.com
Discussion: https://postgr.es/m/DB6PR0301MB21352F6210E3B11934B0DCC790B00@DB6PR0301MB2135.eurprd03.prod.outlook.com
2020-04-02 11:20:19 +09:00
Alexander Korotkov 911e702077 Implement operator class parameters
PostgreSQL provides a set of template index access methods, where
opclasses have much freedom in the semantics of indexing.  These index AMs
are GiST, GIN, SP-GiST and BRIN.  There, opclasses define the
representation of keys, the operations on them and the supported search
strategies.  So, it's natural that opclasses may face some tradeoffs that
require a user-side decision.  This commit implements opclass parameters,
allowing users to set some values which tell the opclass how to index the
particular dataset.

This commit doesn't introduce new storage in system catalog.  Instead it uses
pg_attribute.attoptions, which is used for table column storage options but
unused for index attributes.

In order to avoid changing the signature of each opclass support
function, we implement a unified way to pass options to opclass support
functions.  Options are set in fn_expr as a constant bytea expression.
This is possible due to the fact that opclass support functions are
executed outside of expressions, so fn_expr is unused for them.

This commit comes with some examples of opclass options usage.  We
parametrize the signature length in GiST.  That applies to multiple
opclasses: tsvector_ops, gist__intbig_ops, gist_ltree_ops,
gist__ltree_ops, gist_trgm_ops and gist_hstore_ops.  We also parametrize
the maximum number of integer ranges for gist__int_ops.  However, the
main future usage of this feature is expected to be json, where users
would be able to specify which way to index particular json parts.
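
A minimal sketch of the syntax (the parameter name siglen and the index
definition are illustrative):

  CREATE INDEX ON documents USING gist (body_tsv tsvector_ops (siglen = 100));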

Catversion is bumped.

Discussion: https://postgr.es/m/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru
Author: Nikita Glukhov, revised by me
Reviewed-by: Nikolay Shaplov, Robert Haas, Tom Lane, Tomas Vondra, Alvaro Herrera
2020-03-30 19:17:23 +03:00
Amit Kapila b61d161c14 Introduce vacuum errcontext to display additional information.
The additional information displayed will be the block number for errors
occurring while processing the heap, and the index name for errors
occurring while processing an index.

This will help us in diagnosing problems that occur during a vacuum.
For example, if we get an error while vacuuming due to corruption (caused
either by bad hardware or by some bug), it can help us identify the block
in the heap and/or the additional index information.

It sets up an error context callback to display additional information
with the error.  During different phases of vacuum (heap scan, heap
vacuum, index vacuum, index clean up, heap truncate), we update the error
context callback to display appropriate information.  We could extend it
to a more granular level, like adding phases for FSM operations or for
prefetching blocks while truncating.  However, I felt that it would
require adding many more error callback function calls and could make
the code a bit complex, so those are left for now.

Author: Justin Pryzby, with few changes by Amit Kapila
Reviewed-by: Alvaro Herrera, Amit Kapila, Andres Freund, Michael Paquier
and Sawada Masahiko
Discussion: https://www.postgresql.org/message-id/20191120210600.GC30362@telsasoft.com
2020-03-30 07:33:38 +05:30
Alexander Korotkov 0df94beb36 Fix ordering in typedefs.list 2020-03-24 00:59:17 +03:00
Alexander Korotkov 45452825e5 Add new typedefs introduced in 773df883e8 to typedefs.list 2020-03-19 21:40:45 +03:00
Amit Kapila 40d964ec99 Allow vacuum command to process indexes in parallel.
This feature allows the vacuum to leverage multiple CPUs in order to
process indexes.  This enables us to perform index vacuuming and index
cleanup with background workers.  This adds a PARALLEL option to VACUUM
command where the user can specify the number of workers that can be used
to perform the command which is limited by the number of indexes on a
table.  Specifying zero as a number of workers will disable parallelism.
This option can't be used with the FULL option.
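
For example (illustrative table name):

  VACUUM (PARALLEL 4) my_table;  -- up to 4 workers for index processing
  VACUUM (PARALLEL 0) my_table;  -- parallelism disabled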

Each index is processed by at most one vacuum process.  Therefore parallel
vacuum can be used when the table has at least two indexes.

The parallel degree is either specified by the user or determined based on
the number of indexes that the table has, and further limited by
max_parallel_maintenance_workers.  An index can participate in parallel
vacuum if and only if its size is greater than min_parallel_index_scan_size.

Author: Masahiko Sawada and Amit Kapila
Reviewed-by: Dilip Kumar, Amit Kapila, Robert Haas, Tomas Vondra,
Mahendra Singh and Sergei Kornilov
Tested-by: Mahendra Singh and Prabhat Sahu
Discussion:
https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com
https://postgr.es/m/CAA4eK1J-VoR9gzS5E75pcD-OH0mEyCdp8RihcwKrcuw7J-Q0+w@mail.gmail.com
2020-01-20 07:57:49 +05:30
Michael Paquier e1551f96e6 Refactor attribute mappings used in logical tuple conversion
Tuple conversion support in tupconvert.c is able to convert rowtypes
between two relations, inner and outer, which are logically equivalent
but have a different ordering or even dropped columns (used mainly for
inheritance tree and partitions).  This makes use of attribute mappings,
which are simple arrays made of AttrNumber elements with a length
matching the number of attributes of the outer relation.  The length of
the attribute mapping has been treated as completely independent of the
mapping itself until now, making it easy to pass down an incorrect
mapping length.

This commit refactors the code related to attribute mappings and moves
it into an independent facility called attmap.c, extracted from
tupconvert.c.  This merges the attribute mapping with its length,
avoiding the need to guess which mapping length to use, as it is
computed once, when the map is built.

This will avoid mistakes like what has been fixed in dc816e58, which has
used an incorrect mapping length by matching it with the number of
attributes of an inner relation (a child partition) instead of an outer
relation (a partitioned table).

Author: Michael Paquier
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/20191121042556.GD153437@paquier.xyz
2019-12-18 16:23:02 +09:00
Michael Paquier 4cb658af70 Refactor reloption handling for index AMs in-core
This reworks the reloption parsing and build of a couple of index AMs by
creating new structures for each index AM's options.  This split was
already done for BRIN, GIN and GiST (which actually has a fillfactor
parameter), but not for hash, B-tree and SPGiST which relied on
StdRdOptions due to an overlap with the default option set.

This saves a couple of bytes for rd_options in each relcache entry with
indexes making use of relation options, and brings more consistency
between all index AMs.  While on it, add a couple of AssertMacro() calls
to make sure that utility macros to grab values of reloptions are used
with the expected index AM.

Author: Nikolay Shaplov
Reviewed-by: Amit Langote, Michael Paquier, Álvaro Herrera, Dent John
Discussion: https://postgr.es/m/4127670.gFlpRb6XCm@x200m
2019-11-25 09:40:53 +09:00
Michael Paquier 640c19869f Add dummy_index_am to src/test/modules/
This includes more tests dedicated to relation options, bringing the
coverage of this code close to 100%, and the module can be used for
other purposes, like a base template for an index AM implementation.

Author: Nikolay Shaplov, Michael Paquier
Reviewed-by: Álvaro Herrera, Dent John
Discussion: https://postgr.es/m/17071942.m9zZutALE6@x200m
2019-09-25 12:11:12 +09:00
Alexander Korotkov 6cae9d2c10 Improve handling of NULLs in KNN-GiST and KNN-SP-GiST
This commit improves the subject in two ways:

 * It removes the ugliness of 02f90879e7, which stores distance values and
   null flags in two separate arrays after the GISTSearchItem struct.
   Instead we pack both the distance value and the null flag in the
   IndexOrderByDistance struct.  Alignment overhead should be negligible,
   because we typically deal with at most a few "col op const" expressions
   in an ORDER BY clause.
 * It fixes the handling of "col op NULL" expressions in KNN-SP-GiST.  Now,
   these expressions are not passed to support functions, which can't deal
   with them.  Instead, a NULL result is implicitly assumed.  In the future
   we may decide to teach support functions to deal with NULL arguments,
   but the current solution is a bugfix suitable for backpatch.

Reported-by: Nikita Glukhov
Discussion: https://postgr.es/m/826f57ee-afc7-8977-c44c-6111d18b02ec%40postgrespro.ru
Author: Nikita Glukhov
Reviewed-by: Alexander Korotkov
Backpatch-through: 9.4
2019-09-19 21:48:39 +03:00
Robert Haas bd124996ef Create an API for inserting and deleting rows in TOAST tables.
This moves much of the non-heap-specific logic from toast_delete and
toast_insert_or_update into helper functions accessible via a new
header, toast_helper.h.  Using the functions in this module, a table
AM can implement creation and deletion of TOAST table rows with
much less code duplication than was possible heretofore.  Some
table AMs won't want to use the TOAST logic at all, but for those
that do this will make that easier.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-09-06 10:38:51 -04:00
Michael Paquier 9adda24543 Refactor code of reindexdb for query generation
This merges the portion related to REINDEX SYSTEM into the routine
already available for all the other reindex types, making the query
generation cleaner.  While on it, change the handling of the reindex
types to use an enum, which makes it possible to get rid of the hardcoded
strings previously used directly in the query generation for the same
purpose (aka "TABLE", "DATABASE", etc.).

Per discussion with Julien Rouhaud, Tom Lane, Alvaro Herrera and me.

Author: Julien Rouhaud
Discussion: https://postgr.es/m/CAOBaU_bSmSik_WRK9niDnm-3NkNZky6+uKxkmQwvthZvMWpS5A@mail.gmail.com
2019-07-02 11:36:53 +09:00
Tom Lane 9e1c9f9594 pgindent run prior to branching v12.
pgperltidy and reformat-dat-files too, though the latter didn't
find anything to change.
2019-07-01 12:37:52 -04:00
Tomas Vondra 6cbfb784c3 Rework the pg_statistic_ext catalog
Since extended statistics were introduced in PostgreSQL 10, there was a
single catalog, pg_statistic_ext, storing both the definitions and the
built statistics.  That's however problematic when a user is supposed to have
access only to the definitions, but not to user data.

Consider for example pg_dump on a database with RLS enabled - if the
pg_statistic_ext catalog respects RLS (which it should, if it contains
user data), pg_dump would not see any records and the result would not
define any extended statistics.  That would be a surprising behavior.

Until now this was not a pressing issue, because the existing types of
extended statistics (functional dependencies and ndistinct coefficients)
do not include any user data directly.  This changed with the introduction
of MCV lists, which do include the most common combinations of values.

The easiest way to fix this is to split the pg_statistic_ext catalog
into two - one for definitions, one for the built statistic values.
The new catalog is called pg_statistic_ext_data, and we're maintaining
a 1:1 relationship with the old catalog - either there are matching
records in both catalogs, or neither of them.
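
A sketch of that 1:1 relationship (the stxoid join column name is assumed):

  SELECT e.stxname, d.stxoid
  FROM pg_statistic_ext e
  JOIN pg_statistic_ext_data d ON d.stxoid = e.oid;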

Bumped CATVERSION due to changing system catalog definitions.

Author: Dean Rasheed, with improvements by me
Reviewed-by: Dean Rasheed, John Naylor
Discussion: https://postgr.es/m/CAEZATCUhT9rt7Ui%3DVdx4N%3D%3DVV5XOK5dsXfnGgVOz_JhAicB%3DZA%40mail.gmail.com
2019-06-16 01:20:31 +02:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Andres Freund 86b85044e8 tableam: Add table_multi_insert() and revamp/speed-up COPY FROM buffering.
This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.

A simple conversion of COPY FROM to use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and dropping / creating slots when
there's constant partition changes is measurable.

Thus instead revamp the COPY FROM buffering for partitioned tables to
allow buffering inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there've been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actually speed such cases
up significantly.

By using slots, COPY of very narrow rows into unlogged / temporary
tables might slow down very slightly (due to the indirect function
calls).

Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
2019-04-04 16:28:18 -07:00
Thomas Munro 3eb77eba5a Refactor the fsync queue for wider use.
Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off fsync work in the same way.

For now only md.c uses the new interface, but other users are being
proposed.  Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one.

Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key.  This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse or high
segment numbers).

Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
2019-04-04 23:38:38 +13:00
Alexander Korotkov 0a02e2ae02 GIN support for @@ and @? jsonpath operators
This commit makes the existing GIN operator classes jsonb_ops and
jsonb_path_ops support the "jsonb @@ jsonpath" and "jsonb @? jsonpath"
operators.  The basic idea is to extract statements of the following
form out of a jsonpath:

 key1.key2. ... .keyN = const

The rest of the jsonpath is rechecked from the heap.

Catversion is bumped.
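
As a hypothetical illustration (table and column names invented here):

CREATE TABLE docs (js jsonb);
CREATE INDEX ON docs USING gin (js jsonb_path_ops);

-- both forms can now use the GIN index
SELECT * FROM docs WHERE js @? '$.key1.key2 ? (@ == 1)';
SELECT * FROM docs WHERE js @@ '$.key1.key2 == 1';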

Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Nikita Glukhov, Alexander Korotkov
Reviewed-by: Jonathan Katz, Pavel Stehule
2019-04-01 18:08:52 +03:00
Thomas Munro ad308058cc Use FullTransactionId for the transaction stack.
Provide GetTopFullTransactionId() and GetCurrentFullTransactionId().
The intended users of these interfaces are access methods that use
xids for visibility checks but don't want to have to go back and
"freeze" existing references some time later before the 32 bit xid
counter wraps around.

Use a new struct to serialize the transaction state for parallel
query, because FullTransactionId doesn't fit into the previous
serialization scheme very well.

Author: Thomas Munro
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:24:43 +13:00
Thomas Munro 2fc7af5e96 Add basic infrastructure for 64 bit transaction IDs.
Instead of inferring epoch progress from xids and checkpoints,
introduce a 64 bit FullTransactionId type and use it to track xid
generation.  This fixes an unlikely bug where the epoch is reported
incorrectly if the range of active xids wraps around more than once
between checkpoints.

The only user-visible effect of this commit is to correct the epoch
used by txid_current() and txid_status(), also visible with
pg_controldata, in those rare circumstances.  It also creates some
basic infrastructure so that later patches can use 64 bit
transaction IDs in more places.

The new type is a struct that we pass by value, as a form of strong
typedef.  This prevents the sort of accidental confusion between
TransactionId and FullTransactionId that would be possible if we
were to use a plain old uint64.
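
The user-visible part can be sketched with existing functions (an
illustration, not taken from the commit):

-- the high 32 bits of these values carry the epoch, which is now
-- tracked correctly even across multiple wraparounds
SELECT txid_current();
SELECT txid_status(txid_current());  -- 'in progress'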

Author: Thomas Munro
Reported-by: Amit Kapila
Reviewed-by: Andres Freund, Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:12:20 +13:00
Andres Freund 2a96909a4a tableam: Support for an index build's initial table scan(s).
To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM.  While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.

Currently the range-based interface wouldn't make much sense for
non-block-based table AMs. But that seems acceptable for now.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-27 19:59:06 -07:00
Andres Freund 558a9165e0 Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to by the index looks
   like. That was easy enough before the introduction of tableam (we
   knew it had to be heap, although some trickery around logging the
   heap relfilenodes was required). But to properly handle table AMs we
   need per-database catalog access to look up the AM handler, which
   recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
   decoding on standbys. When on a catalog table, we need to be able
   to conflict with slots that have an xid horizon that's too old. But
   computing the horizon by visiting the heap only works once
   consistency is reached, while we always need to be able to detect
   conflicts.

There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there are no connected backends).

Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.

To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers.  As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency, when not globally enabled.  That's
possibly not the perfect formula, but seems good enough for now.

Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.

Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
    https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
    https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-26 16:52:54 -07:00
Andres Freund 5db6df0c01 tableam: Add tuple_{insert, delete, update, lock} and use.
This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.

Previously both delete/update/lock call-sites and the EPQ mechanism had
to have awareness of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that finds the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one.  It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch().  Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split into two,
to avoid callers needing AM specific knowledge to differentiate.

The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possibly slightly less helpful for users, it seems
like an acceptable trade-off.

As part of this commit HTSU_Result has been renamed to TM_Result, and
its members have been expanded to differentiate between updating and
deleting. HeapUpdateFailureData has been renamed to TM_FailureData.

The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
table_complete_speculative(succeeded = false).

Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough to better be done separately.

Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.

Author: Andres Freund and Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-23 19:55:57 -07:00
Michael Paquier ed308d7837 Add options to enable and disable checksums in pg_checksums
An offline cluster can now work with more modes in pg_checksums:
- --enable enables checksums in a cluster, updating all blocks with a
correct checksum, and updating the control file at the end.
- --disable disables checksums in a cluster, updating only the control
file.
- --check is an extra option able to verify checksums for a cluster, and
the default used if no mode is specified.

When running --enable or --disable, the data folder gets fsync'd for
durability, and then it is followed by a control file update and flush
to keep the operation consistent should the tool be interrupted, killed
or the host unplugged.  If no mode is specified in the options, then
--check is used for compatibility with older versions of pg_checksums
(named pg_verify_checksums in v11 where it was introduced).

Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de
2019-03-23 08:12:55 +09:00
Alexander Korotkov 72b6460336 Partial implementation of SQL/JSON path language
The SQL:2016 standard, among other things, contains a set of SQL/JSON
features for JSON processing inside a relational database.  The core of
SQL/JSON is the JSON path language, which allows accessing parts of JSON
documents and making computations over them.  This commit implements
partial support for the JSON path language as a separate datatype called
"jsonpath".  The implementation is partial because it's lacking datetime
support and suppression of numeric errors.  Missing features will be
added later by separate commits.

Support of SQL/JSON features requires implementation of separate nodes,
which will be considered in subsequent patches.  This commit includes the
following set of plain functions, which allow executing jsonpath over
jsonb values:

 * jsonb_path_exists(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_match(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_query(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_query_array(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_query_first(jsonb, jsonpath[, jsonb, bool]).

This commit also implements "jsonb @? jsonpath" and "jsonb @@ jsonpath", which
are wrappers over jsonpath_exists(jsonb, jsonpath) and jsonpath_predicate(jsonb,
jsonpath) respectively.  These operators will have index support
(implemented in subsequent patches).

Catversion bumped, to add new functions and operators.
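
A few hypothetical examples of the new functions and operators (values
invented here):

SELECT jsonb_path_exists('{"a": [1, 2, 3]}', '$.a[*] ? (@ > 2)');  -- true
SELECT jsonb_path_query('{"a": [1, 2, 3]}', '$.a[*] ? (@ > 1)');   -- 2, 3
SELECT jsonb_path_query_array('{"a": [1, 2, 3]}', '$.a[*]');       -- [1, 2, 3]
SELECT '{"a": [1, 2, 3]}'::jsonb @? '$.a[*] ? (@ > 2)';            -- true
SELECT '{"a": [1, 2, 3]}'::jsonb @@ '$.a[0] == 1';                 -- true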

Code was written by Nikita Glukhov and Teodor Sigaev, revised by me.
Documentation was written by Oleg Bartunov and Liudmila Mantrova.  The work
was inspired by Oleg Bartunov.

Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Nikita Glukhov, Teodor Sigaev, Alexander Korotkov, Oleg Bartunov, Liudmila Mantrova
Reviewed-by: Tomas Vondra, Andrew Dunstan, Pavel Stehule, Alexander Korotkov
2019-03-16 12:16:48 +03:00
Andres Freund c2fe139c20 tableam: Add and use scan APIs.
To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:

1) Heap scans need to be generalized into table scans. Do this by
   introducing TableScanDesc, which will be the "base class" for
   individual AMs. This contains the AM independent fields from
   HeapScanDesc.

   The previous heap_{beginscan,rescan,endscan} et al. have been
   replaced with a table_ version.

   There's no direct replacement for heap_getnext(), as that returned
   a HeapTuple, which is undesirable for other AMs. Instead there's
   table_scan_getnextslot().  But note that heap_getnext() lives on;
   it's still used widely to access catalog tables.

   This is achieved by new scan_begin, scan_end, scan_rescan,
   scan_getnextslot callbacks.

2) The portion of parallel scans that's shared between backends need
   to be able to do so without the user doing per-AM work. To achieve
   that new parallelscan_{estimate, initialize, reinitialize}
   callbacks are introduced, which operate on a new
   ParallelTableScanDesc, which again can be subclassed by AMs.

   As it is likely that several AMs are going to be block oriented,
   block oriented callbacks that can be shared between such AMs are
   provided and used by heap. table_block_parallelscan_{estimate,
   initialize, reinitialize} as callbacks, and
   table_block_parallelscan_{nextpage, init} for use in AMs. These
   operate on a ParallelBlockTableScanDesc.

3) Index scans need to be able to access tables to return a tuple, and
   there needs to be state across individual accesses to the heap to
   store state like buffers. That's now handled by introducing a
   sort-of-scan IndexFetchTable, which again is intended to be
   subclassed by individual AMs (for heap IndexFetchHeap).

   The relevant callbacks for an AM are index_fetch_{end, begin,
   reset} to create the necessary state, and index_fetch_tuple to
   retrieve an indexed tuple.  Note that index_fetch_tuple
   implementations need to be smarter than just blindly fetching the
   tuples for AMs that have optimizations similar to heap's HOT - the
   currently alive tuple in the update chain needs to be fetched if
   appropriate.

   Similar to table_scan_getnextslot(), it's undesirable to continue
   to return HeapTuples. Thus index_fetch_heap (might want to rename
   that later) now accepts a slot as an argument. Core code doesn't
   have a lot of call sites performing index scans without going
   through the systable_* API (in contrast to loads of heap_getnext
   calls and working directly with HeapTuples).

   Index scans now store the result of a search in
   IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
   target is not generally a HeapTuple anymore that seems cleaner.

To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:

a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
   slots capable of holding a tuple of the AM's
   type. table_slot_callbacks() and table_slot_create() are based
   upon that, but have additional logic to deal with views, foreign
   tables, etc.

   While this change could have been done separately, nearly all the
   call sites that needed to be adapted for the rest of this commit
   would also have needed to be adapted for
   table_slot_callbacks(), making separation not worthwhile.

b) tuple_satisfies_snapshot checks whether the tuple in a slot is
   currently visible according to a snapshot. That's required as a few
   places now don't have a buffer + HeapTuple around, but a
   slot (which in heap's case internally has that information).

Additionally a few infrastructure changes were needed:

I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
   internally uses a slot to keep track of tuples. While
   systable_getnext() still returns HeapTuples, and will so for the
   foreseeable future, the index API (see 1) above) now only deals with
   slots.

The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.

Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 12:46:41 -07:00
Andres Freund 8586bf7ed8 tableam: introduce table AM infrastructure.
This introduces the concept of table access methods, i.e. CREATE
  ACCESS METHOD ... TYPE TABLE and
  CREATE TABLE ... USING (storage-engine).
No table access functionality is delegated to table AMs as of this
commit; that'll be done in following commits.

Subsequent commits will incrementally abstract table access
functionality to be routed through table access methods. That change
is too large to be reviewed & committed at once, so it'll be done
incrementally.

Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and definitely is a lot more work,
without a lot of benefit.

Table access methods are specified similar to index access methods,
i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs, that struct needs to live as long
as the backend; typically that's achieved by just returning a pointer
to a constant struct.

Psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs.  It's quite possible that this behaviour
still needs to be fine tuned.

For now it's not allowed to set a table AM for a partitioned table, as
we've not resolved how partitions would inherit that. Disallowing
allows us to introduce, if we decide that's the way forward, such a
behaviour without a compatibility break.

Catversion bumped, to add the heap table AM and references to it.
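
A sketch of the DDL shape (the AM and handler names are hypothetical;
heap is the only in-core table AM as of this commit):

CREATE ACCESS METHOD myam TYPE TABLE HANDLER my_tableam_handler;
CREATE TABLE t (a int) USING myam;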

Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
    https://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.de
    https://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
2019-03-06 09:54:38 -08:00
Andres Freund a9c35cf85c Change function call information to be variable length.
Before this change FunctionCallInfoData, the struct that arguments etc.
for V1 function calls are stored in, always had space for FUNC_MAX_ARGS
(100 by default) arguments, storing datums and their nullness in two
arrays.  For nearly every function call 100 arguments is far more than
needed, therefore wasting memory. Arg and argnull being two separate
arrays also guarantees that to access a single argument, two
cachelines have to be touched.

Change the layout so there's a single variable-length array with pairs
of value / isnull. That drastically reduces memory consumption for
most function calls (on x86-64 a two argument function now uses
64 bytes, previously 936 bytes), and makes it very likely that argument
value and its nullness are on the same cacheline.

Arguments are stored in a new NullableDatum struct, which, due to
padding, needs more memory per argument than before. But as usually
far fewer arguments are stored, and individual arguments are cheaper
to access, that's still a clear win.  It's likely that there's other
places where conversion to NullableDatum arrays would make sense,
e.g. TupleTableSlots, but that's for another commit.

Because the function call information is now variable-length,
allocations have to take the number of arguments into account. For
heap allocations that can be done with SizeForFunctionCallInfoData(),
for on-stack allocations there's a new LOCAL_FCINFO(name, nargs) macro
that helps to allocate an appropriately sized and aligned variable.

Some places with stack allocation function call information don't know
the number of arguments at compile time, and currently variably sized
stack allocations aren't allowed in postgres. Therefore allow for
FUNC_MAX_ARGS space in these cases. They're not that common, so for
now that seems acceptable.

Because of the need to allocate FunctionCallInfo of the appropriate
size, older extensions may need to update their code. To avoid subtle
breakages, the FunctionCallInfoData struct has been renamed to
FunctionCallInfoBaseData. Most code only references FunctionCallInfo,
so that shouldn't cause much collateral damage.

This change is also a prerequisite for more efficient expression JIT
compilation (by allocating the function call information on the stack,
allowing LLVM to optimize it away); previously the size of the call
information caused problems inside LLVM's optimizer.

Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20180605172952.x34m5uz6ju6enaem@alap3.anarazel.de
2019-01-26 14:17:52 -08:00
Andres Freund 63746189b2 Change snapshot type to be determined by enum rather than callback.
This is in preparation for allowing the same snapshot to be used for
different table AMs. With the current callback based approach we would
need one callback for each supported AM, which clearly would not be
extensible.  Thus add a new Snapshot->snapshot_type field, and move
the dispatch into HeapTupleSatisfiesVisibility() (which is now a
function). Later work will then dispatch calls to
HeapTupleSatisfiesVisibility() and other AMs visibility functions
depending on the type of the table.  The central SnapshotType enum
also seems like a good location to centralize documentation about the
intended behaviour of various types of snapshots.

As tqual.h isn't included by bufmgr.h any more (as HeapTupleSatisfies*
isn't referenced by TestForOldSnapshot() anymore), a few files now need
to include it directly.

Author: Andres Freund, loosely based on earlier work by Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-01-21 17:03:15 -08:00
Michael Paquier 9ebe0572ce Refactor cluster_rel() to handle more options
This extends cluster_rel() in such a way that more options can be added
in the future, which will reduce the amount of code churn for an
upcoming SKIP_LOCKED option aimed at VACUUM.  As VACUUM FULL is a
different flavor of CLUSTER, we want to make that extensible to ease
integration.

This only reworks the API and its callers, without providing anything
user-facing.  Two options are present now: verbose mode and relation
recheck when doing the cluster command work across multiple
transactions.  This could also be used as a base to extend the
grammar of CLUSTER later on.

Author: Michael Paquier
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/20180723031058.GE2854@paquier.xyz
2018-07-24 11:37:32 +09:00
Robert Haas 32df1c9afa Add subtransaction handling for table synchronization workers.
Since the old logic was completely unaware of subtransactions, a
change made in a subsequently-aborted subtransaction would still cause
workers to be stopped at toplevel transaction commit.  Fix that by
managing a stack of worker lists rather than just one.

Amit Khandekar and Robert Haas

Discussion: http://postgr.es/m/CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
2018-07-16 17:33:22 -04:00
Andrew Dunstan 2c64d20048 Update typedefs list 2018-06-30 12:07:27 -04:00
Tom Lane f83bf385c1 Preliminary work for pgindent run.
Update typedefs.list from current buildfarm results.  Adjust pgindent's
typedef blacklist to block some more unfortunate typedef names that have
snuck in since last time.  Manually tweak a few places where I didn't
like the initial results of pgindent'ing.
2018-04-26 14:45:04 -04:00
Simon Riggs 08ea7a2291 Revert MERGE patch
This reverts commits d204ef6377,
83454e3c2b and a few more commits thereafter
(complete list at the end) related to MERGE feature.

While the feature was fully functional, with sufficient test coverage and
necessary documentation, it was felt that some parts of the executor and
parse-analyzer can use a different design and it wasn't possible to do that in
the available time. So it was decided to revert the patch for PG11 and retry
again in the future.

Thanks again to all reviewers and bug reporters.

List of commits reverted, in reverse chronological order:

 f1464c5380 Improve parse representation for MERGE
 ddb4158579 MERGE syntax diagram correction
 530e69e59b Allow cpluspluscheck to pass by renaming variable
 01b88b4df5 MERGE minor errata
 3af7b2b0d4 MERGE fix variable warning in non-assert builds
 a5d86181ec MERGE INSERT allows only one VALUES clause
 4b2d44031f MERGE post-commit review
 4923550c20 Tab completion for MERGE
 aa3faa3c7a WITH support in MERGE
 83454e3c2b New files for MERGE
 d204ef6377 MERGE SQL Command following SQL:2016

Author: Pavan Deolasee
Reviewed-by: Michael Paquier
2018-04-12 11:22:56 +01:00
Simon Riggs d204ef6377 MERGE SQL Command following SQL:2016
MERGE performs actions that modify rows in the target table
using a source table or query. MERGE provides a single SQL
statement that can conditionally INSERT/UPDATE/DELETE rows --
a task that would otherwise require multiple PL statements.
e.g.

MERGE INTO target AS t
USING source AS s
ON t.tid = s.sid
WHEN MATCHED AND t.balance > s.delta THEN
  UPDATE SET balance = t.balance - s.delta
WHEN MATCHED THEN
  DELETE
WHEN NOT MATCHED AND s.delta > 0 THEN
  INSERT VALUES (s.sid, s.delta)
WHEN NOT MATCHED THEN
  DO NOTHING;

MERGE works with regular and partitioned tables, including
column and row security enforcement, as well as support for
row, statement and transition triggers.

MERGE is optimized for OLTP and is parameterizable, though
also useful for large scale ETL/ELT. MERGE is not intended
to be used in preference to existing single SQL commands
for INSERT, UPDATE or DELETE since there is some overhead.
MERGE can be used statically from PL/pgSQL.

MERGE does not yet support inheritance, write rules,
RETURNING clauses, updatable views or foreign tables.
MERGE follows the SQL standard per the most recent SQL:2016.

Includes full tests and documentation, including full
isolation tests to demonstrate the concurrent behavior.

This version was written from scratch in 2017 by Simon Riggs,
using docs and tests originally written in 2009. Later work
from Pavan Deolasee has been both complex and deep, leaving
the lead author credit now in his hands.
Extensive discussion of concurrency from Peter Geoghegan,
with thanks for the time and effort contributed.

Various issues reported via sqlsmith by Andreas Seltenreich

Authors: Pavan Deolasee, Simon Riggs
Reviewer: Peter Geoghegan, Amit Langote, Tomas Vondra, Simon Riggs

Discussion:
https://postgr.es/m/CANP8+jKitBSrB7oTgT9CY2i1ObfOt36z0XMraQc+Xrz8QB0nXA@mail.gmail.com
https://postgr.es/m/CAH2-WzkJdBuxj9PO=2QaO9-3h3xGbQPZ34kJH=HukRekwM-GZg@mail.gmail.com
2018-04-03 09:28:16 +01:00
Simon Riggs 7cf8a5c302 Revert "Modified files for MERGE"
This reverts commit 354f13855e.
2018-04-02 21:34:15 +01:00
Simon Riggs 354f13855e Modified files for MERGE 2018-04-02 21:12:47 +01:00
Andres Freund 51bc271790 Add Bloom filter implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership.  Callers will sometimes incur false
positives, but never false negatives.  The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering, and data
synchronization testing.  Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.

This commit adds a test harness extension module, test_bloomfilter.  It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.

This is infrastructure for the upcoming "heapallindexed" amcheck patch,
which verifies the consistency of a heap relation against one of its
indexes.

Author: Peter Geoghegan
Reviewed-By: Andrey Borodin, Michael Paquier, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzm5VmG7cu1N-H=nnS57wZThoSDQU+F5dewx3o84M+jY=g@mail.gmail.com
2018-03-31 17:49:41 -07:00
Andres Freund 2a0faed9d7 Add expression compilation support to LLVM JIT provider.
In addition to the interpretation of expressions (which backs
evaluation of WHERE clauses, target list projection, aggregate
transition values etc.), support compiling expressions to native code,
using the infrastructure added in earlier commits.

To avoid duplicating a lot of code, only support emitting code for
cases that are likely to be performance critical. For expression steps
that aren't deemed that, use the existing interpreter.

The generated code isn't great - some architectural changes are
required to address that. But this already yields a significant
speedup for some analytics queries, particularly with WHERE clauses
filtering a lot, or computing multiple aggregates.

Author: Andres Freund
Tested-By: Thomas Munro
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de

Disable JITing for VALUES() nodes.

VALUES() nodes are only ever executed once. This is primarily helpful
for debugging, when forcing JITing even for cheap queries.

Author: Andres Freund
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
2018-03-22 14:45:59 -07:00
Andres Freund 7ec0d80c05 Add helpers for emitting LLVM IR.
These basically just help to make code a bit more concise and pgindent
proof.

Author: Andres Freund
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
2018-03-22 11:51:58 -07:00
Andres Freund b96d550eb0 Support for optimizing and emitting code in LLVM JIT provider.
This commit introduces the ability to actually generate code using
LLVM. In particular, this adds:

- Ability to emit code both in heavily optimized and largely
  unoptimized fashion
- Batching facility to allow functions to be defined in small
  increments, but optimized and emitted in executable form in larger
  batches (for performance and memory efficiency)
- Type and function declaration synchronization between runtime
  generated code and normal postgres code. This is critical to be able
  to access struct fields etc.
- Developer oriented jit_dump_bitcode GUC, for inspecting / debugging
  the generated code.
- per JitContext statistics of number of functions, time spent
  generating code, optimizing, and emitting it.  This will later be
  employed for EXPLAIN support.

This commit doesn't yet contain any code actually generating
functions. That'll follow in later commits.

Documentation for GUCs added, and for JIT in general, will be added in
later commits.

Author: Andres Freund, with contributions by Pierre Ducroquet
Testing-By: Thomas Munro, Peter Eisentraut
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
2018-03-22 11:05:22 -07:00
Robert Haas e2f1eb0ee3 Implement partition-wise grouping/aggregation.
If the partition keys of the input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition.  This allows aggregation/grouping over a partitioned
relation to be broken down into aggregation/grouping on each
partition.  This should be no worse, and often better, than the normal
approach.

If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results.  This is less certain
to be a win, but it's still useful.
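
As a hypothetical illustration, with t partitioned by (part_key):

-- full aggregation per partition: every group lives in one partition
SELECT part_key, count(*) FROM t GROUP BY part_key;

-- partial aggregation per partition, finalized after appending
SELECT other_col, count(*) FROM t GROUP BY other_col;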

Jeevan Chalke, Ashutosh Bapat, Robert Haas.  The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.

Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
2018-03-22 12:49:48 -04:00
Andres Freund 432bb9e04d Basic JIT provider and error handling infrastructure.
This commit introduces:

1) JIT provider abstraction, which allows JIT functionality to be
   implemented in separate shared libraries. That's desirable because
   it allows installing JIT support as a separate package, and because
   it allows experimentation with different forms of JITing.
2) JITContexts, which can be used, via functions introduced in follow-up
   commits, to emit JITed functions and have them cleaned up on
   error.
3) The outline of a LLVM JIT provider, which will be fleshed out in
   subsequent commits.

Documentation for GUCs added, and for JIT in general, will be added in
later commits.

Author: Andres Freund, with architectural input from Jeff Davis
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
2018-03-21 19:28:28 -07:00
Robert Haas 9da0cc3528 Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds.  Testing
to date shows that this can often be 2-3x faster than a serial
index build.

The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature.  We can
refine it as we get more experience.

Peter Geoghegan with some help from Rushabh Lathia.  While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature.  Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.

Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 13:32:44 -05:00
Robert Haas 2f17844104 Allow UPDATE to move rows between partitions.
When an UPDATE causes a row to no longer match the partition
constraint, try to move it to a different partition where it does
match the partition constraint.  In essence, the UPDATE is split into
a DELETE from the old partition and an INSERT into the new one.  This
can lead to surprising behavior in concurrency scenarios because
EvalPlanQual rechecks won't work as they normally did; the known
problems are documented.  (There is a pending patch to improve the
situation further, but it needs more review.)
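
A minimal hypothetical example:

CREATE TABLE t (a int) PARTITION BY RANGE (a);
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (0) TO (10);
CREATE TABLE t2 PARTITION OF t FOR VALUES FROM (10) TO (20);
INSERT INTO t VALUES (5);
UPDATE t SET a = 15;  -- now moves the row: DELETE from t1, INSERT into t2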

Amit Khandekar, reviewed and tested by Amit Langote, David Rowley,
Rajkumar Raghuwanshi, Dilip Kumar, Amul Sul, Thomas Munro, Álvaro
Herrera, Amit Kapila, and me.  A few final revisions by me.

Discussion: http://postgr.es/m/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=yJO3u3QhyP8EEPQ@mail.gmail.com
2018-01-19 15:33:06 -05:00
Andres Freund 69c3936a14 Expression evaluation based aggregate transition invocation.
Previously aggregate transition and combination functions were invoked
by special case code in nodeAgg.c, evaluating input and filters
separately using the expression evaluation machinery. That turns out
to not be great for performance for several reasons:

- repeated expression evaluations have some cost
- the transition function invocations are poorly predicted, as
  commonly there are multiple aggregates in a query, resulting in the
  same call-stack invoking different functions.
- filter and input computation had to be done separately
- the special case code made it hard to implement JITing of the whole
  transition function invocation

Address this by building one large expression that computes input,
evaluates filters, and invokes transition functions.

This leads to moderate speedups in queries bottlenecked by aggregate
computations, and enables large speedups for similar cases once JITing
is done.

There's potential for further improvement:
- It'd be nice if we could simplify the somewhat expensive
  aggstate->all_pergroups lookups.
- right now there's still an advance_transition_function invocation in
  nodeAgg.c, leading to some code duplication.

Author: Andres Freund
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
2018-01-09 13:25:38 -08:00
Andres Freund 1804284042 Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash.  While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.

After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory.  If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.

The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:

 * avoids wasting memory on duplicated hash tables
 * avoids wasting disk space on duplicated batch files
 * divides the work of building the hash table over the CPUs

One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables.  This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes.  Another is that
outer batch 0 must be written to disk if multiple batches are required.

A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.

A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
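
An illustrative plan shape (table names invented here, not taken from
the commit):

SET enable_parallel_hash = on;
EXPLAIN SELECT * FROM big1 JOIN big2 USING (id);
-- can now produce:
--   Gather
--     ->  Parallel Hash Join
--           ->  Parallel Seq Scan on big1
--           ->  Parallel Hash
--                 ->  Parallel Seq Scan on big2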

Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
    https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
    https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 00:43:41 -08:00
Andres Freund ab9e0e718a Add shared tuplestores.
SharedTuplestore allows multiple participants to write into it and
then read the tuples back from it in parallel.  Each reader receives
partial results.

For now it always uses disk files, but other buffering policies and
other kinds of scans (ie each reader receives complete results) may be
useful in future.

The upcoming parallel hash join feature will use this facility.

Author: Thomas Munro
Reviewed-By: Peter Geoghegan, Andres Freund, Robert Haas
Discussion: https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
2017-12-18 14:23:19 -08:00
Andres Freund dc6c4c9dc2 Add infrastructure for sharing temporary files between backends.
SharedFileSet allows temporary files to be created by one backend and
then exported for read-only access by other backends, with clean-up
managed by reference counting associated with a DSM segment.  This
includes changes to fd.c and buffile.c to support the new kind of
temporary file.

This will be used by an upcoming patch adding support for parallel
hash joins.

Author: Thomas Munro
Reviewed-By: Peter Geoghegan, Andres Freund, Robert Haas, Rushabh Lathia
Discussion:
    https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
    https://postgr.es/m/CAH2-WznJ_UgLux=_jTgCQ4yFz0iBntudsNKa1we3kN1BAG=88w@mail.gmail.com
2017-12-01 16:30:56 -08:00
Robert Haas eaedf0df71 Update typedefs.list and re-run pgindent
Discussion: http://postgr.es/m/CA+TgmoaA9=1RWKtBWpDaj+sF3Stgc8sHgf5z=KGtbjwPLQVDMA@mail.gmail.com
2017-11-29 09:24:24 -05:00
Andres Freund 7082e614c0 Provide DSM segment to ExecXXXInitializeWorker functions.
Previously, executor nodes running in parallel worker processes didn't
have access to the dsm_segment object used for parallel execution.  In
order to support resource management based on DSM segment lifetime,
they need that.  So create a ParallelWorkerContext object to hold it
and pass it to all InitializeWorker functions.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
2017-11-16 17:39:18 -08:00
Robert Haas 1aba8e651a Add hash partitioning.
Hash partitioning is useful when you want to partition a growing data
set evenly.  This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join.

At present, we still depend on constraint exclusion for partition
pruning, and the shape of the partition constraints for hash
partitioning is such that that doesn't work.  Work is underway to fix
that, which should both improve performance and make partition
pruning work with hash partitioning.
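
A hypothetical example of the new syntax:

CREATE TABLE orders (id int, payload text) PARTITION BY HASH (id);
CREATE TABLE orders_p0 PARTITION OF orders
  FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders
  FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... and likewise for remainders 2 and 3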

Amul Sul, reviewed and tested by Dilip Kumar, Ashutosh Bapat, Yugo
Nagata, Rajkumar Raghuwanshi, Jesper Pedersen, and by me.  A few
final tweaks also by me.

Discussion: http://postgr.es/m/CAAJ_b96fhpJAP=ALbETmeLk1Uni_GFZD938zgenhF49qgDTjaQ@mail.gmail.com
2017-11-09 18:07:44 -05:00
Andres Freund 141fd1b66c Improve sys/catcache performance.
The following are the individual improvements:
1) Avoidance of FunctionCallInfo based function calls, replaced by
   more efficient functions with a native C argument interface.
2) Don't extract columns from a cache entry's tuple whenever matching
   entries - instead store them as a Datum array. This also allows
   getting rid of having to build dummy tuples for negative & list
   entries, and of a hack for dealing with cstring vs. text weirdness.
3) Reorder members of catcache.h struct, so important entries are more
   likely to be on one cacheline.
4) Allowing the compiler to specialize critical SearchCatCache for a
   specific number of attributes allows unrolling loops and avoiding
   other nkeys-dependent initialization.
5) Only initializing the ScanKey when necessary, i.e. catcache misses,
   greatly reduces unnecessary cpu cache misses.
6) Split of the cache-miss case from the hash lookup, reducing stack
   allocations etc in the common case.
7) CatCTup and their corresponding heaptuple are allocated in one
   piece.

This results in making cache lookups themselves roughly three times as
fast - full-system benchmarks obviously improve less than that.

I've also evaluated further techniques:
- replace open coded hash with simplehash - the list walk right now
  shows up in profiles. Unfortunately it's not easy to do so safely as
  an entry's memory location can change at various times, which
  doesn't work well with the refcounting and cache invalidation.
- Cacheline-aligning CatCTup entries - helps some with performance,
  but the win isn't big and the code for it is ugly, because the
  tuples have to be freed as well.
- add more proper functions, rather than macros for
  SearchSysCacheCopyN etc., but right now they don't show up in
  profiles.

The reason the macro wrappers for syscache.c/h have to be changed,
rather than just catcache, is that doing otherwise would require
exposing the SysCache array to the outside.  That might be a good idea
anyway, but it's for another day.

Author: Andres Freund
Reviewed-By: Robert Haas
Discussion: https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de
2017-10-13 14:22:41 -07:00
Andres Freund cc5f81366c Add support for coordinating record typmods among parallel workers.
Tuples can have type RECORDOID and a typmod number that identifies a blessed
TupleDesc in a backend-private cache.  To support the sharing of such tuples
through shared memory and temporary files, provide a typmod registry in
shared memory.

To achieve that, introduce per-session DSM segments, created on demand when a
backend first runs a parallel query.  The per-session DSM segment has a
table-of-contents just like the per-query DSM segment, and initially the
contents are a shared record typmod registry and a DSA area to provide the
space it needs to grow.

State relating to the current session is accessed via a Session object
reached through global variable CurrentSession that may require significant
redesign further down the road as we figure out what else needs to be shared
or remodelled.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-09-14 19:59:21 -07:00
Andres Freund 8c0d7bafad Hash tables backed by DSA shared memory.
Add general purpose chaining hash tables for DSA memory.  Unlike
DynaHash in shared memory mode, these hash tables can grow as
required, and cope with being mapped into different addresses in
different backends.

There is a wide range of potential users for such a hash table, though
it's very likely the interface will need to evolve as we come to
understand the needs of different kinds of users.  E.g. support for
iterators and incremental resizing is planned for later commits and
the details of the callback signatures are likely to change.

Author: Thomas Munro
Reviewed-By: John Gorman, Andres Freund, Dilip Kumar, Robert Haas
Discussion:
	https://postgr.es/m/CAEepm=3d8o8XdVwYT6O=bHKsKAM2pu2D6sV1S_=4d+jStVCE7w@mail.gmail.com
	https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-08-22 22:43:07 -07:00
Robert Haas 79ccd7cbd5 pg_prewarm: Add automatic prewarm feature.
Periodically while the server is running, and at shutdown, write out a
list of blocks in shared buffers.  When the server reaches consistency
-- unfortunately, we can't do it before that point without breaking
things -- reload those blocks into any still-unused shared buffers.

Mithun Cy and Robert Haas, reviewed and tested by Beena Emerson,
Amit Kapila, Jim Nasby, and Rafia Sabih.

Discussion: http://postgr.es/m/CAD__OugubOs1Vy7kgF6xTjmEqTR4CrGAv8w+ZbaY_+MZeitukw@mail.gmail.com
2017-08-21 14:17:39 -04:00
Tom Lane 21d304dfed Final pgindent + perltidy run for v10. 2017-08-14 17:29:33 -04:00
Dean Rasheed d363d42bb9 Use MINVALUE/MAXVALUE instead of UNBOUNDED for range partition bounds.
Previously, UNBOUNDED meant no lower bound when used in the FROM list,
and no upper bound when used in the TO list, which was OK for
single-column range partitioning, but problematic with multiple
columns. For example, an upper bound of (10.0, UNBOUNDED) would not be
collocated with a lower bound of (10.0, UNBOUNDED), thus making it
difficult or impossible to define contiguous multi-column range
partitions in some cases.

Fix this by using MINVALUE and MAXVALUE instead of UNBOUNDED to
represent a partition column that is unbounded below or above
respectively. This syntax removes any ambiguity, and ensures that if
one partition's lower bound equals another partition's upper bound,
then the partitions are contiguous.

Also drop the constraint prohibiting finite values after an unbounded
column, and just document the fact that any values after MINVALUE or
MAXVALUE are ignored. Previously it was necessary to repeat UNBOUNDED
multiple times, which was needlessly verbose.

Note: Forces a post-PG 10 beta2 initdb.
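
A hypothetical two-column example of the new syntax:

CREATE TABLE t (a int, b int) PARTITION BY RANGE (a, b);
CREATE TABLE t_low  PARTITION OF t
  FOR VALUES FROM (MINVALUE, MINVALUE) TO (10, MINVALUE);
CREATE TABLE t_high PARTITION OF t
  FOR VALUES FROM (10, MINVALUE) TO (MAXVALUE, MAXVALUE);
-- the shared bound (10, MINVALUE) makes the two partitions contiguous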

Report by Amul Sul, original patch by Amit Langote with some
additional hacking by me.

Discussion: https://postgr.es/m/CAAJ_b947mowpLdxL3jo3YLKngRjrq9+Ej4ymduQTfYR+8=YAYQ@mail.gmail.com
2017-07-21 09:20:47 +01:00
Tom Lane 9ef2dbefc7 Final pgindent run with old pg_bsd_indent (version 1.3).
This is just to have a clean basis for comparison with the results of
the new version (which will indeed end up reverting some of these
changes...)

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 14:09:24 -04:00