Commit Graph

2008 Commits

Author SHA1 Message Date
Andres Freund 91c40548d5 Fix XLogReader FD leak that makes backends unusable after 2PC usage.
Before the fix every 2PC commit/abort leaked a file descriptor. As the
files are opened using BasicOpenFile(), that quickly leads to the
backend running out of file descriptors.

Once enough 2PC abort/commit have caused enough FDs to leak, any IO
in the backend will fail with "Too many open files", as
BasicOpenFilePerm() will have triggered all open files known to fd.c
to be closed.

The leak causing the problem at hand is a consequence of 0dc8ead46,
but is only exascerbated by it. Previously most XLogPageReadCB
callbacks used static variables to cache one open file, but after the
commit the cache is private to each XLogReader instance. There never
was infrastructure to close FDs at the time of XLogReaderFree, but the
way XLogReader was used limited the leak to one FD.

This commit just closes the during XLogReaderFree() if the FD is
stored in XLogReaderState.seg.ws_segno. This may not be the way to
solve this medium/long term, but at least unbreaks 2PC.

Discussion: https://postgr.es/m/20200406025651.fpzdb5yyb7qyhqko@alap3.anarazel.de
2020-04-07 17:03:04 -07:00
Alvaro Herrera c655077639
Allow users to limit storage reserved by replication slots
Replication slots are useful to retain data that may be needed by a
replication system.  But experience has shown that allowing them to
retain excessive data can lead to the primary failing because of running
out of space.  This new feature allows the user to configure a maximum
amount of space to be reserved using the new option
max_slot_wal_keep_size.  Slots that overrun that space are invalidated
at checkpoint time, enabling the storage to be released.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp
2020-04-07 18:35:00 -04:00
Alexander Korotkov 0f5ca02f53 Implement waiting for given lsn at transaction start
This commit adds following optional clause to BEGIN and START TRANSACTION
commands.

  WAIT FOR LSN lsn [ TIMEOUT timeout ]

New clause pospones transaction start till given lsn is applied on standby.
This clause allows user be sure, that changes previously made on primary would
be visible on standby.

New shared memory struct is used to track awaited lsn per backend.  Recovery
process wakes up backend once required lsn is applied.

Author: Ivan Kartyshov, Anna Akenteva
Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi
Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs
Reviewed-by: Amit Kapila, Alexander Korotkov
Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
2020-04-07 23:51:10 +03:00
Fujii Masao 4bd0ad9e44 Prevent archive recovery from scanning non-existent WAL files.
Previously when there were multiple timelines listed in the history file
of the recovery target timeline, archive recovery searched all of them,
starting from the newest timeline to the oldest one, to find the segment
to read. That is, archive recovery had to continuously fail scanning
the segment until it reached the timeline that the segment belonged to.
These scans for non-existent segment could be harmful on the recovery
performance especially when archival area was located on the remote
storage and each scan could take a long time.

To address the issue, this commit changes archive recovery so that
it skips scanning the timeline that the segment to read doesn't belong to.

Author: Kyotaro Horiguchi, tweaked a bit by Fujii Masao
Reviewed-by: David Steele, Pavel Suderevsky, Grigory Smolkin
Discussion: https://postgr.es/m/16159-f5a34a3a04dc67e0@postgresql.org
Discussion: https://postgr.es/m/20200129.120222.1476610231001551715.horikyota.ntt@gmail.com
2020-04-08 00:49:29 +09:00
Noah Misch c6b92041d3 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 12:25:34 -07:00
Peter Eisentraut 552fcebff0 Revert "Improve handling of parameter differences in physical replication"
This reverts commit 246f136e76.

That patch wasn't quite complete enough.

Discussion: https://www.postgresql.org/message-id/flat/E1jIpJu-0007Ql-CL%40gemulon.postgresql.org
2020-04-04 09:08:12 +02:00
Amit Kapila df3b181499 Add infrastructure to track WAL usage.
This allows gathering the WAL generation statistics for each statement
execution.  The three statistics that we collect are the number of WAL
records, the number of full page writes and the amount of WAL bytes
generated.

This helps the users who have write-intensive workload to see the impact
of I/O due to WAL.  This further enables us to see approximately what
percentage of overall WAL is due to full page writes.

In the future, we can extend this functionality to allow us to compute the
the exact amount of WAL data due to full page writes.

This patch in itself is just an infrastructure to compute WAL usage data.
The upcoming patches will expose this data via explain, auto_explain,
pg_stat_statements and verbose (auto)vacuum output.

Author: Kirill Bychik, Julien Rouhaud
Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
2020-04-04 10:02:08 +05:30
Robert Haas 0d8c9c1210 Generate backup manifests for base backups, and validate them.
A manifest is a JSON document which includes (1) the file name, size,
last modification time, and an optional checksum for each file backed
up, (2) timelines and LSNs for whatever WAL will need to be replayed
to make the backup consistent, and (3) a checksum for the manifest
itself. By default, we use CRC-32C when checksumming data files,
because we are trying to detect corruption and user error, not foil an
adversary. However, pg_basebackup and the server-side BASE_BACKUP
command now have options to select a different algorithm, so users
wanting a cryptographic hash function can select SHA-224, SHA-256,
SHA-384, or SHA-512. Users not wanting file checksums at all can
disable them, or disable generating of the backup manifest altogether.
Using a cryptographic hash function in place of CRC-32C consumes
significantly more CPU cycles, which may slow down backups in some
cases.

A new tool called pg_validatebackup can validate a backup against the
manifest. If no checksums are present, it can still check that the
right files exist and that they have the expected sizes. If checksums
are present, it can also verify that each file has the expected
checksum. Additionally, it calls pg_waldump to verify that the
expected WAL files are present and parseable. Only plain format
backups can be validated directly, but tar format backups can be
validated after extracting them.

Robert Haas, with help, ideas, review, and testing from David Steele,
Stephen Frost, Andrew Dunstan, Rushabh Lathia, Suraj Kharage, Tushar
Ahuja, Rajkumar Raghuwanshi, Mark Dilger, Davinder Singh, Jeevan
Chalke, Amit Kapila, Andres Freund, and Noah Misch.

Discussion: http://postgr.es/m/CA+TgmoZV8dw1H2bzZ9xkKwdrk8+XYa+DC9H=F7heO2zna5T6qg@mail.gmail.com
2020-04-03 15:05:59 -04:00
Tomas Vondra 28cac71bd3 Collect statistics about SLRU caches
There's a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.

This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.

The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.

This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.

Bump catversion as the patch introduces new functions and system view.

Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-02 02:34:21 +02:00
Fujii Masao b0236508d3 Improve the message logged when recovery is paused.
When recovery target is reached and recovery is paused because of
recovery_target_action=pause, executing pg_wal_replay_resume() causes
the standby to promote, i.e., the recovery to end. So, in this case,
the previous message "Execute pg_wal_replay_resume() to continue"
logged was confusing because pg_wal_replay_resume() doesn't cause
the recovery to continue.

This commit improves the message logged when recovery is paused,
and the proper message is output based on what (pg_wal_replay_pause
or recovery_target_action) causes recovery to be paused.

Author: Sergei Kornilov, revised by Fujii Masao
Reviewed-by: Robert Haas
Discussion: https://postgr.es/m/19168211580382043@myt5-b646bde4b8f3.qloud-c.yandex.net
2020-04-01 03:35:13 +09:00
Michael Paquier 616ae3d2b0 Move routine definitions of xlogarchive.c to a new header file
The definitions of the routines defined in xlogarchive.c have been part
of xlog_internal.h which is included by several frontend tools, but all
those routines are only called by the backend.  More cleanup could be
done within xlog_internal.h, but that's already a nice cut.

This will help a follow-up patch for pg_rewind where handling of
restore_command is added for frontends.

Author: Alexey Kondratov, Michael Paquier
Reviewed-by: Álvaro Herrera, Alexander Korotkov
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
2020-03-31 15:33:04 +09:00
Peter Eisentraut 246f136e76 Improve handling of parameter differences in physical replication
When certain parameters are changed on a physical replication primary,
this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
record.  The standby then checks whether its own settings are at least
as big as the ones on the primary.  If not, the standby shuts down
with a fatal error.

The correspondence of settings between primary and standby is required
because those settings influence certain shared memory sizings that
are required for processing WAL records that the primary might send.
For example, if the primary sends a prepared transaction, the standby
must have had max_prepared_transaction set appropriately or it won't
be able to process those WAL records.

However, fatally shutting down the standby immediately upon receipt of
the parameter change record might be a bit of an overreaction.  The
resources related to those settings are not required immediately at
that point, and might never be required if the activity on the primary
does not exhaust all those resources.  If we just let the standby roll
on with recovery, it will eventually produce an appropriate error when
those resources are used.

So this patch relaxes this a bit.  Upon receipt of
XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
warning and set a global flag if there is a problem.  Then when we
actually hit the resource issue and the flag was set, we issue another
warning message with relevant information.  At that point we pause
recovery, so a hot standby remains usable.  We also repeat the last
warning message once a minute so it is harder to miss or ignore.

Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
2020-03-30 09:53:45 +02:00
Alvaro Herrera 1e6148032e
Allow walreceiver configuration to change on reload
The parameters primary_conninfo, primary_slot_name and
wal_receiver_create_temp_slot can now be changed with a simple "reload"
signal, no longer requiring a server restart.  This is achieved by
signalling the walreceiver process to terminate and having it start
again with the new values.

Thanks to Andres Freund, Kyotaro Horiguchi, Fujii Masao for discussion.

Author: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/19513901543181143@sas1-19a94364928d.qloud-c.yandex.net
2020-03-27 19:51:37 -03:00
Alvaro Herrera 092c6936de
Set wal_receiver_create_temp_slot PGC_POSTMASTER
Commit 3297308278 gave walreceiver the ability to create and use a
temporary replication slot, and made it controllable by a GUC (enabled
by default) that can be changed with SIGHUP.  That's useful but has two
problems: one, it's possible to cause the origin server to fill its disk
if the slot doesn't advance in time; and also there's a disconnect
between state passed down via the startup process and GUCs that
walreceiver reads directly.

We handle the first problem by setting the option to disabled by
default.  If the user enables it, its on their head to make sure that
disk doesn't fill up.

We handle the second problem by passing the flag via startup rather than
having walreceiver acquire it directly, and making it PGC_POSTMASTER
(which ensures a walreceiver always has the fresh value).  A future
commit can relax this (to PGC_SIGHUP again) by having the startup
process signal walreceiver to shutdown whenever the value changes.

Author: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200122055510.GH174860@paquier.xyz
2020-03-27 16:20:33 -03:00
Fujii Masao 496ee647ec Prefer standby promotion over recovery pause.
Previously if a promotion was triggered while recovery was paused,
the paused state continued. Also recovery could be paused by executing
pg_wal_replay_pause() even while a promotion was ongoing. That is,
recovery pause had higher priority over a standby promotion.
But this behavior was not desirable because most users basically wanted
the recovery to complete as soon as possible and the server to become
the master when they requested a promotion.

This commit changes recovery so that it prefers a promotion over
recovery pause. That is, if a promotion is triggered while recovery
is paused, the paused state ends and a promotion continues. Also
this commit makes recovery pause functions like pg_wal_replay_pause()
throw an error if they are executed while a promotion is ongoing.

Internally, this commit adds new internal function PromoteIsTriggered()
that returns true if a promotion is triggered. Since the name of
this function and the existing function IsPromoteTriggered() are
confusingly similar, the commit changes the name of IsPromoteTriggered()
to IsPromoteSignaled, as more appropriate name.

Author: Fujii Masao
Reviewed-by: Atsushi Torikoshi, Sergei Kornilov
Discussion: https://postgr.es/m/00c194b2-dbbb-2e8a-5b39-13f14048ef0a@oss.nttdata.com
2020-03-24 12:46:48 +09:00
Michael Paquier e09ad07b21 Move routine building restore_command to src/common/
restore_command has only been used until now by the backend, but there
is a pending patch for pg_rewind to make use of that in the frontend.

Author: Alexey Kondratov
Reviewed-by: Andrey Borodin, Andres Freund, Alvaro Herrera, Alexander
Korotkov, Michael Paquier
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
2020-03-24 12:13:36 +09:00
Fujii Masao b8e20d6dab Add wait events for WAL archive and recovery pause.
This commit introduces new wait events BackupWaitWalArchive and
RecoveryPause. The former is reported while waiting for the WAL files
required for the backup to be successfully archived. The latter is
reported while waiting for recovery in pause state to be resumed.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Atsushi Torikoshi, Robert Haas
Discussion: https://postgr.es/m/f0651f8c-9c96-9f29-0ff9-80414a15308a@oss.nttdata.com
2020-03-24 11:12:21 +09:00
Noah Misch de9396326e Revert "Skip WAL for new relfilenodes, under wal_level=minimal."
This reverts commit cb2fd7eac2.  Per
numerous buildfarm members, it was incompatible with parallel query, and
a test case assumed LP64.  Back-patch to 9.5 (all supported versions).

Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
2020-03-22 09:24:09 -07:00
Noah Misch cb2fd7eac2 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Back-patch to 9.5 (all supported versions).  This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC.  As
always, update standby systems before master systems.  This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions.  (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-03-21 09:38:26 -07:00
Noah Misch d3e572855b In log_newpage_range(), heed forkNum and page_std arguments.
The function assumed forkNum=MAIN_FORKNUM and page_std=true, ignoring
the actual arguments.  Existing callers passed exactly those values, so
there's no live bug.  Back-patch to v12, where the function first
appeared, because another fix needs this.

Discussion: https://postgr.es/m/20191118045434.GA1173436@rfd.leadboat.com
2020-03-21 09:38:26 -07:00
Fujii Masao 1d253bae57 Rename the recovery-related wait events.
This commit renames RecoveryWalAll and RecoveryWalStream wait events to
RecoveryWalStream and RecoveryRetrieveRetryInterval, respectively,
in order to make the names and what they are more consistent. For example,
previously RecoveryWalAll was reported as a wait event while the recovery
was waiting for WAL from a stream, and which was confusing because the name
was very different from the situation where the wait actually could happen.

The names of macro variables for those wait events also are renamed
accordingly.

This commit also changes the category of RecoveryRetrieveRetryInterval to
Timeout from Activity because the wait event is reported while waiting based
on wal_retrieve_retry_interval.

Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Atsushi Torikoshi
Discussion: https://postgr.es/m/124997ee-096a-5d09-d8da-2c7a57d0816e@oss.nttdata.com
2020-03-19 15:32:55 +09:00
Fujii Masao 1429d3f767 Fix comment in xlog.c.
This commit fixes the comment about SharedHotStandbyActive variable.
The comment was apparently copy-and-pasted.

Author: Atsushi Torikoshi
Discussion: https://postgr.es/m/CACZ0uYEjpqZB9wN2Rwc_RMvDybyYqdbkPuDr1NyxJg4f9yGfMw@mail.gmail.com
2020-03-17 12:06:22 +09:00
Tom Lane 41b45576d5 Remove useless pfree()s at the ends of various ValuePerCall SRFs.
We don't need to manually clean up allocations in a SRF's
multi_call_memory_ctx, because the SRF_RETURN_DONE infrastructure
takes care of that (and also ensures that it will happen even if the
function never gets a final call, which simple manual cleanup cannot
do).

Hence, the code removed by this patch is a waste of code and cycles.
Worse, it gives the impression that cleaning up manually is a thing,
which can lead to more serious errors such as those fixed in
commits 085b6b667 and b4570d33a.  So we should get rid of it.

These are not quite actual bugs though, so I couldn't muster the
enthusiasm to back-patch.  Fix in HEAD only.

Justin Pryzby

Discussion: https://postgr.es/m/20200308173103.GC1357@telsasoft.com
2020-03-16 21:36:53 -04:00
Peter Eisentraut bf68b79e50 Refactor ps_status.c API
The init_ps_display() arguments were mostly lies by now, so to match
typical usage, just use one argument and let the caller assemble it
from multiple sources if necessary.  The only user of the additional
arguments is BackendInitialize(), which was already doing string
assembly on the caller side anyway.

Remove the second argument of set_ps_display() ("force") and just
handle that in init_ps_display() internally.

BackendInitialize() also used to set the initial status as
"authentication", but that was very far from where authentication
actually happened.  So now it's set to "initializing" and then
"authentication" just before the actual call to
ClientAuthentication().

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c65e5196-4f04-4ead-9353-6088c19615a3@2ndquadrant.com
2020-03-11 16:38:31 +01:00
Peter Eisentraut aaa3aeddee Remove HAVE_WORKING_LINK
Previously, hard links were not used on Windows and Cygwin, but they
support them just fine in currently supported OS versions, so we can
use them there as well.

Since all supported platforms now support hard links, we can remove
the alternative code paths.

Rename durable_link_or_rename() to durable_rename_excl() to make the
purpose more clear without referencing the implementation details.

Discussion: https://www.postgresql.org/message-id/flat/72fff73f-dc9c-4ef4-83e8-d2e60c98df48%402ndquadrant.com
2020-03-11 11:23:04 +01:00
Fujii Masao 71e0d0a737 Tidy up XLogSource code in xlog.c.
This commit replaces 0 used as an initial value of XLogSource variable,
with XLOG_FROM_ANY. Also this commit changes those variable so that
XLogSource instead of int is used as the type for them. These changes
are for code readability and debugger-friendliness.

Author: Kyotaro Horiguchi
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/20200227.124830.2197604521555566121.horikyota.ntt@gmail.com
2020-03-10 09:41:44 +09:00
Fujii Masao ef34ab42a8 Avoid assertion failure with targeted recovery in standby mode.
At the end of recovery, standby mode is turned off to re-fetch the last
valid record from archive or pg_wal. Previously, if recovery target was
reached and standby mode was turned off while the current WAL source
was stream, recovery could try to retrieve WAL file containing the last
valid record unexpectedly from stream even though not in standby mode.
This caused an assertion failure. That is, the assertion test confirms that
WAL file should not be retrieved from stream if standby mode is not true.

This commit moves back the current WAL source to archive if it's stream
even though not in standby mode, to avoid that assertion failure.

This issue doesn't cause the server to crash when built with assertion
disabled. In this case, the attempt to retrieve WAL file from stream not
in standby mode just fails. And then recovery tries to retrieve WAL file
from archive or pg_wal.

Back-patch to all supported branches.

Author: Kyotaro Horiguchi
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/20200227.124830.2197604521555566121.horikyota.ntt@gmail.com
2020-03-09 15:32:32 +09:00
Fujii Masao e65497df8f Report progress of streaming base backup.
This commit adds pg_stat_progress_basebackup view that reports
the progress while an application like pg_basebackup is taking
a base backup. This uses the progress reporting infrastructure
added by c16dc1aca5, adding support for streaming base backup.

Bump catversion.

Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Amit Langote, Sergei Kornilov
Discussion: https://postgr.es/m/9ed8b801-8215-1f3d-62d7-65bff53f6e94@oss.nttdata.com
2020-03-03 12:03:43 +09:00
Tom Lane 3d475515a1 Account explicitly for long-lived FDs that are allocated outside fd.c.
The comments in fd.c have long claimed that all file allocations should
go through that module, but in reality that's not always practical.
fd.c doesn't supply APIs for invoking some FD-producing syscalls like
pipe() or epoll_create(); and the APIs it does supply for non-virtual
FDs are mostly insistent on releasing those FDs at transaction end;
and in some cases the actual open() call is in code that can't be made
to use fd.c, such as libpq.

This has led to a situation where, in a modern server, there are likely
to be seven or so long-lived FDs per backend process that are not known
to fd.c.  Since NUM_RESERVED_FDS is only 10, that meant we had *very*
few spare FDs if max_files_per_process is >= the system ulimit and
fd.c had opened all the files it thought it safely could.  The
contrib/postgres_fdw regression test, in particular, could easily be
made to fall over by running it under a restrictive ulimit.

To improve matters, invent functions Acquire/Reserve/ReleaseExternalFD
that allow outside callers to tell fd.c that they have or want to allocate
a FD that's not directly managed by fd.c.  Add calls to track all the
fixed FDs in a standard backend session, so that we are honestly
guaranteeing that NUM_RESERVED_FDS FDs remain unused below the EMFILE
limit in a backend's idle state.  The coding rules for these functions say
that there's no need to call them in code that just allocates one FD over
a fairly short interval; we can dip into NUM_RESERVED_FDS for such cases.
That means that there aren't all that many places where we need to worry.
But postgres_fdw and dblink must use this facility to account for
long-lived FDs consumed by libpq connections.  There may be other places
where it's worth doing such accounting, too, but this seems like enough
to solve the immediate problem.

Internally to fd.c, "external" FDs are limited to max_safe_fds/3 FDs.
(Callers can choose to ignore this limit, but of course it's unwise
to do so except for fixed file allocations.)  I also reduced the limit
on "allocated" files to max_safe_fds/3 FDs (it had been max_safe_fds/2).
Conceivably a smarter rule could be used here --- but in practice,
on reasonable systems, max_safe_fds should be large enough that this
isn't much of an issue, so KISS for now.  To avoid possible regression
in the number of external or allocated files that can be opened,
increase FD_MINFREE and the lower limit on max_files_per_process a
little bit; we now insist that the effective "ulimit -n" be at least 64.

This seems like pretty clearly a bug fix, but in view of the lack of
field complaints, I'll refrain from risking a back-patch.

Discussion: https://postgr.es/m/E1izCmM-0005pV-Co@gemulon.postgresql.org
2020-02-24 17:28:33 -05:00
Peter Eisentraut 79c2385915 Factor out InitControlFile() from BootStrapXLOG()
Right now this only makes BootStrapXLOG() a bit more manageable, but
in the future there may be external callers.

Discussion: https://www.postgresql.org/message-id/e8f86ba5-48f1-a80a-7f1d-b76bcb9c5c47@2ndquadrant.com
2020-02-22 12:09:27 +01:00
Peter Eisentraut 9745f93afc Reformat code comment
Discussion: https://www.postgresql.org/message-id/e8f86ba5-48f1-a80a-7f1d-b76bcb9c5c47@2ndquadrant.com
2020-02-22 12:09:27 +01:00
Amit Kapila 3dfba9fdf5 Fix typos.
Reported-by: Justin Pryzby
Author: Justin Pryzby
Discussion: https://postgr.es/m/20200206021432.GA24549@telsasoft.com
2020-02-10 09:31:18 +05:30
Alvaro Herrera c9d2977519 Clean up newlines following left parentheses
We used to strategically place newlines after some function call left
parentheses to make pgindent move the argument list a few chars to the
left, so that the whole line would fit under 80 chars.  However,
pgindent no longer does that, so the newlines just made the code
vertically longer for no reason.  Remove those newlines, and reflow some
of those lines for some extra naturality.

Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
2020-01-30 13:42:14 -03:00
Alvaro Herrera 4e89c79a52 Remove excess parens in ereport() calls
Cosmetic cleanup, not worth backpatching.

Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
Reviewed-by: Tom Lane, Michael Paquier
2020-01-30 13:32:04 -03:00
Peter Eisentraut dc788668bb Fail if recovery target is not reached
Before, if a recovery target is configured, but the archive ended
before the target was reached, recovery would end and the server would
promote without further notice.  That was deemed to be pretty wrong.
With this change, if the recovery target is not reached, it is a fatal
error.

Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no
2020-01-29 15:58:14 +01:00
Heikki Linnakangas 30012a04a6 Fix randAccess setting in ReadRecord()
Commit 38a957316d got this backwards.

Author: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/20200128.194408.2260703306774646445.horikyota.ntt@gmail.com
2020-01-28 12:55:30 +02:00
Heikki Linnakangas 38a957316d Refactor XLogReadRecord(), adding XLogBeginRead() function.
The signature of XLogReadRecord() required the caller to pass the starting
WAL position as argument, or InvalidXLogRecPtr to continue reading at the
end of previous record. That's slightly awkward to the callers, as most
of them don't want to randomly jump around in the WAL stream, but start
reading at one position and then read everything from that point onwards.
Remove the 'RecPtr' argument and add a new function XLogBeginRead() to
specify the starting position instead. That's more convenient for the
callers. Also, xlogreader holds state that is reset when you change the
starting position, so having a separate function for doing that feels like
a more natural fit.

This changes XLogFindNextRecord() function so that it doesn't reset the
xlogreader's state to what it was before the call anymore. Instead, it
positions the xlogreader to the found record, like XLogBeginRead().

Reviewed-by: Kyotaro Horiguchi, Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/5382a7a3-debe-be31-c860-cb810c08f366%40iki.fi
2020-01-26 11:39:00 +02:00
Fujii Masao 41c184bc64 Add GUC ignore_invalid_pages.
Detection of WAL records having references to invalid pages
during recovery causes PostgreSQL to raise a PANIC-level error,
aborting the recovery. Setting ignore_invalid_pages to on causes
the system to ignore those WAL records (but still report a warning),
and continue recovery. This behavior may cause crashes, data loss,
propagate or hide corruption, or other serious problems.
However, it may allow you to get past the PANIC-level error,
to finish the recovery, and to cause the server to start up.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://www.postgresql.org/message-id/CAHGQGwHCK6f77yeZD4MHOnN+PaTf6XiJfEB+Ce7SksSHjeAWtg@mail.gmail.com
2020-01-22 11:56:34 +09:00
Amit Kapila 40d964ec99 Allow vacuum command to process indexes in parallel.
This feature allows the vacuum to leverage multiple CPUs in order to
process indexes.  This enables us to perform index vacuuming and index
cleanup with background workers.  This adds a PARALLEL option to VACUUM
command where the user can specify the number of workers that can be used
to perform the command which is limited by the number of indexes on a
table.  Specifying zero as a number of workers will disable parallelism.
This option can't be used with the FULL option.

Each index is processed by at most one vacuum process.  Therefore parallel
vacuum can be used when the table has at least two indexes.

The parallel degree is either specified by the user or determined based on
the number of indexes that the table has, and further limited by
max_parallel_maintenance_workers.  The index can participate in parallel
vacuum iff it's size is greater than min_parallel_index_scan_size.

Author: Masahiko Sawada and Amit Kapila
Reviewed-by: Dilip Kumar, Amit Kapila, Robert Haas, Tomas Vondra,
Mahendra Singh and Sergei Kornilov
Tested-by: Mahendra Singh and Prabhat Sahu
Discussion:
https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com
https://postgr.es/m/CAA4eK1J-VoR9gzS5E75pcD-OH0mEyCdp8RihcwKrcuw7J-Q0+w@mail.gmail.com
2020-01-20 07:57:49 +05:30
Alvaro Herrera f5d28710c7 Reimplement nullification of walsender timestamp
Make the value null only at pg_stat_activity-output time, as suggested
by Tom Lane, instead of messing with the internal state.  This should
appease buildfarm members with force_parallel_mode=regress, which are
running parallel queries on logical replication walsenders.

The fact that walsenders can run parallel queries should perhaps be
studied more carefully, but for the moment let's get rid of the red
blots in buildfarm.

Backpatch to pg10, like the previous commit.

Discussion: https://postgr.es/m/30804.1578438763@sss.pgh.pa.us
2020-01-08 14:33:49 -03:00
Alvaro Herrera b175bd59fa pg_stat_activity: show NULL stmt start time for walsenders
Returning a non-NULL time is pointless, sinc a walsender is not a
process that would be running normal transactions anyway, but the code
was unintentionally exposing the process start time intermittently,
which was not only bogus but it also confused monitoring systems looking
for idle transactions.  Fix by avoiding all updates in walsenders.

Backpatch to 11, where walsenders started appearing in pg_stat_activity.

Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/20191209234409.exe7osmyalwkt5j4@development
2020-01-07 17:38:48 -03:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Alvaro Herrera c4dcd9144b Avoid splitting C string literals with \-newline
Using \ is unnecessary and ugly, so remove that.  While at it, stitch
the literals back into a single line: we've long discouraged splitting
error message literals even when they go past the 80 chars line limit,
to improve greppability.

Leave contrib/tablefunc alone.

Discussion: https://postgr.es/m/20191223195156.GA12271@alvherre.pgsql
2019-12-24 12:44:12 -03:00
Robert Haas 303640199d Fix minor problems with non-exclusive backup cleanup.
The previous coding imagined that it could call before_shmem_exit()
when a non-exclusive backup began and then remove the previously-added
handler by calling cancel_before_shmem_exit() when that backup
ended. However, this only works provided that nothing else in the
system has registered a before_shmem_exit() hook in the interim,
because cancel_before_shmem_exit() is documented to remove a callback
only if it is the latest callback registered. It also only works
if nothing can ERROR out between the time that sessionBackupState
is reset and the time that cancel_before_shmem_exit(), which doesn't
seem to be strictly true.

To fix, leave the handler installed for the lifetime of the session,
arrange to install it just once, and teach it to quietly do nothing if
there isn't a non-exclusive backup in process.

This is a bug, but for now I'm not going to back-patch, because the
consequences are minor. It's possible to cause a spurious warning
to be generated, but that doesn't really matter. It's also possible
to trigger an assertion failure, but production builds shouldn't
have assertions enabled.

Patch by me, reviewed by Kyotaro Horiguchi, Michael Paquier (who
preferred a different approach, but got outvoted), Fujii Masao,
and Tom Lane, and with comments by various others.

Discussion: http://postgr.es/m/CA+TgmobMjnyBfNhGTKQEDbqXYE3_rXWpc4CM63fhyerNCes3mA@mail.gmail.com
2019-12-19 09:06:54 -05:00
Michael Paquier 70116493a8 Remove shadow variables linked to RedoRecPtr in xlog.c
This changes the routines in charge of recycling WAL segments past the
last redo LSN to not use anymore "RedoRecPtr" as a local variable, which
is also available in the context of the session as a static declaration,
replacing it with "lastredoptr".  This confusion has been introduced by
d9fadbf, so backpatch down to v11 like the other commit.

Thanks to Tom Lane, Robert Haas, Alvaro Herrera, Mark Dilger and Kyotaro
Horiguchi for the input provided.

Author: Ranier Vilela
Discussion: https://postgr.es/m/MN2PR18MB2927F7B5F690065E1194B258E35D0@MN2PR18MB2927.namprd18.prod.outlook.com
Backpatch-through: 11
2019-12-18 10:11:13 +09:00
Amit Kapila af3290f5e7 Change overly strict Assert in TransactionGroupUpdateXidStatus.
This Assert thought that an overflowed transaction can never get registered
for the group update.  But that is not true, because even when the number
of children for a transaction got reduced, the overflow flag is not
changed.  And, for group update, we only care about the current number of
children for a transaction that is being committed.

Based on comments by Andres Freund, remove a redundant Assert in
TransactionIdSetPageStatus as we already had a static Assert for the same
condition a few lines earlier.

Reported-by: Vignesh C
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 11
Discussion: https://postgr.es/m/CAFiTN-s5=uJw-Z6JC9gcqtBSjXsrHnU63PXBrA=pnBjqnkm5UA@mail.gmail.com
2019-12-17 09:29:22 +05:30
Alvaro Herrera 91fca4bb60 Demote variable from global to local
recoveryDelayUntilTime was introduced by commit 36da3cfb45 as a global
because its method of operation was devilishly intrincate.  Commit
c945af80cf removed all that complexity and could have turned it into a
local variable, but didn't.  Do so now.

Discussion: https://postgr.es/m/20191213200751.GA10731@alvherre.pgsql
Reviewed-by: Michaël Paquier, Daniel Gustafsson
2019-12-16 14:23:56 -03:00
Michael Paquier 68ab982906 Fix thinkos from commit 9989d37
Error messages referring to incorrect WAL segment names could have been
generated for a fsync() failure or when creating a new segment at the
end of recovery.
2019-12-03 18:59:09 +09:00
Michael Paquier 9989d37d1c Remove XLogFileNameP() from the tree
XLogFileNameP() is a wrapper routine able to build a palloc'd string for
a WAL segment name, which is used for error string generation.  There
were several code paths where it gets called in a critical section,
where memory allocation is not allowed.  This results in triggering
an assertion failure instead of generating the wanted error message.

Another, more annoying, problem is that if the allocation to generate
the WAL segment name fails on OOM, then the failure would be escalated
to a PANIC.

This removes the routine and all its callers are replaced with a logic
using a fixed-size buffer.  This way, all the existing mistakes are
fixed and future ones are prevented.

Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Álvaro Herrera
Discussion: https://postgr.es/m/CA+fd4k5gC9H4uoWMLg9K_QfNrnkkdEw+-AFveob9YX7z8JnKTA@mail.gmail.com
2019-12-03 15:06:04 +09:00
Alvaro Herrera 3974c4a724 Remove useless "return;" lines
Discussion: https://postgr.es/m/20191128144653.GA27883@alvherre.pgsql
2019-11-28 16:48:37 -03:00
Alvaro Herrera 0dc8ead463 Refactor WAL file-reading code into WALRead()
XLogReader, walsender and pg_waldump all had their own routines to read
data from WAL files to memory, with slightly different approaches
according to the particular conditions of each environment.  There's a
lot of commonality, so we can refactor that into a single routine
WALRead in XLogReader, and move the differences to a separate (simpler)
callback that just opens the next WAL-segment.  This results in a
clearer (ahem) code flow.

The error reporting needs are covered by filling in a new error-info
struct, WALReadError, and it's the caller's responsibility to act on it.
The backend has WALReadRaiseError() to do so.

We no longer ever need to seek in this interface; switch to using
pg_pread().

Author: Antonin Houska, with contributions from Álvaro Herrera
Reviewed-by: Michaël Paquier, Kyotaro Horiguchi
Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-11-25 15:04:54 -03:00
Tom Lane 6b802cfc7f Avoid assertion failure with LISTEN in a serializable transaction.
If LISTEN is the only action in a serializable-mode transaction,
and the session was not previously listening, and the notify queue
is not empty, predicate.c reported an assertion failure.  That
happened because we'd acquire the transaction's initial snapshot
during PreCommit_Notify, which was called *after* predicate.c
expects any such snapshot to have been established.

To fix, just swap the order of the PreCommit_Notify and
PreCommit_CheckForSerializationFailure calls during CommitTransaction.
This will imply holding the notify-insertion lock slightly longer,
but the difference could only be meaningful in serializable mode,
which is an expensive option anyway.

It appears that this is just an assertion failure, with no
consequences in non-assert builds.  A snapshot used only to scan
the notify queue could not have been involved in any serialization
conflicts, so there would be nothing for
PreCommit_CheckForSerializationFailure to do except assign it a
prepareSeqNo and set the SXACT_FLAG_PREPARED flag.  And given no
conflicts, neither of those omissions affect the behavior of
ReleasePredicateLocks.  This admittedly once-over-lightly analysis
is backed up by the lack of field reports of trouble.

Per report from Mark Dilger.  The bug is old, so back-patch to all
supported branches; but the new test case only goes back to 9.6,
for lack of adequate isolationtester infrastructure before that.

Discussion: https://postgr.es/m/3ac7f397-4d5f-be8e-f354-440020675694@gmail.com
Discussion: https://postgr.es/m/13881.1574557302@sss.pgh.pa.us
2019-11-24 15:57:49 -05:00
Peter Eisentraut 2e4db241bf Remove configure --disable-float4-byval
This build option was only useful to maintain compatibility for
version-0 functions, but those are no longer supported, so this option
can be removed.

float4 is now always pass-by-value; the pass-by-reference code path is
completely removed.

Discussion: https://www.postgresql.org/message-id/flat/f3e1e576-2749-bbd7-2d57-3f9dcf75255a@2ndquadrant.com
2019-11-21 18:29:21 +01:00
Fujii Masao 7b8a899bde Make pg_waldump report more detail information about PREPARE TRANSACTION record.
This commit changes xact_desc() so that it reports the detail information about
PREPARE TRANSACTION record, like GID (global transaction identifier),
timestamp at prepare transaction, delete-on-abort/commit relations,
XID of subtransactions, and invalidation messages. These are helpful
when diagnosing 2PC-related troubles.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Andrey Lepikhov, Kyotaro Horiguchi, Julien Rouhaud, Alvaro Herrera
Discussion: https://postgr.es/m/CAHGQGwEvhASad4JJnCv=0dW2TJypZgW_Vpb-oZik2a3utCqcrA@mail.gmail.com
2019-11-13 16:59:17 +09:00
Amit Kapila 14aec03502 Make the order of the header file includes consistent in backend modules.
Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order
of header file inclusion consistent for backend modules.

In the passing, removed a couple of duplicate inclusions.

Author: Vignesh C
Reviewed-by: Kuntal Ghosh and Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
2019-11-12 08:30:16 +05:30
Thomas Munro 695c5977c8 Optimize TransactionIdIsCurrentTransactionId().
If the passed in xid is the current top transaction, we can do a fast
check and exit early.  This should work well for the current heap but
also works very well for proposed AMs that don't use a separate xid
for subtransactions.

Author: Ashwin Agrawal, based on a suggestion from Andres Freund
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
2019-11-11 16:33:04 +13:00
Peter Eisentraut b85e43feb3 More precise errors from initial pg_control check
Use a separate error message for invalid checkpoint location and
invalid state instead of just "invalid data" for both.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/20191107041630.GK1768@paquier.xyz
2019-11-08 08:03:16 +01:00
Fujii Masao a0c96856e8 Fix assertion failure when running pgbench -s.
If there is the WAL page that the continuation WAL record just fits within
(i.e., the continuation record ends just at the end of the page) and
the LSN in such page is specified with -s option, previously pg_waldump
caused an assertion failure. The cause of this assertion failure was that
XLogFindNextRecord() that pg_waldump -s calls mistakenly handled
such special WAL page.

This commit changes XLogFindNextRecord() so that it can handle
such WAL page correctly.

Back-patch to all supported versions.

Author: Andrey Lepikhov
Reviewed-by: Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/99303554-5dd5-06e6-f943-b3005ccd6edd@postgrespro.ru
2019-11-07 16:31:36 +09:00
Andres Freund 01368e5d9d Split all OBJS style lines in makefiles into one-line-per-entry style.
When maintaining or merging patches, one of the most common sources
for conflicts are the list of objects in makefiles. Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempting to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.

By splitting, and alphabetically sorting, OBJS style lines into one
object per line, conflicts should be less frequent, and easier to
resolve when they still occur.

Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
2019-11-05 14:41:07 -08:00
Michael Paquier 51970fa8df Fix initialization of fake LSN for unlogged relations
9155580 has changed the value of the first fake LSN for unlogged
relations from 1 to FirstNormalUnloggedLSN (aka 1000), GiST requiring a
non-zero LSN on some pages to allow an interlocking logic to work, but
its value was still initialized to 1 at the beginning of recovery or
after running pg_resetwal.  This fixes the initialization for both code
paths.

Author: Takayuki Tsunakawa
Reviewed-by: Dilip Kumar, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/OSBPR01MB2503CE851940C17DE44AE3D9FE6F0@OSBPR01MB2503.jpnprd01.prod.outlook.com
Backpatch-through: 12
2019-10-27 13:54:12 +09:00
Fujii Masao 3b0c59ac1c Fix typo in xlog.c.
Author: Fujii Masao
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAHGQGwH7dtYvOZZ8c0AG5AJwH5pfiRdKaCptY1_RdHy0HYeRfQ@mail.gmail.com
2019-10-24 14:13:36 +09:00
Fujii Masao ec1259e880 Fix failure of archive recovery with recovery_min_apply_delay enabled.
recovery_min_apply_delay parameter is intended for use with streaming
replication deployments. However, the document clearly explains that
the parameter will be honored in all cases if it's specified. So it should
take effect even if in archive recovery. But, previously, archive recovery
with recovery_min_apply_delay enabled always failed, and caused assertion
failure if --enable-caasert is enabled.

The cause of this problem is that; the ownership of recoveryWakeupLatch
that recovery_min_apply_delay uses was taken only when standby mode
is requested. So unowned latch could be used in archive recovery, and
which caused the failure.

This commit changes recovery code so that the ownership of
recoveryWakeupLatch is taken even in archive recovery. Which prevents
archive recovery with recovery_min_apply_delay from failing.

Back-patch to v9.4 where recovery_min_apply_delay was added.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAHGQGwEyD6HdZLfdWc+95g=VQFPR4zQL4n+yHxQgGEGjaSVheQ@mail.gmail.com
2019-10-18 22:32:18 +09:00
Fujii Masao 9b95a36be8 Make crash recovery ignore recovery_min_apply_delay setting.
In v11 or before, this setting could not take effect in crash recovery
because it's specified in recovery.conf and crash recovery always
starts without recovery.conf. But commit 2dedf4d9a8 integrated
recovery.conf into postgresql.conf and which unexpectedly allowed
this setting to take effect even in crash recovery. This is definitely
not good behavior.

To fix the issue, this commit makes crash recovery always ignore
recovery_min_apply_delay setting.

Back-patch to v12 where the issue was added.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAHGQGwEyD6HdZLfdWc+95g=VQFPR4zQL4n+yHxQgGEGjaSVheQ@mail.gmail.com
Discussion: https://postgr.es/m/e445616d-023e-a268-8aa1-67b8b335340c@pgmasters.net
2019-10-18 22:24:18 +09:00
Fujii Masao 20961ceaf0 Make crash recovery ignore restore_command and recovery_end_command settings.
In v11 or before, those settings could not take effect in crash recovery
because they are specified in recovery.conf and crash recovery always
starts without recovery.conf. But commit 2dedf4d9a8 integrated
recovery.conf into postgresql.conf and which unexpectedly allowed
those settings to take effect even in crash recovery. This is definitely
not good behavior.

To fix the issue, this commit makes crash recovery always ignore
restore_command and recovery_end_command settings.

Back-patch to v12 where the issue was added.

Author: Fujii Masao
Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/e445616d-023e-a268-8aa1-67b8b335340c@pgmasters.net
2019-10-11 15:47:59 +09:00
Robert Haas 967e276e9f Remove AtSubStart_Notify.
Allocate notify-related state lazily instead. This makes trivial
subtransactions noticeably faster.

Patch by me, reviewed and tested by Dilip Kumar, Kyotaro Horiguchi,
and Jeevan Ladhe.

Discussion: https://postgr.es/m/CA+TgmobE1J22S1eC-6N-je9LgrcwZypkwp+zH6JXo9mc=4Nk3A@mail.gmail.com
2019-10-04 08:19:25 -04:00
Michael Paquier df86e52cac Remove temporary WAL and history files at the end of archive recovery
cbc55da has reworked the order of some actions at the end of archive
recovery.  Unfortunately this overlooked the fact that the startup
process needs to remove RECOVERYXLOG (for temporary WAL segment newly
recovered from archives) and RECOVERYHISTORY (for temporary history
file) at this step, leaving the files around even after recovery ended.

Backpatch to 9.5, like the previous commit.

Author: Sawada Masahiko
Reviewed-by: Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/CAD21AoBO_eDQub6zojFnWtnmutRBWvYf7=cW4Hsqj+U_R26w3Q@mail.gmail.com
Backpatch-through: 9.5
2019-10-02 15:53:07 +09:00
Fujii Masao 7acf8a876b Make crash recovery ignore recovery target settings.
In v11 or before, recovery target settings could not take effect in
crash recovery because they are specified in recovery.conf and
crash recovery always starts without recovery.conf. But commit
2dedf4d9a8 integrated recovery.conf into postgresql.conf and
which unexpectedly allowed recovery target settings to take effect
even in crash recovery. This is definitely not good behavior.

To fix the issue, this commit makes crash recovery always ignore
recovery target settings.

Back-patch to v12.

Author: Peter Eisentraut
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/e445616d-023e-a268-8aa1-67b8b335340c@pgmasters.net
2019-09-30 10:18:15 +09:00
Michael Paquier 6e22813b2d Fix comment in xlogreader.c
This has been introduced by 709d003, that has moved readSegNo, readOff
and readPageTLI into a new structure called WALOpenSegment initialized
separately.

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20190926.110809.248342687.horikyota.ntt@gmail.com
2019-09-26 11:53:37 +09:00
Alvaro Herrera 709d003fbd Rework WAL-reading supporting structs
The state-tracking of WAL reading in various places was pretty messy,
mostly because the ancient physical-replication WAL reading code wasn't
using the XLogReader abstraction.  This led to some untidy code.  Make
it prettier by creating two additional supporting structs,
WALSegmentContext and WALOpenSegment which keep track of WAL-reading
state.  This makes code cleaner, as well as supports more future
cleanup.

Author: Antonin Houska
Reviewed-by: Álvaro Herrera and (older versions) Robert Haas
Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-09-24 16:39:53 -03:00
Peter Eisentraut 887248e97e Message style fixes 2019-09-23 13:38:39 +02:00
Peter Eisentraut 48770492c3 Add some const decorations to array constants
Author: Mark G <markg735@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEeOP_YFVeFjq4zDZLDQbLSRFxBiTpwBQHxCNgGd%2Bp5VztTXyQ%40mail.gmail.com
2019-09-17 22:03:00 +02:00
Peter Eisentraut 862ef372d6 Fix behavior of AND CHAIN outside of explicit transaction blocks
When using COMMIT AND CHAIN or ROLLBACK AND CHAIN not in an explicit
transaction block, the previous implementation would leave a
transaction block active in the ROLLBACK case but not the COMMIT case.
To fix for now, error out when using these commands not in an explicit
transaction block.  This restriction could be lifted if a sensible
definition and implementation is found.

Bug: #15977
Author: fn ln <emuser20140816@gmail.com>
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
2019-09-08 16:23:03 +02:00
Fujii Masao 946647f845 Make pg_promote() detect postmaster death while waiting for promotion to end.
Previously even if postmaster died and WaitLatch() woke up with that event
while pg_promote() was waiting for the standby promotion to finish,
pg_promote() did nothing special and kept waiting until timeout occurred.
This could cause a busy loop.

This patch make pg_promote() return false immediately when postmaster
dies, to avoid such a busy loop.

Back-patch to v12 where pg_promote() was added.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAHGQGwEs9ROgSp+QF+YdDU+xP8W=CY1k-_Ov-d_Z3JY+to3eXA@mail.gmail.com
2019-09-06 14:27:25 +09:00
Robert Haas 8b94dab066 Split tuptoaster.c into three separate files.
detoast.c/h contain functions required to detoast a datum, partially
or completely, plus a few other utility functions for examining the
size of toasted datums.

toast_internals.c/h contain functions that are used internally to the
TOAST subsystem but which (mostly) do not need to be accessed from
outside.

heaptoast.c/h contains code that is intrinsically specific to the
heap AM, either because it operates on HeapTuples or is based on the
layout of a heap page.

detoast.c and toast_internals.c are placed in
src/backend/access/common rather than src/backend/access/heap.  At
present, both files still have dependencies on the heap, but that will
be improved in a future commit.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-09-05 13:15:10 -04:00
Alvaro Herrera 25dcc9d35d Make XLogReaderInvalReadState static
This function is only used by xlogreader.c itself, so there's no need to
export it.  It was introduced by commit 3b02ea4f07 with the apparent
intention that it could be used externally, but I couldn't find any
external code calling it.

I (Álvaro) couldn't resist the urge to sort nearby function prototypes
properly while at it.

Author: Antonin Houska
Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-09-03 17:41:43 -04:00
Peter Eisentraut 396e4afdbc Better error messages for short reads/writes in SLRU
This avoids getting a

    Could not read from file ...: Success.

for a short read or write (since errno is not set in that case).
Instead, report a more specific error messages.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/5de61b6b-8be9-7771-0048-860328efe027%402ndquadrant.com
2019-09-03 08:30:21 +02:00
Michael Paquier c96581abe4 Fix inconsistencies and typos in the tree, take 11
This fixes various typos in docs and comments, and removes some orphaned
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/5da8e325-c665-da95-21e0-c8a99ea61fbf@gmail.com
2019-08-19 16:21:39 +09:00
Michael Paquier 66bde49d96 Fix inconsistencies and typos in the tree, take 10
This addresses some issues with unnecessary code comments, fixes various
typos in docs and comments, and removes some orphaned structures and
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com
2019-08-13 13:53:41 +09:00
Michael Paquier 8548ddc61b Fix inconsistencies and typos in the tree, take 9
This addresses more issues with code comments, variable names and
unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/7ab243e0-116d-3e44-d120-76b3df7abefd@gmail.com
2019-08-05 12:14:58 +09:00
Andres Freund 6384e87be2 Remove superfluous semicolon.
Author: Andres Freund
2019-07-30 22:09:53 -07:00
Michael Paquier eb43f3d193 Fix inconsistencies and typos in the tree
This is numbered take 8, and addresses again a set of issues with code
comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/b137b5eb-9c95-9c2f-586e-38aba7d59788@gmail.com
2019-07-29 12:28:30 +09:00
Michael Paquier 23bccc823d Fix inconsistencies and typos in the tree
This is numbered take 7, and addresses a set of issues with code
comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/dff75442-2468-f74f-568c-6006e141062f@gmail.com
2019-07-22 10:01:50 +09:00
Michael Paquier 0896ae561b Fix inconsistencies and typos in the tree
This is numbered take 7, and addresses a set of issues around:
- Fixes for typos and incorrect reference names.
- Removal of unneeded comments.
- Removal of unreferenced functions and structures.
- Fixes regarding variable name consistency.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/10bfd4ac-3e7c-40ab-2b2e-355ed15495e8@gmail.com
2019-07-16 13:23:53 +09:00
Thomas Munro 67b9b3ca32 Provide XLogRecGetFullXid().
In order to be able to work with FullTransactionId values during replay
without increasing the size of the WAL, infer the epoch.  In general we
can't do that safely, but during replay we can because we know that
nextFullXid can't advance concurrently.

Prevent frontend code from seeing this new function, due to the above
restriction.  Perhaps in future it will be possible to extract the value
entirely from independent WAL records, and then this restriction can be
lifted.

Author: Thomas Munro, based on earlier code from Andres Freund
Discussion: https://postgr.es/m/CA%2BhUKG%2BmLmuDjMi6o1dxkKvGRL56Y2Rz%2BiXAcrZV03G9ZuFQ8Q%40mail.gmail.com
2019-07-15 17:04:29 +12:00
Michael Paquier 6b8548964b Fix inconsistencies in the code
This addresses a couple of issues in the code:
- Typos and inconsistencies in comments and function declarations.
- Removal of unreferenced function declarations.
- Removal of unnecessary compile flags.
- A cleanup error in regressplans.sh.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/0c991fdf-2670-1997-c027-772a420c4604@gmail.com
2019-07-08 13:15:09 +09:00
Peter Eisentraut 7e9a4c5c3d Use consistent style for checking return from system calls
Use

    if (something() != 0)
        error ...

instead of just

    if (something)
        error ...

The latter is not incorrect, but it's a bit confusing and not the
common style.

Discussion: https://www.postgresql.org/message-id/flat/5de61b6b-8be9-7771-0048-860328efe027%402ndquadrant.com
2019-07-07 15:28:49 +02:00
Peter Eisentraut 6a1cd8b923 Unwind some workarounds for lack of portable int64 format specifier
Because there is no portable int64/uint64 format specifier and we
can't stick macros like INT64_FORMAT into the middle of a translatable
string, we have been using various workarounds that put the number to
be printed into a string buffer first.  Now that we always use our own
sprintf(), we can rely on %lld and %llu to work, so we can use those.

This patch undoes this workaround in a few places where it was
egregiously verbose.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/CAH2-Wz%3DWbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A%40mail.gmail.com
2019-07-04 17:01:43 +02:00
Peter Eisentraut 21f428ebde Don't call data type input functions in GUC check hooks
Instead of calling pg_lsn_in() in check_recovery_target_lsn and
timestamptz_in() in check_recovery_target_time, reorganize the
respective code so that we don't raise any errors in the check hooks.
The previous code tried to use PG_TRY/PG_CATCH to handle errors in a
way that is not safe, so now the code contains no ereport() calls and
can operate safely within the GUC error handling system.

Moreover, since the interpretation of the recovery_target_time string
may depend on the time zone, we cannot do the final processing of that
string until all the GUC processing is done.  Instead,
check_recovery_target_time() now does some parsing for syntax
checking, but the actual conversion to a timestamptz value is done
later in the recovery code that uses it.

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/20190611061115.njjwkagvxp4qujhp%40alap3.anarazel.de
2019-06-30 10:27:43 +02:00
Michael Paquier 3412030205 Fix more typos and inconsistencies in the tree
Author: Alexander Lakhin
Discussion: https://postgr.es/m/0a5419ea-1452-a4e6-72ff-545b1a5a8076@gmail.com
2019-06-17 16:13:16 +09:00
Noah Misch 31d250e049 Update stale comments, and fix comment typos. 2019-06-08 10:12:26 -07:00
Amit Kapila 92c4abc736 Fix assorted inconsistencies.
There were a number of issues in the recent commits which include typos,
code and comments mismatch, leftover function declarations.  Fix them.

Reported-by: Alexander Lakhin
Author: Alexander Lakhin, Amit Kapila and Amit Langote
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com
2019-06-08 08:16:38 +05:30
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Fujii Masao b8e2170e40 Fix comment for issue_xlog_fsync().
"segno" is the argument for the function, not "log" and "seg".

Author: Antonin Houska
Discussion: https://postgr.es/m/11863.1558361020@spoje.net
2019-05-21 00:44:00 +09:00
Tom Lane 2d7d946cd3 Clean up the behavior and API of catalog.c's is-catalog-relation tests.
The right way for IsCatalogRelation/Class to behave is to return true
for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId),
without any of the ad-hoc fooling around with schema membership.

The previous code was wrong because (1) it claimed that
information_schema tables were not catalog relations but their toast
tables were, which is silly; and (2) if you dropped and recreated
information_schema, which is a supported operation, the behavior
changed.  That's even sillier.  With this definition, "catalog
relations" are exactly the ones traceable to the postgres.bki data,
which seems like what we want.

With this simplification, we don't actually need access to the pg_class
tuple to identify a catalog relation; we only need its OID.  Hence,
replace IsCatalogClass with "IsCatalogRelationOid(oid)".  But keep
IsCatalogRelation as a convenience function.

This allows fixing some arguably-wrong semantics in contrib/sepgsql and
ReindexRelationConcurrently, which were using an IsSystemNamespace test
where what they really should be using is IsCatalogRelationOid.  The
previous coding failed to protect toast tables of system catalogs, and
also was not on board with the general principle that user-created tables
do not become catalogs just by virtue of being renamed into pg_catalog.
We can also get rid of a messy hack in ReindexMultipleTables.

While we're at it, also rename IsSystemNamespace to IsCatalogNamespace,
because the previous name invited confusion with the more expansive
semantics used by IsSystemRelation/Class.

Also improve the comments in catalog.c.

There are a few remaining places in replication-related code that are
special-casing OIDs below FirstNormalObjectId.  I'm inclined to think
those are wrong too, and if there should be any special case it should
just extend to FirstBootstrapObjectId.  But first we need to debate
whether a FOR ALL TABLES publication should include information_schema.

Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us
2019-05-08 23:27:38 -04:00
Heikki Linnakangas 256be1050c Remove leftover reference to old "flat file" mechanism in a comment.
The flat file mechanism was removed in PostgreSQL 9.0.
2019-05-08 09:32:34 +03:00
Amit Kapila 7db0cde6b5 Revert "Avoid the creation of the free space map for small heap relations".
This feature was using a process local map to track the first few blocks
in the relation.  The map was reset each time we get the block with enough
freespace.  It was discussed that it would be better to track this map on
a per-relation basis in relcache and then invalidate the same whenever
vacuum frees up some space in the page or when FSM is created.  The new
design would be better both in terms of API design and performance.

List of commits reverted, in reverse chronological order:

06c8a5090e  Improve code comments in b0eaa4c51b.
13e8643bfc  During pg_upgrade, conditionally skip transfer of FSMs.
6f918159a9  Add more tests for FSM.
9c32e4c350  Clear the local map when not used.
29d108cdec  Update the documentation for FSM behavior..
08ecdfe7e5  Make FSM test portable.
b0eaa4c51b  Avoid creation of the free space map for small heap relations.

Discussion: https://postgr.es/m/20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de
2019-05-07 09:30:24 +05:30
Alvaro Herrera 9f8b717a80 Message style fixes 2019-04-30 10:33:37 -04:00
Michael Paquier 47ac2033d4 Simplify some ERROR paths clearing wait events and transient files
Transient files and wait events get normally cleaned up when seeing an
exception (be it in the context of a transaction for a backend or
another process like the checkpointer), hence there is little point in
complicating error code paths to do this work.  This shaves a bit of
code, and removes some extra handling with errno which needed to be
preserved during the cleanup steps done.

Reported-by: Masahiko Sawada
Author: Michael Paquier
Reviewed-by: Tom Lane, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoDhHYVq5KkXfkaHhmjA-zJYj-e4teiRAJefvXuKJz1tKQ@mail.gmail.com
2019-04-17 09:51:45 +09:00
Thomas Munro f7feb020c3 Fix GetNewTransactionId()'s interaction with xidVacLimit.
Commit ad308058 switched to returning a FullTransactionId, but
failed to load the potentially updated value in the case where
xidVacLimit is reached and we release and reacquire the lock.
Repair, closing bug #15727.

While reviewing that commit, also fix the size computation used
by EstimateTransactionStateSize() and switch to the mul_size()
macro traditionally used in such expressions.

Author: Thomas Munro
Reported-by: Roman Zharkov
Discussion: https://postgr.es/m/15727-0be246e7d852d229%40postgresql.org
2019-04-12 16:47:50 +12:00
Amit Kapila bdf35744bd Avoid counting transaction stats for parallel worker cooperating
transaction.

The transaction that is initiated by the parallel worker to cooperate
with the actual transaction started by the main backend to complete the
query execution should not be counted as a separate transaction.  The
other internal transactions started and committed by the parallel worker
are still counted as separate transactions as we that is what we do in
other places like autovacuum.

This will partially fix the bloat in transaction stats due to additional
transactions performed by parallel workers.  For a complete fix, we need to
decide how we want to show all the transactions that are started internally
for various operations and that is a matter of separate patch.

Reported-by: Haribabu Kommi
Author: Haribabu Kommi
Reviewed-by: Amit Kapila, Jamison Kirk and Rahila Syed
Backpatch-through: 9.6
Discussion: https://postgr.es/m/CAJrrPGc9=jKXuScvNyQ+VNhO0FZk7LLAShAJRyZjnedd2D61EQ@mail.gmail.com
2019-04-10 08:24:15 +05:30
Thomas Munro 3eb77eba5a Refactor the fsync queue for wider use.
Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off fsync work in the same way.

For now only md.c uses the new interface, but other users are being
proposed.  Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one.

Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key.  This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse or high
segment numbers).

Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
2019-04-04 23:38:38 +13:00
Alvaro Herrera 799e220346 Log all statements from a sample of transactions
This is useful to obtain a view of the different transaction types in an
application, regardless of the durations of the statements each runs.

Author: Adrien Nayrat
Reviewed-by: Masahiko Sawada, Hayato Kuroda, Andres Freund
2019-04-03 18:43:59 -03:00
Heikki Linnakangas 9155580fd5 Generate less WAL during GiST, GIN and SP-GiST index build.
Instead of WAL-logging every modification during the build separately,
first build the index without any WAL-logging, and make a separate pass
through the index at the end, to write all pages to the WAL. This
significantly reduces the amount of WAL generated, and is usually also
faster, despite the extra I/O needed for the extra scan through the index.
WAL generated this way is also faster to replay.

For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
be marked with a valid (i.e. non-zero) LSN, so that the parent-child
LSN-NSN interlock works correctly. We now use magic value 1 for that during
index build. Change the fake LSN counter to begin from 1000, so that 1 is
safely smaller than any real or fake LSN. 2 would've been enough for our
purposes, but let's reserve a bigger range, in case we need more special
values in the future.

Author: Anastasia Lubennikova, Andrey V. Lepikhov
Reviewed-by: Heikki Linnakangas, Dmitry Dolgov
2019-04-03 17:03:15 +03:00
Stephen Frost 4d0e994eed Add support for partial TOAST decompression
When asked for a slice of a TOAST entry, decompress enough to return the
slice instead of decompressing the entire object.

For use cases where the slice is at, or near, the beginning of the entry,
this avoids a lot of unnecessary decompression work.

This changes the signature of pglz_decompress() by adding a boolean to
indicate if it's ok for the call to finish before consuming all of the
source or destination buffers.

Author: Paul Ramsey
Reviewed-By: Rafia Sabih, Darafei Praliaskouski, Regina Obe
Discussion: https://postgr.es/m/CACowWR07EDm7Y4m2kbhN_jnys%3DBBf9A6768RyQdKm_%3DNpkcaWg%40mail.gmail.com
2019-04-02 12:35:32 -04:00
Thomas Munro 475861b261 Add wal_recycle and wal_init_zero GUCs.
On at least ZFS, it can be beneficial to create new WAL files every
time and not to bother zero-filling them.  Since it's not clear which
other filesystems might benefit from one or both of those things,
add individual GUCs to control those two behaviors independently and
make only very general statements in the docs.

Author: Jerry Jelinek, with some adjustments by Thomas Munro
Reviewed-by: Alvaro Herrera, Andres Freund, Tomas Vondra, Robert Haas and others
Discussion: https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com
2019-04-02 14:37:14 +13:00
Peter Eisentraut cc8d415117 Unified logging system for command-line programs
This unifies the various ad hoc logging (message printing, error
printing) systems used throughout the command-line programs.

Features:

- Program name is automatically prefixed.

- Message string does not end with newline.  This removes a common
  source of inconsistencies and omissions.

- Additionally, a final newline is automatically stripped, simplifying
  use of PQerrorMessage() etc., another common source of mistakes.

- I converted error message strings to use %m where possible.

- As a result of the above several points, more translatable message
  strings can be shared between different components and between
  frontends and backend, without gratuitous punctuation or whitespace
  differences.

- There is support for setting a "log level".  This is not meant to be
  user-facing, but can be used internally to implement debug or
  verbose modes.

- Lazy argument evaluation, so no significant overhead if logging at
  some level is disabled.

- Some color in the messages, similar to gcc and clang.  Set
  PG_COLOR=auto to try it out.  Some colors are predefined, but can be
  customized by setting PG_COLORS.

- Common files (common/, fe_utils/, etc.) can handle logging much more
  simply by just using one API without worrying too much about the
  context of the calling program, requiring callbacks, or having to
  pass "progname" around everywhere.

- Some programs called setvbuf() to make sure that stderr is
  unbuffered, even on Windows.  But not all programs did that.  This
  is now done centrally.

Soft goals:

- Reduces vertical space use and visual complexity of error reporting
  in the source code.

- Encourages more deliberate classification of messages.  For example,
  in some cases it wasn't clear without analyzing the surrounding code
  whether a message was meant as an error or just an info.

- Concepts and terms are vaguely aligned with popular logging
  frameworks such as log4j and Python logging.

This is all just about printing stuff out.  Nothing affects program
flow (e.g., fatal exits).  The uses are just too varied to do that.
Some existing code had wrappers that do some kind of print-and-exit,
and I adapted those.

I tried to keep the output mostly the same, but there is a lot of
historical baggage to unwind and special cases to consider, and I
might not always have succeeded.  One significant change is that
pg_rewind used to write all error messages to stdout.  That is now
changed to stderr.

Reviewed-by: Donald Dong <xdong@csumb.edu>
Reviewed-by: Arthur Zakirov <a.zakirov@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/6a609b43-4f57-7348-6480-bd022f924310@2ndquadrant.com
2019-04-01 20:01:35 +02:00
Robert Haas c900c15269 Warn more strongly about the dangers of exclusive backup mode.
Especially, warn about the hazards of mishandling the backup_label
file.  Adjust a couple of server messages to be more clear about
the hazards associated with removing backup_label files, too.

David Steele and Robert Haas, reviewed by Laurenz Albe, Martín
Marqués, Peter Eisentraut, and Magnus Hagander.

Discussion: http://postgr.es/m/7d85c387-000e-16f0-e00b-50bf83c22127@pgmasters.net
2019-03-29 08:15:16 -04:00
Thomas Munro 7e69323bf7 Fix typo.
Author: Masahiko Sawada
2019-03-29 10:03:58 +13:00
Thomas Munro ad308058cc Use FullTransactionId for the transaction stack.
Provide GetTopFullTransactionId() and GetCurrentFullTransactionId().
The intended users of these interfaces are access methods that use
xids for visibility checks but don't want to have to go back and
"freeze" existing references some time later before the 32 bit xid
counter wraps around.

Use a new struct to serialize the transaction state for parallel
query, because FullTransactionId doesn't fit into the previous
serialization scheme very well.

Author: Thomas Munro
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:24:43 +13:00
Thomas Munro 2fc7af5e96 Add basic infrastructure for 64 bit transaction IDs.
Instead of inferring epoch progress from xids and checkpoints,
introduce a 64 bit FullTransactionId type and use it to track xid
generation.  This fixes an unlikely bug where the epoch is reported
incorrectly if the range of active xids wraps around more than once
between checkpoints.

The only user-visible effect of this commit is to correct the epoch
used by txid_current() and txid_status(), also visible with
pg_controldata, in those rare circumstances.  It also creates some
basic infrastructure so that later patches can use 64 bit
transaction IDs in more places.

The new type is a struct that we pass by value, as a form of strong
typedef.  This prevents the sort of accidental confusion between
TransactionId and FullTransactionId that would be possible if we
were to use a plain old uint64.

Author: Thomas Munro
Reported-by: Amit Kapila
Reviewed-by: Andres Freund, Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:12:20 +13:00
Peter Eisentraut 280a408b48 Transaction chaining
Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which
start new transactions with the same transaction characteristics as the
just finished one, per SQL standard.

Support for transaction chaining in PL/pgSQL is also added.  This
functionality is especially useful when running COMMIT in a loop in
PL/pgSQL.

Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645a5a@2ndquadrant.com
2019-03-24 11:33:02 +01:00
Michael Paquier 8b938d36f7 Refactor more code logic to update the control file
ce6afc6 has begun the refactoring work by plugging pg_rewind into a
central routine to update the control file, and left around two extra
copies, with one in xlog.c for the backend and one in pg_resetwal.c.  By
adding an extra option to the central routine in controldata_utils.c to
control if a flush of the control file needs to be done, it is proving
to be straight-forward to make xlog.c and pg_resetwal.c use the central
code path at the condition of moving the wait event tracking there.
Hence, this allows to have only one central code path to update the
control file, shaving the code from the duplicates.

This refactoring actually fixes a problem in pg_resetwal.  Previously,
the control file was first removed before being recreated.  So if a
crash happened between the moment the file was removed and the moment
the file was created, then it would have been possible to not have a
control file anymore in the database folder.

Author: Fabien Coelho
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903170935210.2506@lancre
2019-03-18 12:59:35 +09:00
Peter Eisentraut 893d6f8a1f Avoid casting away a const 2019-03-16 10:13:03 +01:00
Thomas Munro bb16aba50c Enable parallel query with SERIALIZABLE isolation.
Previously, the SERIALIZABLE isolation level prevented parallel query
from being used.  Allow the two features to be used together by
sharing the leader's SERIALIZABLEXACT with parallel workers.

An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to
share, and new logic is introduced to coordinate the early release
of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE
optimization, as follows:

The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by
some other transaction) will 'partially release' the SERIALIZABLEXACT,
meaning that the conflicts and locks it holds are released, but the
SERIALIZABLEXACT itself will remain active because other backends
might still have a pointer to it.

Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears
its own MySerializableXact variable and frees local resources so that
it can skip SSI checks for the rest of the transaction.  In the
special case of the leader process, it transfers the SERIALIZABLEXACT
to a new variable SavedSerializableXact, so that it can be completely
released at the end of the transaction after all workers have exited.

Remove the serializable_okay flag added to CreateParallelContext() by
commit 9da0cc35, because it's now redundant.

Author: Thomas Munro
Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner
Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com
2019-03-15 17:47:04 +13:00
Michael Paquier f2d84a4a6b Adjust error message for partial writes in WAL segments
93473c6 has removed openLogOff, changing on the way the error message
which is used to report partial writes to WAL segments.  The
newly-introduced error message used the offset up to which the write has
happened, keeping always the same total length to write.  This changes
the error message so as the number of bytes left to write are reported.

Reported-by: Michael Paquier
Author: Robert Haas
Discussion: https://postgr.es/m/20190306235251.GA17293@paquier.xyz
2019-03-11 09:31:25 +09:00
Michael Paquier 82a5649fb9 Tighten use of OpenTransientFile and CloseTransientFile
This fixes two sets of issues related to the use of transient files in
the backend:
1) OpenTransientFile() has been used in some code paths with read-write
flags while read-only is sufficient, so switch those calls to be
read-only where necessary.  These have been reported by Joe Conway.
2) When opening transient files, it is up to the caller to close the
file descriptors opened.  In error code paths, CloseTransientFile() gets
called to clean up things before issuing an error.  However in normal
exit paths, a lot of callers of CloseTransientFile() never actually
reported errors, which could leave a file descriptor open without
knowing about it.  This is an issue I complained about a couple of
times, but never had the courage to write and submit a patch, so here we
go.

Note that one frontend code path is impacted by this commit so as an
error is issued when fetching control file data, making backend and
frontend to be treated consistently.

Reported-by: Joe Conway, Michael Paquier
Author: Michael Paquier
Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway
Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz
Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
2019-03-09 08:50:55 +09:00
Robert Haas 93473c6ac8 Removed unused variable, openLogOff.
Antonin Houska

Discussion: http://postgr.es/m/30413.1551870730@localhost
2019-03-06 09:44:08 -05:00
Michael Paquier effe7d9552 Make release of 2PC identifier and locks consistent in COMMIT PREPARED
When preparing a transaction in two-phase commit, a dummy PGPROC entry
holding the GID used for the transaction is registered, which gets
released once COMMIT PREPARED is run.  Prior releasing its shared memory
state, all the locks taken in the prepared transaction are released
using a dedicated set of callbacks (pgstat and multixact having similar
callbacks), which may cause the locks to be released before the GID is
set free.

Hence, there is a small window where lock conflicts could happen, for
example:
- Transaction A releases its locks, still holding its GID in shared
memory.
- Transaction B held a lock which conflicted with locks of transaction
A.
- Transaction B continues its processing, reusing the same GID as
transaction A.
- Transaction B fails because of a conflicting GID, already in use by
transaction A.

This commit changes the shared memory state release so as post-commit
callbacks and predicate lock cleanup happen consistently with the shared
memory state cleanup for the dummy PGPROC entry.  The race window is
small and 2PC had this issue from the start, so no backpatch is done.
On top if that fixes discussed involved ABI breakages, which are not
welcome in stable branches.

Reported-by: Oleksii Kliukin, Ildar Musin
Diagnosed-by: Oleksii Kliukin, Ildar Musin
Author: Michael Paquier
Reviewed-by: Masahiko Sawada, Oleksii Kliukin
Discussion: https://postgr.es/m/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6@hintbits.com
2019-02-25 14:19:34 +09:00
Michael Paquier 4c23216002 Fix incorrect function reference in comment of twophase.c
The header block of TwoPhaseGetDummyBackendId mentioned incorrectly
TwoPhaseGetDummyProc.

Reported-by: Oleksii Kliukin
Discussion: https://postgr.es/m/D8336E40-BBE1-4954-98BB-7830D3F5CB36@hintbits.com
2019-02-23 08:40:01 +09:00
Michael Paquier 0dd6ff0ac8 Avoid some unnecessary block reads in WAL reader
When reading a new page internally and depending on the way the WAL
reader facility gets used by plugins, the current implementation of the
WAL reader may finish by reading a block multiple times while it is not
actually necessary as the requested data length may be equal to what has
been already read.  This can happen for any size, but is more likely to
happen at the end of a page.  This can cause performance penalties in
plugins which rely on the block reads to be purely sequential, zlib not
liking backward reads for example.  The new behavior also shaves some
cycles when doing recovery.

Author: Arthur Zakirov
Reviewed-by: Andrey Lepikhov, Michael Paquier, Grigory Smolkin
Discussion: https://postgr.es/m/2ddf4a32-517e-d6f4-d992-4a63b6035bfd@postgrespro.ru
2019-02-18 09:52:02 +09:00
Peter Eisentraut 37d9916020 More unconstify use
Replace casts whose only purpose is to cast away const with the
unconstify() macro.

Discussion: https://www.postgresql.org/message-id/flat/53a28052-f9f3-1808-fed9-460fd43035ab%402ndquadrant.com
2019-02-13 11:50:16 +01:00
Michael Paquier ea92368cd1 Move max_wal_senders out of max_connections for connection slot handling
Since its introduction, max_wal_senders is counted as part of
max_connections when it comes to define how many connection slots can be
used for replication connections with a WAL sender context.  This can
lead to confusion for some users, as it could be possible to block a
base backup or replication from happening because other backend sessions
are already taken for other purposes by an application, and
superuser-only connection slots are not a correct solution to handle
that case.

This commit makes max_wal_senders independent of max_connections for its
handling of PGPROC entries in ProcGlobal, meaning that connection slots
for WAL senders are handled using their own free queue, like autovacuum
workers and bgworkers.

One compatibility issue that this change creates is that a standby now
requires to have a value of max_wal_senders at least equal to its
primary.  So, if a standby created enforces the value of
max_wal_senders to be lower than that, then this could break failovers.
Normally this should not be an issue though, as any settings of a
standby are inherited from its primary as postgresql.conf gets normally
copied as part of a base backup, so parameters would be consistent.

Author: Alexander Kukushkin
Reviewed-by: Kyotaro Horiguchi, Petr Jelínek, Masahiko Sawada, Oleksii
Kliukin
Discussion: https://postgr.es/m/CAFh8B=nBzHQeYAu0b8fjK-AF1X4+_p6GRtwG+cCgs6Vci2uRuQ@mail.gmail.com
2019-02-12 10:07:56 +09:00
Amit Kapila b0eaa4c51b Avoid creation of the free space map for small heap relations, take 2.
Previously, all heaps had FSMs. For very small tables, this means that the
FSM took up more space than the heap did. This is wasteful, so now we
refrain from creating the FSM for heaps with 4 pages or fewer. If the last
known target block has insufficient space, we still try to insert into some
other page before giving up and extending the relation, since doing
otherwise leads to table bloat. Testing showed that trying every page
penalized performance slightly, so we compromise and try every other page.
This way, we visit at most two pages. Any pages with wasted free space
become visible at next relation extension, so we still control table bloat.
As a bonus, directly attempting one or two pages can even be faster than
consulting the FSM would have been.

Once the FSM is created for a heap we don't remove it even if somebody
deletes all the rows from the corresponding relation.  We don't think it is
a useful optimization as it is quite likely that relation will again grow
to the same size.

Author: John Naylor, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Mithun C Y
Discussion: https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ@mail.gmail.com
2019-02-04 07:49:15 +05:30
Michael Paquier c93001b3f9 Adjust comment about timeout when waiting for WAL at recovery
A timeout of 5s is used when waiting for WAL to become available at
recovery so as the startup process is able to react promptly if a
trigger file shows up.  However this missed the fact that the startup
process also relies on the timeout to check periodically the status of
any active WAL receiver.

Discussion: https://postgr.es/m/20190131070956.GE13429@paquier.xyz
2019-02-01 10:46:45 +09:00
Tom Lane f09346a9c6 Refactor planner's header files.
Create a new header optimizer/optimizer.h, which exposes just the
planner functions that can be used "at arm's length", without need
to access Paths or the other planner-internal data structures defined
in nodes/relation.h.  This is intended to provide the whole planner
API seen by most of the rest of the system; although FDWs still need
to use additional stuff, and more thought is also needed about just
what selfuncs.c should rely on.

The main point of doing this now is to limit the amount of new
#include baggage that will be needed by "planner support functions",
which I expect to introduce later, and which will be in relevant
datatype modules rather than anywhere near the planner.

This commit just moves relevant declarations into optimizer.h from
other header files (a couple of which go away because everything
got moved), and adjusts #include lists to match.  There's further
cleanup that could be done if we want to decide that some stuff
being exposed by optimizer.h doesn't belong in the planner at all,
but I'll leave that for another day.

Discussion: https://postgr.es/m/11460.1548706639@sss.pgh.pa.us
2019-01-29 15:48:51 -05:00
Amit Kapila a23676503b Revert "Avoid creation of the free space map for small heap relations."
This reverts commit ac88d2962a.
2019-01-28 11:31:44 +05:30
Amit Kapila ac88d2962a Avoid creation of the free space map for small heap relations.
Previously, all heaps had FSMs. For very small tables, this means that the
FSM took up more space than the heap did. This is wasteful, so now we
refrain from creating the FSM for heaps with 4 pages or fewer. If the last
known target block has insufficient space, we still try to insert into some
other page before giving up and extending the relation, since doing
otherwise leads to table bloat. Testing showed that trying every page
penalized performance slightly, so we compromise and try every other page.
This way, we visit at most two pages. Any pages with wasted free space
become visible at next relation extension, so we still control table bloat.
As a bonus, directly attempting one or two pages can even be faster than
consulting the FSM would have been.

Once the FSM is created for a heap we don't remove it even if somebody
deletes all the rows from the corresponding relation.  We don't think it is
a useful optimization as it is quite likely that relation will again grow
to the same size.

Author: John Naylor with design inputs and some code contribution by Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Mithun C Y
Discussion: https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ@mail.gmail.com
2019-01-28 08:14:06 +05:30
Michael Paquier c9b75c5838 Simplify restriction handling of two-phase commit for temporary objects
There were two flags used to track the access to temporary tables and
to the temporary namespace of a session which are used to restrict
PREPARE TRANSACTION, however the first control flag is a concept
included in the second.  This removes the flag for temporary table
tracking, keeping around only the one at namespace level.

Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20190118053126.GH1883@paquier.xyz
2019-01-26 10:45:23 +09:00
Andres Freund c91560defc Move remaining code from tqual.[ch] to heapam.h / heapam_visibility.c.
Given these routines are heap specific, and that there will be more
generic visibility support in via table AM, it makes sense to move the
prototypes to heapam.h (routines like HeapTupleSatisfiesVacuum will
not be exposed in a generic fashion, because they are too storage
specific).

Similarly, the code in tqual.c is specific to heap, so moving it into
access/heap/ makes sense.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-21 17:07:10 -08:00
Michael Paquier c5660e0aa5 Restrict the use of temporary namespace in two-phase transactions
Attempting to use a temporary table within a two-phase transaction is
forbidden for ages.  However, there have been uncovered grounds for
a couple of other object types and commands which work on temporary
objects with two-phase commit.  In short, trying to create, lock or drop
an object on a temporary schema should not be authorized within a
two-phase transaction, as it would cause its state to create
dependencies with other sessions, causing all sorts of side effects with
the existing session or other sessions spawned later on trying to use
the same temporary schema name.

Regression tests are added to cover all the grounds found, the original
report mentioned function creation, but monitoring closer there are many
other patterns with LOCK, DROP or CREATE EXTENSION which are involved.
One of the symptoms resulting in combining both is that the session
which used the temporary schema is not able to shut down completely,
waiting for being able to drop the temporary schema, something that it
cannot complete because of the two-phase transaction involved with
temporary objects.  In this case the client is able to disconnect but
the session remains alive on the backend-side, potentially blocking
connection backend slots from being used.  Other problems reported could
also involve server crashes.

This is back-patched down to v10, which is where 9b013dc has introduced
MyXactFlags, something that this patch relies on.

Reported-by: Alexey Bashtanov
Author: Michael Paquier
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/5d910e2e-0db8-ec06-dd5f-baec420513c3@imap.cc
Backpatch-through: 10
2019-01-18 09:21:44 +09:00
Peter Eisentraut 0acb3bc33a Change default of recovery_target_timeline to 'latest'
This is what one usually wants for recovery and almost always wants
for a standby.

Discussion: https://www.postgresql.org/message-id/flat/6dd2c23a-4162-8469-410f-bfe146e28c0c@2ndquadrant.com/
Reviewed-by: David Steele <david@pgmasters.net>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
2019-01-13 10:01:05 +01:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Michael Paquier 1707a0d2aa Remove configure switch --disable-strong-random
This removes a portion of infrastructure introduced by fe0a0b5 to allow
compilation of Postgres in environments where no strong random source is
available, meaning that there is no linking to OpenSSL and no
/dev/urandom (Windows having its own CryptoAPI).  No systems shipped
this century lack /dev/urandom, and the buildfarm is actually not
testing this switch at all, so just remove it.  This simplifies
particularly some backend code which included a fallback implementation
using shared memory, and removes a set of alternate regression output
files from pgcrypto.

Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/20181230063219.GG608@paquier.xyz
2019-01-01 20:05:51 +09:00
Michael Paquier 8d3b389ec3 Improve comments and logs in do_pg_stop/start_backup
The function name pg_stop_backup() has been included for ages in some
log messages when stopping the backup, which is confusing for base
backups taken with the replication protocol because this function is
never called.  Some other comments and messages in this area are
improved while on it.

The new wording is based on input and suggestions from several people,
all listed below.

Author: Michael Paquier
Reviewed-by: Peter Eisentraut, Álvaro Herrera, Tom Lane
Discussion: https://postgr.es/m/20181221040510.GA12599@paquier.xyz
2019-01-01 08:53:02 +09:00
Peter Eisentraut 60d99797bf Change "checkpoint starting" message to use "wal"
This catches up with the recent renaming of all user-facing mentions
of "xlog" to "wal".

Discussion: https://www.postgresql.org/message-id/flat/20181129084708.GA9562%40msg.credativ.de
2018-12-30 22:23:01 +01:00
Tom Lane 8528e3d849 Fix failure to check for open() or fsync() failures.
While it seems OK to not be concerned about fsync() failure for a
pre-existing signal file, it's not OK to not even check for open()
failure.  This at least causes complaints from static analyzers,
and I think on some platforms passing -1 to fsync() or close() might
trigger assertion-type failures.  Also add (void) casts to make clear
that we're ignoring fsync's result intentionally.

Oversights in commit 2dedf4d9a, noted by Coverity.
2018-12-26 16:08:17 -05:00
Alvaro Herrera 68f6f2b739 Remove function names from error messages
They are not necessary, and having them there gives useless work for
translators.
2018-12-19 14:53:27 -03:00
Tom Lane ade2d61ed0 Improve detection of child-process SIGPIPE failures.
Commit ffa4cbd62 added logic to detect SIGPIPE failure of a COPY child
process, but it only worked correctly if the SIGPIPE occurred in the
immediate child process.  Depending on the shell in use and the
complexity of the shell command string, we might instead get back
an exit code of 128 + SIGPIPE, representing a shell error exit
reporting SIGPIPE in the child process.

We could just hack up ClosePipeToProgram() to add the extra case,
but it seems like this is a fairly general issue deserving a more
general and better-documented solution.  I chose to add a couple
of functions in src/common/wait_error.c, which is a natural place
to know about wait-result encodings, that will test for either a
specific child-process signal type or any child-process signal failure.
Then, adjust other places that were doing ad-hoc tests of this type
to use the common functions.

In RestoreArchivedFile, this fixes a race condition affecting whether
the process will report an error or just silently proc_exit(1): before,
that depended on whether the intermediate shell got SIGTERM'd itself
or reported a child process failing on SIGTERM.

Like the previous patch, back-patch to v10; we could go further
but there seems no real need to.

Per report from Erik Rijkers.

Discussion: https://postgr.es/m/f3683f87ab1701bea5d86a7742b22432@xs4all.nl
2018-12-16 14:32:14 -05:00
Stephen Frost a243c55326 Cleanup comments in xlog compression
Skipping over the "hole" in full page images in the XLOG code was
described as being a form of compression, but this got a bit confusing
since we now have PGLZ-based compression happening, so adjust the
wording to discuss "removing" the "hole" and keeping the talk about
compression to where we're talking about using PGLZ-based compression of
the full page images.

Reviewed-By: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20181127234341.GM3415@tamriel.snowman.net
2018-12-06 11:05:39 -05:00
Peter Eisentraut 2dedf4d9a8 Integrate recovery.conf into postgresql.conf
recovery.conf settings are now set in postgresql.conf (or other GUC
sources).  Currently, all the affected settings are PGC_POSTMASTER;
this could be refined in the future case by case.

Recovery is now initiated by a file recovery.signal.  Standby mode is
initiated by a file standby.signal.  The standby_mode setting is
gone.  If a recovery.conf file is found, an error is issued.

The trigger_file setting has been renamed to promote_trigger_file as
part of the move.

The documentation chapter "Recovery Configuration" has been integrated
into "Server Configuration".

pg_basebackup -R now appends settings to postgresql.auto.conf and
creates a standby.signal file.

Author: Fujii Masao <masao.fujii@gmail.com>
Author: Simon Riggs <simon@2ndquadrant.com>
Author: Abhijit Menon-Sen <ams@2ndquadrant.com>
Author: Sergei Kornilov <sk@zsrv.org>
Discussion: https://www.postgresql.org/message-id/flat/607741529606767@web3g.yandex.ru/
2018-11-25 16:33:40 +01:00
Thomas Munro cfdf4dc4fc Add WL_EXIT_ON_PM_DEATH pseudo-event.
Users of the WaitEventSet and WaitLatch() APIs can now choose between
asking for WL_POSTMASTER_DEATH and then handling it explicitly, or asking
for WL_EXIT_ON_PM_DEATH to trigger immediate exit on postmaster death.
This reduces code duplication, since almost all callers want the latter.

Repair all code that was previously ignoring postmaster death completely,
or requesting the event but ignoring it, or requesting the event but then
doing an unconditional PostmasterIsAlive() call every time through its
event loop (which is an expensive syscall on platforms for which we don't
have USE_POSTMASTER_DEATH_SIGNAL support).

Assert that callers of WaitLatchXXX() under the postmaster remember to
ask for either WL_POSTMASTER_DEATH or WL_EXIT_ON_PM_DEATH, to prevent
future bugs.

The only process that doesn't handle postmaster death is syslogger.  It
waits until all backends holding the write end of the syslog pipe
(including the postmaster) have closed it by exiting, to be sure to
capture any parting messages.  By using the WaitEventSet API directly
it avoids the new assertion, and as a by-product it may be slightly
more efficient on platforms that have epoll().

Author: Thomas Munro
Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas, Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D1TCviRykkUb69ppWLr_V697rzd1j3eZsRMmbXvETfqbQ%40mail.gmail.com,
            https://postgr.es/m/CAEepm=2LqHzizbe7muD7-2yHUbTOoF7Q+qkSD5Q41kuhttRTwA@mail.gmail.com
2018-11-23 20:46:34 +13:00
Andres Freund 578b229718 Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.

This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row.  Neither pg_dump nor COPY included the contents of the
oid column by default.

The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.

WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.

Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
  WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
  issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
  restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
  OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
  plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.

The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.

The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such.  This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.

The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.

Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).

The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.

While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.

Catversion bump, for obvious reasons.

Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-20 16:00:17 -08:00
Thomas Munro 9ccdd7f66e PANIC on fsync() failure.
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure.  In that case the only remaining copy of the
data is in the WAL.  A subsequent fsync() could appear to succeed,
but not have flushed the data.  That means that a future checkpoint
could apparently complete successfully but have lost data.

Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure.  Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.

Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses.  If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail.  The GUC defaults
to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error.  A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.

Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
2018-11-19 17:41:26 +13:00
Michael Paquier 285bd0ac4a Remove unnecessary memcpy when reading WAL record fitting on page
When reading a WAL record, its contents are copied into an intermediate
buffer.  However, doing so is not necessary if the record fits fully
into the current page, saving one memcpy for each such record.  The
allocation handling of the intermediate buffer is also now done only
when a record crosses a page boundary, shaving some extra cycles when
reading a WAL record.

Author: Andrey Lepikhov
Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas
Discussion: https://postgr.es/m/c2ea54dd-a1d3-80eb-ddbf-7e6f258e615e@postgrespro.ru
2018-11-19 10:25:48 +09:00
Michael Paquier b4721f3950 Initialize TransactionState and user ID consistently at transaction start
If a failure happens when a transaction is starting between the moment
the transaction status is changed from TRANS_DEFAULT to TRANS_START and
the moment the current user ID and security context flags are fetched
via GetUserIdAndSecContext(), or before initializing its basic fields,
then those may get reset to incorrect values when the transaction
aborts, leaving the session in an inconsistent state.

One problem reported is that failing a starting transaction at the first
query of a session could cause several kinds of system crashes on the
follow-up queries.

In order to solve that, move the initialization of the transaction state
fields and the call of GetUserIdAndSecContext() in charge of fetching
the current user ID close to the point where the transaction status is
switched to TRANS_START, where there cannot be any error triggered
in-between, per an idea of Tom Lane.  This properly ensures that the
current user ID, the security context flags and that the basic fields of
TransactionState remain consistent even if the transaction fails while
starting.

Reported-by: Richard Guo
Diagnosed-By: Richard Guo
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAN_9JTxECSb=pEPcb0a8d+6J+bDcOZ4=DgRo_B7Y5gRHJUM=Rw@mail.gmail.com
Backpatch-through: 9.4
2018-11-14 16:46:53 +09:00
Andres Freund 450c7defa6 Remove volatiles from {procarray,volatile}.c and fix memory ordering issue.
The use of volatiles in procarray.c largely originated from the time
when postgres did not have reliable compiler and memory
barriers. That's not the case anymore, so we can do better.

Several of the functions in procarray.c can be bottlenecks, and
removal of volatile yields mildly better code.

The new state, with explicit memory barriers, is also more
correct. The previous use of volatile did not actually deliver
sufficient guarantees on weakly ordered machines, in particular the
logic in GetNewTransactionId() does not look safe.  It seems unlikely
to be a problem in practice, but worth fixing.

Thomas and I independently wrote a patch for this.

Reported-By: Andres Freund and Thomas Munro
Author: Andres Freund, with cherrypicked changes from a patch by Thomas Munro
Discussion:
    https://postgr.es/m/20181005172955.wyjb4fzcdzqtaxjq@alap3.anarazel.de
    https://postgr.es/m/CAEepm=1nff0x=7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g@mail.gmail.com
2018-11-10 16:11:57 -08:00
Thomas Munro c24dcd0cfd Use pg_pread() and pg_pwrite() for data files and WAL.
Cut down on system calls by doing random I/O using offset-based OS
routines where available.  Remove the code for tracking the 'virtual'
seek position.  The only reason left to call FileSeek() was to get
the file's size, so provide a new function FileSize() instead.

Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
2018-11-07 09:51:50 +13:00
Magnus Hagander fbec7459aa Fix spelling errors and typos in comments
Author: Daniel Gustafsson <daniel@yesql.se>
2018-11-02 13:56:52 +01:00
Andres Freund c4ab62f9ac Fix typo in xlog.c.
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/A6817958-949E-4A5B-895D-FA421B6640C2@yesql.se
2018-10-31 07:50:32 -07:00
Michael Paquier 10074651e3 Add pg_promote function
This function is able to promote a standby with this new SQL-callable
function.  Execution access can be granted to non-superusers so that
failover tools can observe the principle of least privilege.

Catalog version is bumped.

Author: Laurenz Albe
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at
2018-10-25 09:46:00 +09:00
Tom Lane 13cd7209f7 Simplify use of AllocSetContextCreate() wrapper macro.
We can allow this macro to accept either abbreviated or non-abbreviated
allocation parameters by making use of __VA_ARGS__.  As noted by Andres
Freund, it's unlikely that any compiler would have __builtin_constant_p
but not __VA_ARGS__, so this gives up little or no error checking, and
it avoids a minor but annoying API break for extensions.

With this change, there is no reason for anybody to call
AllocSetContextCreateExtended directly, so in HEAD I renamed it to
AllocSetContextCreateInternal.  It's probably too late for an ABI
break like that in 11, though.

Discussion: https://postgr.es/m/20181012170355.bhxi273skjt6sag4@alap3.anarazel.de
2018-10-12 14:26:56 -04:00
Thomas Munro 212fab9926 Relax transactional restrictions on ALTER TYPE ... ADD VALUE (redux).
Originally committed as 15bc038f (plus some follow-ups), this was
reverted in 28e07270 due to a problem discovered in parallel
workers.  This new version corrects that problem by sending the
list of uncommitted enum values to parallel workers.

Here follows the original commit message describing the change:

To prevent possibly breaking indexes on enum columns, we must keep
uncommitted enum values from getting stored in tables, unless we
can be sure that any such column is new in the current transaction.

Formerly, we enforced this by disallowing ALTER TYPE ... ADD VALUE
from being executed at all in a transaction block, unless the target
enum type had been created in the current transaction.  This patch
removes that restriction, and instead insists that an uncommitted enum
value can't be referenced unless it belongs to an enum type created
in the same transaction as the value.  Per discussion, this should be
a bit less onerous.  It does require each function that could possibly
return a new enum value to SQL operations to check this restriction,
but there aren't so many of those that this seems unmaintainable.

Author: Andrew Dunstan and Tom Lane, with parallel query fix by Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D0Ei7g6PaNTbcmAh9tCRahQrk%3Dr5ZWLD-jr7hXweYX3yg%40mail.gmail.com
Discussion: https://postgr.es/m/4075.1459088427%40sss.pgh.pa.us
2018-10-09 12:51:01 +13:00
Tom Lane 82ff0cc91d Advance transaction timestamp for intra-procedure transactions.
Per discussion, this behavior seems less astonishing than not doing so.

Peter Eisentraut and Tom Lane

Discussion: https://postgr.es/m/20180920234040.GC29981@momjian.us
2018-10-08 16:16:36 -04:00
Tom Lane 07ee62ce9e Propagate xactStartTimestamp and stmtStartTimestamp to parallel workers.
Previously, a worker process would establish values for these based on
its own start time.  In v10 and up, this can trivially be shown to cause
misbehavior of transaction_timestamp(), timestamp_in(), and related
functions which are (perhaps unwisely?) marked parallel-safe.  It seems
likely that other behaviors might diverge from what happens in the parent
as well.

It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure
it's still possible, so back-patch to all branches containing parallel
worker infrastructure.

In HEAD only, mark now() and statement_timestamp() as parallel-safe
(other affected functions already were).  While in theory we could
still squeeze that change into v11, it doesn't seem important enough
to force a last-minute catversion bump.

Konstantin Knizhnik, whacked around a bit by me

Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
2018-10-06 12:00:09 -04:00
Amit Kapila a86bf6057e Fix assertion failure when updating full_page_writes for checkpointer.
When the checkpointer receives a SIGHUP signal to update its configuration,
it may need to update the shared memory for full_page_writes and need to
write a WAL record for it.  Now, it is quite possible that the XLOG
machinery has not been initialized by that time and it will lead to
assertion failure while doing that.  Fix is to allow the initialization of
the XLOG machinery outside critical section.

This bug has been introduced by the commit 2c03216d83 which added the XLOG
machinery initialization in RecoveryInProgress code path.

Reported-by: Dilip Kumar
Author: Dilip Kumar
Reviewed-by: Michael Paquier and Amit Kapila
Backpatch-through: 9.5
Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
2018-09-28 16:40:04 +05:30
Michael Paquier 78ea8b5daa Fix WAL recycling on standbys depending on archive_mode
A restart point or a checkpoint recycling WAL segments treats segments
marked with neither ".done" (archiving is done) or ".ready" (segment is
ready to be archived) in archive_status the same way for archive_mode
being "on" or "always".  While for a primary this is fine, a standby
running a restart point with archive_mode = on would try to mark such a
segment as ready for archiving, which is something that will never
happen except after the standby is promoted.

Note that this problem applies only to WAL segments coming from the
local pg_wal the first time archive recovery is run.  Segments part of a
self-contained base backup are the most common case where this could
happen, however even in this case normally the .done markers would be
most likely part of the backup.  Segments recovered from an archive are
marked as .ready or .done by the startup process, and segments finished
streaming are marked as such by the WAL receiver, so they are handled
already.

Reported-by: Haruka Takatsuka
Author: Michael Paquier
Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org
Backpatch-through: 9.5, where archive_mode = always has been added.
2018-09-28 11:54:38 +09:00
Michael Paquier 8d28bf500f Rework activation of commit timestamps during recovery
The activation and deactivation of commit timestamp tracking has not
been handled consistently for a primary or standbys at recovery.  The
facility can be activated at three different moments of recovery:
- The beginning, where a primary would use the GUC value for the
decision-making, and where a standby relies on the contents of the
control file.
- When replaying a XLOG_PARAMETER_CHANGE record at redo.
- The end, where both primary and standby rely on the GUC value.

Using the GUC value for a primary at the beginning of recovery causes
problems with commit timestamp access when doing crash recovery.
Particularly, when replaying transaction commits, it could be possible
that an attempt to read commit timestamps is done for a transaction
which committed at a moment when track_commit_timestamp was disabled.

A test case is added to reproduce the failure.  The test works down to
v11 as it takes advantage of transaction commits within procedures.

Reported-by: Hailong Li
Author: Masahiko Sawasa, Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com
Backpatch-through: 9.5, where commit timestamps have been introduced.
2018-09-26 10:25:54 +09:00
Thomas Munro 6c3c9d4189 Defer restoration of libraries in parallel workers.
Several users of extensions complained of crashes in parallel workers
that turned out to be due to syscache access from their _PG_init()
functions.  Reorder the initialization of parallel workers so that
libraries are restored after the caches are initialized, and inside a
transaction.

This was reported in bug #15350 and elsewhere.  We don't consider it
to be a bug: extensions shouldn't do that, because then they can't be
used in shared_preload_libraries.  However, it's a fairly obscure
hazard and these extensions worked in practice before parallel query
came along.  So let's make it work.  Later commits might add a warning
message and eventually an error.

Back-patch to 9.6, where parallel query landed.

Author: Thomas Munro
Reviewed-by: Amit Kapila
Reported-by: Kieran McCusker, Jimmy
Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
2018-09-20 14:21:18 +12:00
Tom Lane 0d38e4ebb7 Fix minor error message style guide violation.
No periods at the ends of primary error messages, please.

Daniel Gustafsson

Discussion: https://postgr.es/m/43E004C0-18C6-42B4-A313-003B43EB0571@yesql.se
2018-09-19 17:06:40 -04:00
Amit Kapila bc153c941d Attach FPI to the first record after full_page_writes is turned on.
XLogInsert fails to attach a required FPI to the first record after
full_page_writes is turned on by the last checkpoint.  This bug got
introduced in 9.5 due to code rearrangement in commits 2c03216d83 and
2076db2aea.  Fix it by ensuring that XLogInsertRecord performs a
recomputation when the given record is generated with FPW as off but
found that the flag has been turned on while actually inserting the
record.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi
Reviewed-by: Amit Kapila
Backpatch-through: 9.5 where this problem was introduced
Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
2018-09-13 15:32:50 +05:30
Michael Paquier 9226a3b89b Remove duplicated words split across lines in comments
This has been detected using some interesting tricks with sed, and the
method used is mentioned in details in the discussion below.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20180908013109.GB15350@telsasoft.com
2018-09-08 12:24:19 -07:00
Michael Paquier 8582b4d044 Improve handling of corrupted two-phase state files at recovery
When a corrupted two-phase state file is found by WAL replay, be it for
crash recovery or archive recovery, then the file is simply skipped and
a WARNING is logged to the user, causing the transaction to be silently
lost.  Facing an on-disk WAL file which is corrupted is as likely to
happen as what is stored in WAL records, but WAL records are already
able to fail hard if there is a CRC mismatch.  On-disk two-phase state
files, on the contrary, are simply ignored if corrupted.  Note that when
restoring the initial two-phase data state at recovery, files newer than
the horizon XID are discarded hence no files present in pg_twophase/
should be torned and have been made durable by a previous checkpoint, so
recovery should never see any corrupted two-phase state file by design.

The situation got better since 978b2f6 which has added two-phase state
information directly in WAL instead of using on-disk files, so the risk
is limited to two-phase transactions which live across at least one
checkpoint for long periods.  Backups having legit two-phase state files
on-disk could also lose silently transactions when restored if things
get corrupted.

This behavior exists since two-phase commit has been introduced, no
back-patch is done for now per the lack of complaints about this
problem.

Author: Michael Paquier
Discussion: https://postgr.es/m/20180709050309.GM1467@paquier.xyz
2018-09-07 11:00:16 -07:00
Peter Eisentraut 98afa68d93 Use C99 designated initializers for some structs
These are just a few particularly egregious cases that were hard to read
and write, and error prone because of many similar adjacent types.

Discussion: https://www.postgresql.org/message-id/flat/4c9f01be-9245-2148-b569-61a8562ef190%402ndquadrant.com
2018-09-07 11:40:03 +02:00
Tom Lane 44cac93464 Avoid using potentially-under-aligned page buffers.
There's a project policy against using plain "char buf[BLCKSZ]" local
or static variables as page buffers; preferred style is to palloc or
malloc each buffer to ensure it is MAXALIGN'd.  However, that policy's
been ignored in an increasing number of places.  We've apparently got
away with it so far, probably because (a) relatively few people use
platforms on which misalignment causes core dumps and/or (b) the
variables chance to be sufficiently aligned anyway.  But this is not
something to rely on.  Moreover, even if we don't get a core dump,
we might be paying a lot of cycles for misaligned accesses.

To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock
that the compiler must allocate with sufficient alignment, and use
those in place of plain char arrays.

I used these types even for variables where there's no risk of a
misaligned access, since ensuring proper alignment should make
kernel data transfers faster.  I also changed some places where
we had been palloc'ing short-lived buffers, for coding style
uniformity and to save palloc/pfree overhead.

Since this seems to be a live portability hazard (despite the lack
of field reports), back-patch to all supported versions.

Patch by me; thanks to Michael Paquier for review.

Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
2018-09-01 15:27:17 -04:00
Michael Paquier c186ba135e Ensure correct minimum consistent point on standbys
Startup process has improved its calculation of incorrect minimum
consistent point in 8d68ee6, which ensures that all WAL available gets
replayed when doing crash recovery, and has introduced an incorrect
calculation of the minimum recovery point for non-startup processes,
which can cause incorrect page references on a standby when for example
the background writer flushed a couple of pages on-disk but was not
updating the control file to let a subsequent crash recovery replay to
where it should have.

The only case where this has been reported to be a problem is when a
standby needs to calculate the latest removed xid when replaying a btree
deletion record, so one would need connections on a standby that happen
just after recovery has thought it reached a consistent point.  Using a
background worker which is started after the consistent point is reached
would be the easiest way to get into problems if it connects to a
database.  Having clients which attempt to connect periodically could
also be a problem, but the odds of seeing this problem are much lower.

The fix used is pretty simple, as the idea is to give access to the
minimum recovery point written in the control file to non-startup
processes so as they use a reference, while the startup process still
initializes its own references of the minimum consistent point so as the
original problem with incorrect page references happening post-promotion
with a crash do not show up.

Reported-by: Alexander Kukushkin
Diagnosed-by: Alexander Kukushkin
Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Alexander Kukushkin
Discussion: https://postgr.es/m/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
Backpatch-through: 9.3
2018-08-31 11:03:40 -07:00
Thomas Munro ca1e64feba Improve comment in GetNewObjectId().
The previous comment gave the impression that skipping OIDs before
FirstNormalObjectId was merely an optimization to avoid likely collisions.
In fact other parts of the system have been relying on this threshold to
detect system-created objects since commit 8e18d04d4d, so adjust the
wording.

Author: Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D33JASACeOayr_W3%3DCSjy2jiPxM-k89axu0akFbHdjnjA%40mail.gmail.com
2018-08-16 17:17:30 +12:00
Michael Paquier 246a6c8f7b Make autovacuum more aggressive to remove orphaned temp tables
Commit dafa084, added in 10, made the removal of temporary orphaned
tables more aggressive.  This commit makes an extra step into the
aggressiveness by adding a flag in each backend's MyProc which tracks
down any temporary namespace currently in use.  The flag is set when the
namespace gets created and can be reset if the temporary namespace has
been created in a transaction or sub-transaction which is aborted.  The
flag value assignment is assumed to be atomic, so this can be done in a
lock-less fashion like other flags already present in PGPROC like
databaseId or backendId, still the fact that the temporary namespace and
table created are still locked until the transaction creating those
commits acts as a barrier for other backends.

This new flag gets used by autovacuum to discard more aggressively
orphaned tables by additionally checking for the database a backend is
connected to as well as its temporary namespace in-use, removing
orphaned temporary relations even if a backend reuses the same slot as
one which created temporary relations in a past session.

The base idea of this patch comes from Robert Haas, has been written in
its first version by Tsunakawa Takayuki, then heavily reviewed by me.

Author: Tsunakawa Takayuki
Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Andres Freund
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05
Backpatch: 11-, as PGPROC gains a new flag and we don't want silent ABI
breakages on already released versions.
2018-08-13 11:49:04 +02:00
Peter Geoghegan 4974d7f87e Handle parallel index builds on mapped relations.
Commit 9da0cc3528, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes.  This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode.  When this happened, the final index structure
was typically not consistent with the owning table's structure.  The
final index structure could contain entries formed from both heap
relfilenodes.  Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.

On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3).  On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.

Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 13:01:34 -07:00
Michael Paquier 5a23c74b63 Reset properly errno before calling write()
6cb3372 enforces errno to ENOSPC when less bytes than what is expected
have been written when it is unset, though it forgot to properly reset
errno before doing a system call to write(), causing errno to
potentially come from a previous system call.

Reported-by: Tom Lane
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
2018-08-05 05:31:18 +09:00
Michael Paquier d9fadbf131 Fix calculation for WAL segment recycling and removal
Commit 4b0d28de06 has removed the prior checkpoint and related
facilities but has left WAL recycling based on the LSN of the prior
checkpoint, which causes incorrect calculations for WAL removal and
recycling for max_wal_size and min_wal_size.  This commit changes things
so as the base calculation point is the last checkpoint generated.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/20180723.135748.42558387.horiguchi.kyotaro@lab.ntt.co.jp
Backpatch: 11-, where the prior checkpoint has been removed.
2018-07-24 10:32:56 +09:00
Michael Paquier e41d0a1090 Add proper errcodes to new error messages for read() failures
Those would use the default ERRCODE_INTERNAL_ERROR, but for foreseeable
failures an errcode ought to be set, ERRCODE_DATA_CORRUPTED making the
most sense here.

While on the way, fix one errcode_for_file_access missing in origin.c
since the code has been created, and remove one assignment of errno to 0
before calling read(), as this was around to fit with what was present
before 811b6e36 where errno would not be set when not enough bytes are
read.  I have noticed the first one, and Tom has pinged me about the
second one.

Author: Michael Paquier
Reported-by: Tom Lane
Discussion: https://postgr.es/m/27265.1531925836@sss.pgh.pa.us
2018-07-23 09:37:36 +09:00
Michael Paquier 56df07bb9e Make more consistent some error messages for file-related operations
Some error messages which report something about a file operation use
as well context which is already provided within the path being worked
on, making things rather duplicated.  This creates more work for
translators, and does not actually bring clarity.

More could be done, however in a lot of cases the context used is
actually useful, still that patch gets down things with a good cut.

Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Tom Lane
Discussion: https://postgr.es/m/20180718044711.GA8565@paquier.xyz
2018-07-23 09:19:12 +09:00
Tom Lane 3cb646264e Use a ResourceOwner to track buffer pins in all cases.
Historically, we've allowed auxiliary processes to take buffer pins without
tracking them in a ResourceOwner.  However, that creates problems for error
recovery.  In particular, we've seen multiple reports of assertion crashes
in the startup process when it gets an error while holding a buffer pin,
as for example if it gets ENOSPC during a write.  In a non-assert build,
the process would simply exit without releasing the pin at all.  We've
gotten away with that so far just because a failure exit of the startup
process translates to a database crash anyhow; but any similar behavior
in other aux processes could result in stuck pins and subsequent problems
in vacuum.

To improve this, institute a policy that we must *always* have a resowner
backing any attempt to pin a buffer, which we can enforce just by removing
the previous special-case code in resowner.c.  Add infrastructure to make
it easy to create a process-lifespan AuxProcessResourceOwner and clear
out its contents at appropriate times.  Replace existing ad-hoc resowner
management in bgwriter.c and other aux processes with that.  (Thus, while
the startup process gains a resowner where it had none at all before, some
other aux process types are replacing an ad-hoc resowner with this code.)
Also use the AuxProcessResourceOwner to manage buffer pins taken during
StartupXLOG and ShutdownXLOG, even when those are being run in a bootstrap
process or a standalone backend rather than a true auxiliary process.

In passing, remove some other ad-hoc resource owner creations that had
gotten cargo-culted into various other places.  As far as I can tell
that was all unnecessary, and if it had been necessary it was incomplete,
due to lacking any provision for clearing those resowners later.
(Also worth noting in this connection is that a process that hasn't called
InitBufferPoolBackend has no business accessing buffers; so there's more
to do than just add the resowner if we want to touch buffers in processes
not covered by this patch.)

Although this fixes a very old bug, no back-patch, because there's no
evidence of any significant problem in non-assert builds.

Patch by me, pursuant to a report from Justin Pryzby.  Thanks to
Robert Haas and Kyotaro Horiguchi for reviews.

Discussion: https://postgr.es/m/20180627233939.GA10276@telsasoft.com
2018-07-18 12:15:16 -04:00
Heikki Linnakangas 6b387179ba Fix misc typos, mostly in comments.
A collection of typos I happened to spot while reading code, as well as
grepping for common mistakes.

Backpatch to all supported versions, as applicable, to avoid conflicts
when backporting other commits in the future.
2018-07-18 16:17:32 +03:00
Michael Paquier 8bd064f0c7 Fix casting in error message for two-phase file
This error from from 811b6e3, which causes compilation warnings with OSX
10.3 and clang.

Reported by Tom Lane, via buildfarm member longfin.
2018-07-18 09:11:34 +09:00
Michael Paquier 811b6e36a9 Rework error messages around file handling
Some error messages related to file handling are using the code path
context to define their state.  For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being worked on.  So simplify all those error
messages by just referring to files with their path used.  In some
cases, like the manipulation of WAL segments, the context is actually
helpful so those are kept.

Calls to the system function read() have also been rather inconsistent
with their error handling sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set.  The in-core functions are using a more consistent pattern
with this patch, which checks for both errno if set or if an
inconsistent read is happening.

So as to care about pluralization when reading an unexpected number of
byte(s), "could not read: read %d of %zu" is used as error message, with
%d field being the output result of read() and %zu the expected size.
This simplifies the work of translators with less variations of the same
message.

Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
2018-07-18 08:01:23 +09:00
Robert Haas 32df1c9afa Add subtransaction handling for table synchronization workers.
Since the old logic was completely unaware of subtransactions, a
change made in a subsequently-aborted subtransaction would still cause
workers to be stopped at toplevel transaction commit.  Fix that by
managing a stack of worker lists rather than just one.

Amit Khandekar and Robert Haas

Discussion: http://postgr.es/m/CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
2018-07-16 17:33:22 -04:00
Michael Paquier 5fc1008e8a Clean up temporary WAL segments after an instance crash
Temporary WAL segments are created in pg_wal and named as xlogtemp.pid
before being renamed to the real deal when creating a new segment.  If
an instance crashes after the temporary segment is created and before
the rename is done, then the server would finish with unremovable data.

After an instance crash, scan pg_wal and remove any such segments.  With
repetitive unlucky crashes this would contribute to disk bloat and
presents risks of ENOSPC especially with max_wal_size close to the
maximum allowed.

Author: Michael Paquier
Reviewed-by: Yugo Nagata, Heikki Linnakangas
Discussion: https://postgr.es/m/20180514054955.GF1528@paquier.xyz
2018-07-13 06:43:20 +09:00
Peter Eisentraut bcbd940806 Remove dynamic_shared_memory_type=none
PostgreSQL nowadays offers some kind of dynamic shared memory feature on
all supported platforms.  Having the choice of "none" prevents us from
relying on DSM in core features.  So this patch removes the choice of
"none".

Author: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
2018-07-10 18:35:24 +02:00
Alvaro Herrera a22445ff0b Flip argument order in XLogSegNoOffsetToRecPtr
Commit fc49e24fa6 added an input argument after the existing output
argument.  Flip those.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20180708182345.imdgovmkffgtihhk@alvherre.pgsql
2018-07-09 14:33:38 -04:00
Michael Paquier cbc55da556 Rework order of end-of-recovery actions to delay timeline history write
A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written.  Two such failures are for example
an error while processing a two-phase state files or when operating on
recovery.conf.  With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so as the window between the moment the file is written and the
end-of-recovery record is generated gets minimized. This way, in the
event of a crash or a failure, the new timeline decided at promotion
will not seem taken by other nodes in the cluster.  It is not really
possible to reduce to zero this window, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when
adding new end-of-recovery actions.  The original report from Magnus
Hagander mentioned a renamed recovery.conf as original end-of-recovery
failure which caused a timeline to be seen as taken but the subsequent
processing on the now-missing recovery.conf cause the startup process to
issue stop on FATAL, which at follow-up startup made the system
inconsistent because of on-disk changes which already happened.

Processing of two-phase state files still needs some work as corrupted
entries are simply ignored now.  This is left as a future item and this
commit fixes the original complain.

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
2018-07-09 10:22:34 +09:00
Michael Paquier 3c64dcb1e3 Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed.  This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed.  When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed.  Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.

Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me.  The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore.  The test gets easily supported down to 10, still it
has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro
Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
2018-07-05 10:46:18 +09:00
Fujii Masao b41669118c Improve the performance of relation deletes during recovery.
When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which led to scan shared_buffers
multiple times. Obviously this could cause to increase the WAL replay
time very much especially when shared_buffers was huge.

To alleviate this situation, this commit changes the recovery so that
it also calls smgrdounlinkall() only one time to delete multiple
relation files.

This is just fix for oversight of commit 279628a0a7, not new feature.
So, per discussion on pgsql-hackers, we concluded to backpatch this
to all supported versions.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
2018-07-05 02:23:46 +09:00
Michael Paquier c55de5e512 Add wait event for fsync of WAL segments
This has been visibly a forgotten spot in the first implementation of
wait events for I/O added by 249cf07, and what has been missing is a
fsync call for WAL segments which is a wrapper reacting on the value of
GUC wal_sync_method.

Reported-by: Konstantin Knizhnik
Author: Konstantin Knizhnik
Reviewed-by: Craig Ringer, Michael Paquier
Discussion: https://postgr.es/m/4a243897-0ad8-f471-aa40-242591f2476e@postgrespro.ru
2018-07-02 22:19:46 +09:00
Andrew Dunstan 1e9c858090 pgindent run prior to branching 2018-06-30 12:25:49 -04:00
Michael Paquier 6cb3372411 Address set of issues with errno handling
System calls mixed up in error code paths are causing two issues which
several code paths have not correctly handled:
1) For write() calls, sometimes the system may return less bytes than
what has been written without errno being set.  Some paths were careful
enough to consider that case, and assumed that errno should be set to
ENOSPC, other calls missed that.
2) errno generated by a system call is overwritten by other system calls
which may succeed once an error code path is taken, causing what is
reported to the user to be incorrect.

This patch uses the brute-force approach of correcting all those code
paths.  Some refactoring could happen in the future, but this is let as
future work, which is not targeted for back-branches anyway.

Author: Michael Paquier
Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
2018-06-25 11:19:05 +09:00
Michael Paquier 0aa5e65ab4 Fix typo in comment of commit_ts.c for incorrect reference to CLOG
Author: Shao Bret
2018-06-22 13:30:26 +09:00
Michael Paquier 70b4f82a4b Prevent hard failures of standbys caused by recycled WAL segments
When a standby's WAL receiver stops reading WAL from a WAL stream, it
writes data to the current WAL segment without having priorily zero'ed
the page currently written to, which can cause the WAL reader to read
junk data from a past recycled segment and then it would try to get a
record from it.  While sanity checks in place provide most of the
protection needed, in some rare circumstances, with chances increasing
when a record header crosses a page boundary, then the startup process
could fail violently on an allocation failure, as follows:
FATAL:  invalid memory alloc request size XXX

This is confusing for the user and also unhelpful as this requires in
the worst case a manual restart of the instance, impacting potentially
the availability of the cluster, and this also makes WAL data look like
it is in a corrupted state.

The chances of seeing failures are higher if the connection between the
standby and its root node is unstable, causing WAL pages to be written
in the middle.  A couple of approaches have been discussed, like
zero-ing  new WAL pages within the WAL receiver itself but this has the
disadvantage of impacting performance of any existing instances as this
breaks the sequential writes done by the WAL receiver.  This commit
deals with the problem with a more simple approach, which has no
performance impact without reducing the detection of the problem: if a
record is found with a length higher than 1GB for backends, then do not
try any allocation and report a soft failure which will force the
standby to retry reading WAL.  It could be possible that the allocation
call passes and that an unnecessary amount of memory is allocated,
however follow-up checks on records would just fail, making this
allocation short-lived anyway.

This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.

Backpatch down to 9.5, which is where palloc_extended has been
introduced.

Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
2018-06-18 10:43:27 +09:00
Simon Riggs 15378c1a15 Remove AELs from subxids correctly on standby
Issues relate only to subtransactions that hold AccessExclusiveLocks
when replayed on standby.

Prior to PG10, aborting subtransactions that held an
AccessExclusiveLock failed to release the lock until top level commit or
abort. 49bff5300d fixed that.

However, 49bff5300d also introduced a similar bug where subtransaction
commit would fail to release an AccessExclusiveLock, leaving the lock to
be removed sometimes early and sometimes late. This commit fixes
that bug also. Backpatch to PG10 needed.

Tested by observation. Note need for multi-node isolationtester to improve
test coverage for this and other HS cases.

Reported-by: Simon Riggs
Author: Simon Riggs
2018-06-16 14:03:29 +01:00
Alvaro Herrera ff03112bdc Fix off-by-one bug in XactLogCommitRecord
Commit 1eb6d6527a introduced zeroed alignment bytes in the GID field
of commit/abort WAL records.  Fixup commit cf5a189059 later changed
that representation into a regular cstring with a single terminating
zero byte, but it also introduced an off-by-one mistake.  Fix that.

Author: Nikhil Sontakke
Reported-by: Nikhil Sontakke
Discussion: https://postgr.es/m/CAMGcDxey6dG1DP34_tJMoWPcp5sPJUAL4K5CayUUXLQSx2GQpA@mail.gmail.com
2018-06-15 15:00:41 -04:00
Alvaro Herrera eee381ef5e Fix function code in error report
This bug causes a lseek() failure to be reported as a "could not open"
failure in the error message, muddling bug reports.  I introduced this
copy-and-pasteo in commit 78e1220104.

Noticed while reviewing code for bug report #15221, from lily liang.  In
version 10 the affected function is only used by multixact.c and
commit_ts, and only in corner-case circumstances, neither of which are
involved in the reported bug (a pg_subtrans failure.)

Author: Álvaro Herrera
2018-06-06 14:48:08 -04:00
Magnus Hagander cfb758b6d9 Fix error message on short read of pg_control
Instead of saying "error: success", indicate that we got a working read
but it was too short.
2018-05-18 17:54:18 +02:00
Peter Eisentraut 9effb63e0d Message wording and pluralization improvements 2018-05-17 23:05:27 -04:00
Heikki Linnakangas 0668719801 Fix scenario where streaming standby gets stuck at a continuation record.
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.

Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.

Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.

Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-05 01:34:53 +03:00
Teodor Sigaev 8f9be261f4 Add HOLD_INTERRUPTS section into FinishPreparedTransaction.
If an interrupt arrives in the middle of FinishPreparedTransaction
and any callback decide to call CHECK_FOR_INTERRUPTS (e.g.
RemoveTwoPhaseFile can write a warning with ereport, which checks for
interrupts) then it's possible to leave current GXact undeleted.

Backpatch to all supported branches

Stas Kelvich

Discussion: ihttps://www.postgresql.org/message-id/3AD85097-A3F3-4EBA-99BD-C38EDF8D2949@postgrespro.ru
2018-05-03 20:08:29 +03:00
Tom Lane 9cb7db3f0c In AtEOXact_Files, complain if any files remain unclosed at commit.
This change makes this module act more like most of our other low-level
resource management modules.  It's a caller error if something is not
explicitly closed by the end of a successful transaction, so issue
a WARNING about it.  This would not actually have caught the file leak
bug fixed in commit 231bcd080, because that was in a transaction-abort
path; but it still seems like a good, and pretty cheap, cross-check.

Discussion: https://postgr.es/m/152056616579.4966.583293218357089052@wrigleys.postgresql.org
2018-04-28 17:45:02 -04:00
Tom Lane bdf46af748 Post-feature-freeze pgindent run.
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-26 14:47:16 -04:00
Heikki Linnakangas cf5a189059 Fix confusion on the padding of GIDs in on commit and abort records.
Review of commit 1eb6d652: It's pointless to add padding to the GID fields,
when the code that follows assumes that there is no alignment, and uses
memcpy(). Remove the pointless padding.

Update comments to note the new fields in the WAL records.

Reviewed-by: Michael Paquier
Discussion: https://www.postgresql.org/message-id/33b787bf-dc20-1161-54e9-3f3b607bf59d%40iki.fi
2018-04-17 16:10:42 -04:00
Tom Lane d1e9079295 Ignore nextOid when replaying an ONLINE checkpoint.
The nextOid value is from the start of the checkpoint and may well be stale
compared to values from more recent XLOG_NEXTOID records.  Previously, we
adopted it anyway, allowing the OID counter to go backwards during a crash.
While this should be harmless, it contributed to the severity of the bug
fixed in commit 0408e1ed5, by allowing duplicate TOAST OIDs to be assigned
immediately following a crash.  Without this error, that issue would only
have arisen when TOAST objects just younger than a multiple of 2^32 OIDs
were deleted and then not vacuumed in time to avoid a conflict.

Pavan Deolasee

Discussion: https://postgr.es/m/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
2018-04-11 18:11:29 -04:00
Tom Lane af1a949109 Further cleanup of client dependencies on src/include/catalog headers.
In commit 9c0a0de4c, I'd failed to notice that catalog/catalog.h
should also be considered a frontend-unsafe header, because it includes
(and needs) the full form of pg_class.h, not to mention relcache.h.
However, various frontend code was depending on it to get
TABLESPACE_VERSION_DIRECTORY, so refactoring of some sort is called for.

The cleanest answer seems to be to move TABLESPACE_VERSION_DIRECTORY,
as well as the OIDCHARS symbol, to common/relpath.h.  Do that, and mop up
inclusions as necessary.  (I found that quite a few current users of
catalog/catalog.h don't seem to need it at all anymore, apparently as a
result of the refactorings that created common/relpath.[hc].  And
initdb.c needed it only as a route to pg_class_d.h.)

Discussion: https://postgr.es/m/6629.1523294509@sss.pgh.pa.us
2018-04-09 14:39:58 -04:00
Magnus Hagander a228cc13ae Revert "Allow on-line enabling and disabling of data checksums"
This reverts the backend sides of commit 1fde38beaa.
I have, at least for now, left the pg_verify_checksums tool in place, as
this tool can be very valuable without the rest of the patch as well,
and since it's a read-only tool that only runs when the cluster is down
it should be a lot safer.
2018-04-09 19:03:42 +02:00
Stephen Frost da9b580d89 Refactor dir/file permissions
Consolidate directory and file create permissions for tools which work
with the PG data directory by adding a new module (common/file_perm.c)
that contains variables (pg_file_create_mode, pg_dir_create_mode) and
constants to initialize them (0600 for files and 0700 for directories).

Convert mkdir() calls in the backend to MakePGDirectory() if the
original call used default permissions (always the case for regular PG
directories).

Add tests to make sure permissions in PGDATA are set correctly by the
tools which modify the PG data directory.

Authors: David Steele <david@pgmasters.net>,
         Adam Brightwell <adam.brightwell@crunchydata.com>
Reviewed-By: Michael Paquier, with discussion amongst many others.
Discussion: https://postgr.es/m/ad346fe6-b23e-59f1-ecb7-0e08390ad629%40pgmasters.net
2018-04-07 17:45:39 -04:00
Magnus Hagander 1fde38beaa Allow on-line enabling and disabling of data checksums
This makes it possible to turn checksums on in a live cluster, without
the previous need for dump/reload or logical replication (and to turn it
off).

Enabling checkusm starts a background process in the form of a
launcher/worker combination that goes through the entire database and
recalculates checksums on each and every page. Only when all pages have
been checksummed are they fully enabled in the cluster. Any failure of
the process will revert to checksums off and the process has to be
started.

This adds a new WAL record that indicates the state of checksums, so
the process works across replicated clusters.

Authors: Magnus Hagander and Daniel Gustafsson
Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin
2018-04-05 22:04:48 +02:00
Magnus Hagander eed1ce72e1 Allow background workers to bypass datallowconn
THis adds a "flags" field to the BackgroundWorkerInitializeConnection()
and BackgroundWorkerInitializeConnectionByOid(). For now only one flag,
BGWORKER_BYPASS_ALLOWCONN, is defined, which allows the worker to ignore
datallowconn.
2018-04-05 19:02:45 +02:00
Tom Lane 4a33bb59df Ensure that WAL pages skipped by a forced WAL switch are zero-filled.
In the previous coding, skipped pages were mostly zeroes, but they still
had valid WAL page headers.  That makes them very much less compressible
than an unbroken string of zeroes would be --- about 10X worse for bzip2
compression, for instance.  We don't need those headers, so tweak the logic
so that we zero them out.

Chapman Flack, reviewed by Daniel Gustafsson

Discussion: https://postgr.es/m/579297F8.7020107@anastigmatix.net
2018-03-30 16:18:18 -04:00
Magnus Hagander 019fa576ca Fix typo in comment
Author: Michael Paquier <michael@paquier.xyz>
2018-03-30 12:35:13 +02:00
Simon Riggs 1eb6d6527a Store 2PC GID in commit/abort WAL recs for logical decoding
Store GID of 2PC in commit/abort WAL records when wal_level = logical.
This allows logical decoding to send the SAME gid to subscribers
across restarts of logical replication.

Track relica origin replay progress for 2PC.

(Edited from patch 0003 in the logical decoding 2PC series.)

Authors: Nikhil Sontakke, Stas Kelvich
Reviewed-by: Simon Riggs, Andres Freund
2018-03-28 17:42:50 +01:00
Tom Lane 442accc3fe Allow memory contexts to have both fixed and variable ident strings.
Originally, we treated memory context names as potentially variable in
all cases, and therefore always copied them into the context header.
Commit 9fa6f00b1 rethought this a little bit and invented a distinction
between fixed and variable names, skipping the copy step for the former.
But we can make things both simpler and more useful by instead allowing
there to be two parts to a context's identification, a fixed "name" and
an optional, variable "ident".  The name supplied in the context create
call is now required to be a compile-time-constant string in all cases,
as it is never copied but just pointed to.  The "ident" string, if
wanted, is supplied later.  This is needed because typically we want
the ident to be stored inside the context so that it's cleaned up
automatically on context deletion; that means it has to be copied into
the context before we can set the pointer.

The cost of this approach is basically just an additional pointer field
in struct MemoryContextData, which isn't much overhead, and is bought
back entirely in the AllocSet case by not needing a headerSize field
anymore, since we no longer have to cope with variable header length.
In addition, we can simplify the internal interfaces for memory context
creation still further, saving a few cycles there.  And it's no longer
true that a custom identifier disqualifies a context from participating
in aset.c's freelist scheme, so possibly there's some win on that end.

All the places that were using non-compile-time-constant context names
are adjusted to put the variable info into the "ident" instead.  This
allows more effective identification of those contexts in many cases;
for example, subsidary contexts of relcache entries are now identified
by both type (e.g. "index info") and relname, where before you got only
one or the other.  Contexts associated with PL function cache entries
are now identified more fully and uniformly, too.

I also arranged for plancache contexts to use the query source string
as their identifier.  This is basically free for CachedPlanSources, as
they contained a copy of that string already.  We pay an extra pstrdup
to do it for CachedPlans.  That could perhaps be avoided, but it would
make things more fragile (since the CachedPlanSource is sometimes
destroyed first).  I suspect future improvements in error reporting will
require CachedPlans to have a copy of that string anyway, so it's not
clear that it's worth moving mountains to avoid it now.

This also changes the APIs for context statistics routines so that the
context-specific routines no longer assume that output goes straight
to stderr, nor do they know all details of the output format.  This
is useful immediately to reduce code duplication, and it also allows
for external code to do something with stats output that's different
from printing to stderr.

The reason for pushing this now rather than waiting for v12 is that
it rethinks some of the API changes made by commit 9fa6f00b1.  Seems
better for extension authors to endure just one round of API changes
not two.

Discussion: https://postgr.es/m/CAB=Je-FdtmFZ9y9REHD7VsSrnCkiBhsA4mdsLKSPauwXtQBeNA@mail.gmail.com
2018-03-27 16:46:51 -04:00
Tom Lane feb8254518 Improve style guideline compliance of assorted error-report messages.
Per the project style guide, details and hints should have leading
capitalization and end with a period.  On the other hand, errcontext should
not be capitalized and should not end with a period.  To support well
formatted error contexts in dblink, extend dblink_res_error() to take a
format+arguments rather than a hardcoded string.

Daniel Gustafsson

Discussion: https://postgr.es/m/B3C002C8-21A0-4F53-A06E-8CAB29FCF295@yesql.se
2018-03-22 17:33:10 -04:00
Robert Haas 8a8c4f3b32 Fix typo in comment.
Michael Paquier

Discussion: http://postgr.es/m/20180205071404.GB17337@paquier.xyz
2018-03-22 13:36:14 -04:00
Peter Eisentraut 4120864b9e Change transaction state debug strings to match enum symbols
In some cases, these were different for no apparent reason, making
debugging unnecessarily mysterious.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
2018-03-16 13:18:06 -04:00
Peter Eisentraut 81148856b0 Improve savepoint error messages
Include the savepoint name in the error message and rephrase it a bit to
match common style.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
2018-03-16 13:18:06 -04:00
Peter Eisentraut ec87efde8d Simplify parse representation of savepoint commands
Instead of embedding the savepoint name in a list and then requiring
complex code to unpack it, just add another struct field to store it
directly.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
2018-03-16 13:18:06 -04:00
Peter Eisentraut 04700b685f Rename TransactionChain functions
We call this thing a "transaction block" everywhere except in a few
functions, where it is mysteriously called a "transaction chain".  In
the SQL standard, a transaction chain is something different.  So rename
these functions to match the common terminology.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
2018-03-16 13:18:06 -04:00
Peter Eisentraut 8d47a90862 Update function comments
After a6542a4b68, some function comments
were misplaced.  Fix that.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
2018-03-16 13:18:05 -04:00
Peter Eisentraut 7923118c16 Minor comment fix 2018-02-17 20:45:02 -05:00
Magnus Hagander f8437c819a Fix typo in comment 2018-02-16 12:46:41 +01:00
Tom Lane 9a725f7b5c Silence assorted "variable may be used uninitialized" warnings.
All of these are false positives, but in each case a fair amount of
analysis is needed to see that, and it's not too surprising that not all
compilers are smart enough.  (In particular, in the logtape.c case, a
compiler lacking the knowledge provided by the Assert would almost surely
complain, so that this warning will be seen in any non-assert build.)

Some of these are of long standing while others are pretty recent,
but it only seems worth fixing them in HEAD.

Jaime Casanova, tweaked a bit by me

Discussion: https://postgr.es/m/CAJGNTeMcYAMJdPAom52dppLMtF-UnEZi0dooj==75OEv1EoBZA@mail.gmail.com
2018-02-14 16:06:49 -05:00
Robert Haas b98a7cd58f Update out-of-date comment in StartupXLOG.
Commit 4b0d28de06 should have updated
this comment, but did not.

Thomas Munro

Discussion: http://postgr.es/m/CAEepm=0iJ8aqQcF9ij2KerAkuHF3SwrVTzjMdm1H4w++nfBf9A@mail.gmail.com
2018-02-07 08:48:04 -05:00
Tom Lane 957ff087c8 Be more wary about shm_toc_lookup failure.
Commit 445dbd82a basically missed the point of commit d46633506,
which was that we shouldn't allow shm_toc_lookup() failure to lead
to a core dump or assertion crash, because the odds of such a
failure should never be considered negligible.  It's correct that
we can't expect the PARALLEL_KEY_ERROR_QUEUE TOC entry to be there
if we have no workers.  But if we have no workers, we're not going
to do anything in this function with the lookup result anyway,
so let's just skip it.  That lets the code use the easy-to-prove-safe
noError=false case, rather than anything requiring effort to review.

Back-patch to v10, like the previous commit.

Discussion: https://postgr.es/m/3647.1517601675@sss.pgh.pa.us
2018-02-02 18:26:07 -05:00
Robert Haas 9da0cc3528 Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds.  Testing
to date shows that this can often be 2-3x faster than a serial
index build.

The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature.  We can
refine it as we get more experience.

Peter Geoghegan with some help from Rushabh Lathia.  While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature.  Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.

Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 13:32:44 -05:00
Robert Haas 9222c0d9ed Add new function WaitForParallelWorkersToAttach.
Once this function has been called, we know that all workers have
started and attached to their error queues -- so if any of them
subsequently exit uncleanly, we'll be sure to throw an ERROR promptly.
Otherwise, users of the ParallelContext machinery must be careful not
to wait forever for a worker that has failed to start.  Parallel query
manages to work without needing this for reasons explained in new
comments added by this patch, but it's a useful primitive for other
parallel operations, such as the pending patch to make creating a
btree index run in parallel.

Amit Kapila, revised by me.  Additional review by Peter Geoghegan.

Discussion: http://postgr.es/m/CAA4eK1+e2MzyouF5bg=OtyhDSX+=Ao=3htN=T-r_6s3gCtKFiw@mail.gmail.com
2018-02-02 09:00:59 -05:00
Robert Haas 28e04155f1 Update obsolete sentence in README.parallel.
Since 9.6, heavyweight locking is not an abstract and unhandled
concern of the parallel machinery, but rather something to which
we have a specific approach.
2018-01-23 11:22:47 -05:00
Robert Haas 2badb5afb8 Report an ERROR if a parallel worker fails to start properly.
Commit 28724fd90d fixed things so that
if a background worker fails to start due to fork() failure or because
it is terminated before startup succeeds, BGWH_STOPPED will be
reported.  However, that only helps if the code that uses the
background worker machinery notices the change in status, and the code
in parallel.c did not.

To fix that, do two things.  First, make sure that when a worker
exits, it triggers the leader to read from error queues.  That way, if
a worker which has attached to an error queue exits uncleanly, the
leader is sure to throw some error, either the contents of the
ErrorResponse sent by the worker, or "lost connection to parallel
worker" if it exited without sending one.  To cover the case where
the worker never starts up in the first place or exits before
attaching to the error queue, the ParallelContext now keeps track
of which workers have sent at least one message via the error
queue.  A worker which sends no messages by the time the parallel
operation finishes will be checked to see whether it exited before
attaching to the error queue; if so, a new error message, "parallel
worker failed to initialize", will be reported.  If not, we'll
continue to wait until it either starts up and exits cleanly, starts
up and exits uncleanly, or fails to start, and then take the
appropriate action.

Patch by me, reviewed by Amit Kapila.

Discussion: http://postgr.es/m/CA+TgmoYnBgXgdTu6wk5YPdWhmgabYc9nY_pFLq=tB=FSLYkD8Q@mail.gmail.com
2018-01-23 11:03:03 -05:00
Robert Haas 29d58fd3ad Transfer state pertaining to pending REINDEX operations to workers.
This will allow the pending patch for parallel CREATE INDEX to work
on system catalogs, and to provide the same level of protection
against use of user indexes while they are being rebuilt that we
have for non-parallel CREATE INDEX.

Patch by me, reviewed by Peter Geoghegan.

Discussion: http://postgr.es/m/CA+TgmoYN-YQU9JsGQcqFLovZ-C+Xgp1_xhJQad=cunGG-_p5gg@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wzkv4UNkXYhqQRqk-u9rS7h5c-4cCW+EqQ8K_WSeS43aZg@mail.gmail.com
2018-01-19 07:48:54 -05:00
Simon Riggs 6271fceb8a Add TIMELINE to backup_label file
Allows new test to confirm timelines match

Author: Michael Paquier
Reviewed-by: David Steele
2018-01-06 12:24:19 +00:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Simon Riggs 2958a672b1 Extend near-wraparound hints to include replication slots
Author: Feike Steenbergen
Reviewed-by: Michael Paquier
2017-12-29 14:01:25 +00:00
Robert Haas cce1ecfc77 Adjust assertion in GetCurrentCommandId.
currentCommandIdUsed is only used to skip redundant increments of the
command counter, and CommandCounterIncrement() is categorically denied
under parallelism anyway.  Therefore, it's OK for
GetCurrentCommandId() to mark the counter value used, as long as it
happens in the leader, not a worker.

Prior to commit e9baa5e9fa, the slightly
incorrect check didn't matter, but now it does.  A test case added by
commit 1804284042 uncovered the problem
by accident; it caused failures with force_parallel_mode=on/regress.

Report and review by Andres Freund.  Patch by me.

Discussion: http://postgr.es/m/20171221143106.5lhtygohvmazli3x@alap3.anarazel.de
2017-12-21 13:19:59 -05:00
Robert Haas 59d1e2b95a Cancel CV sleep during subtransaction abort.
Generally, error recovery paths that need to do things like
LWLockReleaseAll and pgstat_report_wait_end also need to call
ConditionVariableCancelSleep, but AbortSubTransaction was missed.

Since subtransaction abort might destroy up the DSM segment that
contains the ConditionVariable stored in cv_sleep_target, this
can result in a crash for anything using condition variables.

Reported and diagnosed by Andres Freund.

Discussion: http://postgr.es/m/20171221110048.rxk6464azzl5t2fi@alap3.anarazel.de
2017-12-21 09:24:30 -05:00
Fujii Masao 56a95ee511 Fix bug in cancellation of non-exclusive backup to avoid assertion failure.
Previously an assertion failure occurred when pg_stop_backup() for
non-exclusive backup was aborted while it's waiting for WAL files to
be archived. This assertion failure happened in do_pg_abort_backup()
which was called when a non-exclusive backup was canceled.
do_pg_abort_backup() assumes that there is at least one non-exclusive
backup running when it's called. But pg_stop_backup() can be canceled
even after it marks the end of non-exclusive backup (e.g.,
during waiting for WAL archiving). This broke the assumption that
do_pg_abort_backup() relies on, and which caused an assertion failure.

This commit changes do_pg_abort_backup() so that it does nothing
when non-exclusive backup has been already marked as completed.
That is, the asssumption is also changed, and do_pg_abort_backup()
now can handle even the case where it's called when there is
no running backup.

Backpatch to 9.6 where SQL-callable non-exclusive backup was added.

Author: Masahiko Sawada and Michael Paquier
Reviewed-By: Robert Haas and Fujii Masao
Discussion: https://www.postgresql.org/message-id/CAD21AoD2L1Fu2c==gnVASMyFAAaq3y-AQ2uEVj-zTCGFFjvmDg@mail.gmail.com
2017-12-19 03:46:14 +09:00
Tom Lane 9fa6f00b13 Rethink MemoryContext creation to improve performance.
This patch makes a number of interrelated changes to reduce the overhead
involved in creating/deleting memory contexts.  The key ideas are:

* Include the AllocSetContext header of an aset.c context in its first
malloc request, rather than allocating it separately in TopMemoryContext.
This means that we now always create an initial or "keeper" block in an
aset, even if it never receives any allocation requests.

* Create freelists in which we can save and recycle recently-destroyed
asets (this idea is due to Robert Haas).

* In the common case where the name of a context is a constant string,
just store a pointer to it in the context header, rather than copying
the string.

The first change eliminates a palloc/pfree cycle per context, and
also avoids bloat in TopMemoryContext, at the price that creating
a context now involves a malloc/free cycle even if the context never
receives any allocations.  That would be a loser for some common
usage patterns, but recycling short-lived contexts via the freelist
eliminates that pain.

Avoiding copying constant strings not only saves strlen() and strcpy()
overhead, but is an essential part of the freelist optimization because
it makes the context header size constant.  Currently we make no
attempt to use the freelist for contexts with non-constant names.
(Perhaps someday we'll need to think harder about that, but in current
usage, most contexts with custom names are long-lived anyway.)

The freelist management in this initial commit is pretty simplistic,
and we might want to refine it later --- but in common workloads that
will never matter because the freelists will never get full anyway.

To create a context with a non-constant name, one is now required to
call AllocSetContextCreateExtended and specify the MEMCONTEXT_COPY_NAME
option.  AllocSetContextCreate becomes a wrapper macro, and it includes
a test that will complain about non-string-literal context name
parameters on gcc and similar compilers.

An unfortunate side effect of making AllocSetContextCreate a macro is
that one is now *required* to use the size parameter abstraction macros
(ALLOCSET_DEFAULT_SIZES and friends) with it; the pre-9.6 habit of
writing out individual size parameters no longer works unless you
switch to AllocSetContextCreateExtended.

Internally to the memory-context-related modules, the context creation
APIs are simplified, removing the rather baroque original design whereby
a context-type module called mcxt.c which then called back into the
context-type module.  That saved a bit of code duplication, but not much,
and it prevented context-type modules from exercising control over the
allocation of context headers.

In passing, I converted the test-and-elog validation of aset size
parameters into Asserts to save a few more cycles.  The original thought
was that callers might compute size parameters on the fly, but in practice
nobody does that, so it's useless to expend cycles on checking those
numbers in production builds.

Also, mark the memory context method-pointer structs "const",
just for cleanliness.

Discussion: https://postgr.es/m/2264.1512870796@sss.pgh.pa.us
2017-12-13 13:55:16 -05:00
Peter Eisentraut af9f8b7ca3 Fix mistake in comment
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
2017-12-08 11:23:36 -05:00
Tom Lane 066bc21c0e Simplify do_pg_start_backup's API by opening pg_tblspc internally.
do_pg_start_backup() expects its callers to pass in an open DIR pointer
for the pg_tblspc directory, but there's no apparent advantage in that.
It complicates the callers without adding any flexibility, and there's no
robustness advantage, since we surely have to be prepared for errors during
the scan of pg_tblspc anyway.  In fact, by holding an extra kernel resource
during operations like the preliminary checkpoint, we might be making
things a fraction more failure-prone not less.  Hence, remove that argument
and open the directory just for the duration of the actual scan.

Discussion: https://postgr.es/m/28752.1512413887@sss.pgh.pa.us
2017-12-04 18:37:54 -05:00
Tom Lane 2069e6faa0 Clean up assorted messiness around AllocateDir() usage.
This patch fixes a couple of low-probability bugs that could lead to
reporting an irrelevant errno value (and hence possibly a wrong SQLSTATE)
concerning directory-open or file-open failures.  It also fixes places
where we took shortcuts in reporting such errors, either by using elog
instead of ereport or by using ereport but forgetting to specify an
errcode.  And it eliminates a lot of just plain redundant error-handling
code.

In service of all this, export fd.c's formerly-static function
ReadDirExtended, so that external callers can make use of the coding
pattern

	dir = AllocateDir(path);
	while ((de = ReadDirExtended(dir, path, LOG)) != NULL)

if they'd like to treat directory-open failures as mere LOG conditions
rather than errors.  Also fix FreeDir to be a no-op if we reach it
with dir == NULL, as such a coding pattern would cause.

Then, remove code at many call sites that was throwing an error or log
message for AllocateDir failure, as ReadDir or ReadDirExtended can handle
that job just fine.  Aside from being a net code savings, this gets rid of
a lot of not-quite-up-to-snuff reports, as mentioned above.  (In some
places these changes result in replacing a custom error message such as
"could not open tablespace directory" with more generic wording "could not
open directory", but it was agreed that the custom wording buys little as
long as we report the directory name.)  In some other call sites where we
can't just remove code, change the error reports to be fully
project-style-compliant.

Also reorder code in restoreTwoPhaseData that was acquiring a lock
between AllocateDir and ReadDir; in the unlikely but surely not
impossible case that LWLockAcquire changes errno, AllocateDir failures
would be misreported.  There is no great value in opening the directory
before acquiring TwoPhaseStateLock, so just do it in the other order.

Also fix CheckXLogRemoved to guarantee that it preserves errno,
as quite a number of call sites are implicitly assuming.  (Again,
it's unlikely but I think not impossible that errno could change
during a SpinLockAcquire.  If so, this function was broken for its
own purposes as well as breaking callers.)

And change a few places that were using not-per-project-style messages,
such as "could not read directory" when "could not open directory" is
more correct.

Back-patch the exporting of ReadDirExtended, in case we have occasion
to back-patch some fix that makes use of it; it's not needed right now
but surely making it global is pretty harmless.  Also back-patch the
restoreTwoPhaseData and CheckXLogRemoved fixes.  The rest of this is
essentially cosmetic and need not get back-patched.

Michael Paquier, with a bit of additional work by me

Discussion: https://postgr.es/m/CAB7nPqRpOCxjiirHmebEFhXVTK7V5Jvw4bz82p7Oimtsm3TyZA@mail.gmail.com
2017-12-04 17:02:56 -05:00
Robert Haas eaedf0df71 Update typedefs.list and re-run pgindent
Discussion: http://postgr.es/m/CA+TgmoaA9=1RWKtBWpDaj+sF3Stgc8sHgf5z=KGtbjwPLQVDMA@mail.gmail.com
2017-11-29 09:24:24 -05:00
Robert Haas 445dbd82a3 Fix ReinitializeParallelDSM to tolerate finding no error queues.
Commit d466335064 changed things so
that shm_toc_lookup would fail with an error rather than silently
returning NULL in the hope that such failures would be reported
in a useful way rather than via a system crash.  However, it
overlooked the fact that the lookup of PARALLEL_KEY_ERROR_QUEUE
in ReinitializeParallelDSM is expected to fail when no DSM segment
was created in the first place; in that case, we end up with a
backend-private memory segment that still contains an entry for
PARALLEL_KEY_FIXED but no others.  Consequently a benign failure
to initialize parallelism can escalate into an elog(ERROR);
repair.

Discussion: http://postgr.es/m/CA+Tgmob8LFw55DzH1QEREpBEA9RJ_W_amhBFCVZ6WMwUhVpOqg@mail.gmail.com
2017-11-28 12:15:38 -05:00
Robert Haas 487a0c1518 Fix typo.
Masahiko Sawada

Discussion: http://postgr.es/m/CAD21AoCq_QG6UEo6yb-purmhLQjDLsCA2_B+8Wh3ah9P2-XmrQ@mail.gmail.com
2017-11-28 08:19:01 -05:00
Simon Riggs 59af8d4384 Pad XLogReaderState's per-buffer data_bufsz more aggressively.
Originally, we palloc'd this buffer just barely big enough to hold the
largest xlog backup block seen so far. We now MAXALIGN the palloc size.

The original coding could result in many repeated palloc cycles, in the
worst case where we see a series ofgradually larger xlog records.  We
ameliorate that similarly to 8735978e7a
by imposing a minimum buffer size of BLCKSZ.

Discussion: https://postgr.es/m/E1eHa4J-0006hI-Q8@gemulon.postgresql.org
2017-11-27 09:34:05 +00:00
Tom Lane 8735978e7a Pad XLogReaderState's main_data buffer more aggressively.
Originally, we palloc'd this buffer just barely big enough to hold the
largest xlog record seen so far.  It turns out that that can result in
valgrind complaints, because some compilers will emit code that assumes
it can safely fetch padding bytes at the end of a struct, and those
padding bytes were unallocated so far as aset.c was concerned.  We can
fix that by MAXALIGN'ing the palloc request size, ensuring that it is big
enough to include any possible padding that might've been omitted from
the on-disk record.

An additional objection to the original coding is that it could result in
many repeated palloc cycles, in the worst case where we see a series of
gradually larger xlog records.  We can ameliorate that cheaply by
imposing a minimum buffer size that's large enough for most xlog records.
BLCKSZ/2 was chosen after a bit of discussion.

In passing, remove an obsolete comment in struct xl_heap_new_cid that the
combocid field is free due to alignment considerations.  Perhaps that was
true at some point, but it's not now.

Back-patch to 9.5 where this code came in.

Discussion: https://postgr.es/m/E1eHa4J-0006hI-Q8@gemulon.postgresql.org
2017-11-26 15:17:24 -05:00
Peter Eisentraut 0e1539ba0d Add some const decorations to prototypes
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
2017-11-10 13:38:57 -05:00
Peter Eisentraut 2eb4a831e5 Change TRUE/FALSE to true/false
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources.  The upper case spellings are only used
in some files/modules.  So standardize on the standard spellings.

The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.

In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
2017-11-08 11:37:28 -05:00
Simon Riggs 4b0d28de06 Remove secondary checkpoint
Previously server reserved WAL for last two checkpoints,
which used too much disk space for small servers.

Bumps PG_CONTROL_VERSION

Author: Simon Riggs <simon@2ndQuadrant.com>
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
2017-11-07 12:56:30 -05:00
Robert Haas 846fcc8516 Fix problems with the "role" GUC and parallel query.
Without this fix, dropping a role can sometimes result in parallel
query failures in sessions that have used "SET ROLE" to assume the
dropped role, even if that setting isn't active any more.

Report by Pavan Deolasee.  Patch by Amit Kapila, reviewed by me.

Discussion: http://postgr.es/m/CABOikdOomRcZsLsLK+Z+qENM1zxyaWnAvFh3MJZzZnnKiF+REg@mail.gmail.com
2017-10-29 12:58:40 +05:30
Andres Freund 31079a4a8e Replace remaining uses of pq_sendint with pq_sendint{8,16,32}.
pq_sendint() remains, so extension code doesn't unnecessarily break.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
2017-10-11 21:00:46 -07:00
Tom Lane 5fa6b0d102 Remove unnecessary PG_TRY overhead for CurrentResourceOwner changes.
resowner/README contained advice to use a PG_TRY block to restore the
old CurrentResourceOwner value anywhere that that variable is transiently
changed.  That advice was only inconsistently followed, however, and
on reflection it seems like unnecessary overhead.  We don't bother
with such a convention for transient CurrentMemoryContext changes,
on the grounds that any (sub)transaction abort will start out by
resetting CurrentMemoryContext to what it wants.  But the same is
true of CurrentResourceOwner, so there seems no need to treat it
differently.

Hence, remove PG_TRY blocks that exist only to restore CurrentResourceOwner
before re-throwing the error.  There are a couple of places that restore
it along with some other actions, and I left those alone; the restore is
probably unnecessary but no noticeable gain will result from removing it.

Discussion: https://postgr.es/m/5236.1507583529@sss.pgh.pa.us
2017-10-11 17:44:09 -04:00
Tom Lane 6b87416c9a Fix access-off-end-of-array in clog.c.
Sloppy loop coding in set_status_by_pages() resulted in fetching one array
element more than it should from the subxids[] array.  The odds of this
resulting in SIGSEGV are pretty small, but we've certainly seen that happen
with similar mistakes elsewhere.  While at it, we can get rid of an extra
TransactionIdToPage() calculation per loop.

Per report from David Binderman.  Back-patch to all supported branches,
since this code is quite old.

Discussion: https://postgr.es/m/HE1PR0802MB2331CBA919CBFFF0C465EB429C710@HE1PR0802MB2331.eurprd08.prod.outlook.com
2017-10-06 12:20:23 -04:00
Tom Lane fe9ba28ee8 Fix typo in README.
s/BeginInternalSubtransaction/BeginInternalSubTransaction/
2017-10-05 15:06:01 -04:00
Peter Eisentraut 5373bc2a08 Add background worker type
Add bgw_type field to background worker structure.  It is intended to be
set to the same value for all workers of the same type, so they can be
grouped in pg_stat_activity, for example.

The backend_type column in pg_stat_activity now shows bgw_type for a
background worker.  The ps listing also no longer calls out that a
process is a background worker but just show the bgw_type.  That way,
being a background worker is more of an implementation detail now that
is not shown to the user.  However, most log messages still refer to
'background worker "%s"'; otherwise constructing sensible and
translatable log messages would become tricky.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
2017-09-29 11:08:24 -04:00