Commit Graph

35913 Commits

Author SHA1 Message Date
Peter Eisentraut 53095b5fe6 Remove obsolete cygwin.h hack
The version being checked for is 20 years old.

Reviewed-by: Marco Atzeri <marco.atzeri@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aa266ede-baaa-f4e6-06cf-5b1737610e9a%402ndquadrant.com
2020-08-15 11:27:41 +02:00
Peter Eisentraut bacda6a327 Remove obsolete HAVE_BUGGY_SOLARIS_STRTOD
Fixed more than 10 years ago.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/flat/aa266ede-baaa-f4e6-06cf-5b1737610e9a%402ndquadrant.com
2020-08-15 11:27:41 +02:00
Amit Kapila b48cac3b10 Mark a few logical decoding related variables with PGDLLIMPORT.
Commit 7259736a6e added two variables CheckXidAlive and bsysscan to detect
concurrent aborts and used these in inline functions that are part of the
API that can be used by extensions. So it is better to mark them with
PGDLLIMPORT.

Reported-by: Thomas Munro
Discussion: https://postgr.es/m/CA+hUKGJ7+HYupd=Pz9+QrXa-C_YnbC4rAYu8V+=OKg=UgdzMeg@mail.gmail.com
2020-08-15 08:34:48 +05:30
Tom Lane 1e7629d2c9 Be more careful about the shape of hashable subplan clauses.
nodeSubplan.c expects that the testexpr for a hashable ANY SubPlan
has the form of one or more OpExprs whose LHS is an expression of the
outer query's, while the RHS is an expression over Params representing
output columns of the subquery.  However, the planner only went as far
as verifying that the clauses were all binary OpExprs.  This works
99.99% of the time, because the clauses have the right shape when
emitted by the parser --- but it's possible for function inlining to
break that, as reported by PegoraroF10.  To fix, teach the planner
to check that the LHS and RHS contain the right things, or more
accurately don't contain the wrong things.  Given that this has been
broken for years without anyone noticing, it seems sufficient to just
give up hashing when it happens, rather than go to the trouble of
commuting the clauses back again (which wouldn't necessarily work
anyway).

While poking at that, I also noticed that nodeSubplan.c had a baked-in
assumption that the number of hash clauses is identical to the number
of subquery output columns.  Again, that's fine as far as parser output
goes, but it's not hard to break it via function inlining.  There seems
little reason for that assumption though --- AFAICS, the only thing
it's buying us is not having to store the number of hash clauses
explicitly.  Adding code to the planner to reject such cases would take
more code than getting nodeSubplan.c to cope, so I fixed it that way.

This has been broken for as long as we've had hashable SubPlans,
so back-patch to all supported branches.

Discussion: https://postgr.es/m/1549209182255-0.post@n3.nabble.com
2020-08-14 22:14:03 -04:00
Andres Freund 73487a60fc snapshot scalability: Move subxact info to ProcGlobal, remove PGXACT.
Similar to the previous changes this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that considerably cheaper.

As this removes the last member of PGXACT, there is no need to keep it
around anymore.

On a larger 2 socket machine this and the two preceding commits result
in a ~1.07x performance increase in read-only pgbench. For read-heavy
mixed r/w workloads without row level contention, I see about 1.1x.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-14 15:33:35 -07:00
Andres Freund 5788e258bb snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags.
Similar to the previous commit this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-14 15:33:35 -07:00
Andres Freund 941697c3c1 snapshot scalability: Introduce dense array of in-progress xids.
The new array contains the xids for all connected backends / in-use
PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
arrays which can have unused entries interspersed).

This improves performance because GetSnapshotData() always needs to
scan the xids of all live procarray entries and now there's no need to
go through the procArray->pgprocnos indirection anymore.

As the set of running top-level xids changes rarely, compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot being in l2 cache.  In
read-mostly workloads scanning the xids[] array will sufficient to
build a snapshot, as most backends will not have an xid assigned.

To keep the xid array dense ProcArrayRemove() needs to move entries
behind the to-be-removed proc's one further up in the array. Obviously
moving array entries cannot happen while a backend sets it
xid. I.e. locking needs to prevent that array entries are moved while
a backend modifies its xid.

To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold
XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
entry is not a very frequent operation, even taking 2PC into account.

Due to the above, the dense array entries can only be read or modified
while holding ProcArrayLock and/or XidGenLock. This prevents a
concurrent ProcArrayRemove() from shifting the dense array while it is
accessed concurrently.

While the new dense array is very good when needing to look at all
xids it is less suitable when accessing a single backend's xid. In
particular it would be problematic to have to acquire a lock to access
a backend's own xid. Therefore a backend's xid is not just stored in
the dense array, but also in PGPROC. This also allows a backend to
only access the shared xid value when the backend had acquired an
xid.

The infrastructure added in this commit will be used for the remaining
PGXACT fields in subsequent commits. They are kept separate to make
review easier.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-14 15:33:35 -07:00
Alvaro Herrera 2ba5b2db79
pg_dump: fix dependencies on FKs to partitioned tables
Parallel-restoring a foreign key that references a partitioned table
with several levels of partitions can fail:

pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 6684; 2606 29166 FK CONSTRAINT fk fk_a_fkey postgres
pg_restore: error: could not execute query: ERROR:  there is no unique constraint matching given keys for referenced table "pk"
Command was: ALTER TABLE fkpart3.fk
    ADD CONSTRAINT fk_a_fkey FOREIGN KEY (a) REFERENCES fkpart3.pk(a);

This happens in parallel restore mode because some index partitions
aren't yet attached to the topmost partitioned index that the FK uses,
and so the index is still invalid.  The current code marks the FK as
dependent on the first level of index-attach dump objects; the bug is
fixed by recursively marking the FK on their children.

Backpatch to 12, where FKs to partitioned tables were introduced.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3170626.1594842723@sss.pgh.pa.us
Backpatch: 12-master
2020-08-14 17:33:31 -04:00
Peter Geoghegan 914140e85a Fix obsolete comment in xlogutils.c.
Oversight in commit 2c03216d83.
2020-08-14 11:09:08 -07:00
Tom Lane 0038f94387 Fix postmaster's behavior during smart shutdown.
Up to now, upon receipt of a SIGTERM ("smart shutdown" command), the
postmaster has immediately killed all "optional" background processes,
and subsequently refused to launch new ones while it's waiting for
foreground client processes to exit.  No doubt this seemed like an OK
policy at some point; but it's a pretty bad one now, because it makes
for a seriously degraded environment for the remaining clients:

* Parallel queries are killed, and new ones fail to launch. (And our
parallel-query infrastructure utterly fails to deal with the case
in a reasonable way --- it just hangs waiting for workers that are
not going to arrive.  There is more work needed in that area IMO.)

* Autovacuum ceases to function.  We can tolerate that for awhile,
but if bulk-update queries continue to run in the surviving client
sessions, there's eventually going to be a mess.  In the worst case
the system could reach a forced shutdown to prevent XID wraparound.

* The bgwriter and walwriter are also stopped immediately, likely
resulting in performance degradation.

Hence, let's rearrange things so that the only immediate change in
behavior is refusing to let in new normal connections.  Once the last
normal connection is gone, shut everything down as though we'd received
a "fast" shutdown.  To implement this, remove the PM_WAIT_BACKUP and
PM_WAIT_READONLY states, instead staying in PM_RUN or PM_HOT_STANDBY
while normal connections remain.  A subsidiary state variable tracks
whether or not we're letting in new connections in those states.

This also allows having just one copy of the logic for killing child
processes in smart and fast shutdown modes.  I moved that logic into
PostmasterStateMachine() by inventing a new state PM_STOP_BACKENDS.

Back-patch to 9.6 where parallel query was added.  In principle
this'd be a good idea in 9.5 as well, but the risk/reward ratio
is not as good there, since lack of autovacuum is not a problem
during typical uses of smart shutdown.

Per report from Bharath Rupireddy.

Patch by me, reviewed by Thomas Munro

Discussion: https://postgr.es/m/CALj2ACXAZ5vKxT9P7P89D87i3MDO9bfS+_bjMHgnWJs8uwUOOw@mail.gmail.com
2020-08-14 13:26:57 -04:00
Heikki Linnakangas 5bdf694568 Fix typo in test comment. 2020-08-14 10:40:50 +03:00
Andres Freund 1f51c17c68 snapshot scalability: Move PGXACT->xmin back to PGPROC.
Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT, as it is updated
considerably more frequently than the other PGXACT members.

After the changes in dc7420c2c9, this is a very straight-forward change.

For highly concurrent, snapshot acquisition heavy, workloads this change alone
can significantly increase scalability. E.g. plain pgbench on a smaller 2
socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench
when submitting queries in batches of 100, and 2.85x for batches of 100
'SELECT';.  The latter numbers are obviously not to be expected in the
real-world, but micro-benchmark the snapshot computation
scalability (previously spending ~80% of the time in GetSnapshotData()).

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-13 16:25:21 -07:00
Alvaro Herrera a811ea5bde
Handle new HOT chains in index-build table scans
When a table is scanned by heapam_index_build_range_scan (née
IndexBuildHeapScan) and the table lock being held allows concurrent data
changes, it is possible for new HOT chains to sprout in a page that were
unknown when the scan of a page happened.  This leads to an error such
as
  ERROR:  failed to find parent tuple for heap-only tuple at (X,Y) in table "tbl"
because the root tuple was not present when we first obtained the list
of the page's root tuples.  This can be fixed by re-obtaining the list
of root tuples, if we see that a heap-only tuple appears to point to a
non-existing root.

This was reported by Anastasia as occurring for BRIN summarization
(which exists since 9.5), but I think it could theoretically also happen
with CREATE INDEX CONCURRENTLY (much older) or REINDEX CONCURRENTLY
(very recent).  It seems a happy coincidence that BRIN forces us to
backpatch this all the way to 9.5.

Reported-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Diagnosed-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Co-authored-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/602d8487-f0b2-5486-0088-0f372b2549fa@postgrespro.ru
Backpatch: 9.5 - master
2020-08-13 17:33:49 -04:00
Andres Freund b8443eae72 Fix out-of-date version reference, grammar.
Time appears to be passing fast.

Reported-By: Peter Geoghegan <pg@bowt.ie>
2020-08-12 17:04:51 -07:00
Andres Freund dc7420c2c9 snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12 16:03:49 -07:00
Alvaro Herrera 1f42d35a1d
BRIN: Handle concurrent desummarization properly
If a page range is desummarized at just the right time concurrently with
an index walk, BRIN would raise an error indicating index corruption.
This is scary and unhelpful; silently returning that the page range is
not summarized is sufficient reaction.

This bug was introduced by commit 975ad4e602 as additional protection
against a bug whose actual fix was elsewhere.  Backpatch equally.

Reported-By: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Diagnosed-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/2588667e-d07d-7e10-74e2-7e1e46194491@postgrespro.ru
Backpatch: 9.5 - master
2020-08-12 15:33:36 -04:00
Tom Lane 3546cf8a7a Improve comments for postmaster.c's BackendList.
This had gotten a little disjointed over time, and some of the grammar
was sloppy.  Rewrite for more clarity.

In passing, re-pgindent some recently added comments.

No code changes.
2020-08-12 11:54:16 -04:00
Andres Freund 3bd7f9969a Track latest completed xid as a FullTransactionId.
The reason for doing so is that a subsequent commit will need that to
avoid wraparound issues. As the subsequent change is large this was
split out for easier review.

The reason this is not a perfect straight-forward change is that we do
not want track 64bit xids in the procarray or the WAL. Therefore we
need to advance lastestCompletedXid in relation to 32 bit xids. The
code for that is now centralized in MaintainLatestCompletedXid*.

Author: Andres Freund
Reviewed-By: Thomas Munro, Robert Haas, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-11 17:41:18 -07:00
Andres Freund fea10a6434 Rename VariableCacheData.nextFullXid to nextXid.
Including Full in variable names duplicates the type information and
leads to overly long names. As FullTransactionId cannot accidentally
be casted to TransactionId that does not seem necessary.

Author: Andres Freund
Discussion: https://postgr.es/m/20200724011143.jccsyvsvymuiqfxu@alap3.anarazel.de
2020-08-11 12:07:14 -07:00
Michael Paquier 1f75b45413 Improve tab completion of REINDEX in psql
This allows the tab completion of REINDEX to handle an optional
parenthesized list of options.  This case is more complicated than
VACUUM or ANALYZE because of CONCURRENTLY and the different object types
to consider with the reindex.

Author: Justin Pryzby
Reviewed-by: Alexey Kondratov, Michael Paquier
Discussion: https://postgr.es/m/20200403182712.GR14618@telsasoft.com
2020-08-11 14:37:38 +09:00
Peter Eisentraut 1784f278a6 Replace remaining StrNCpy() by strlcpy()
They are equivalent, except that StrNCpy() zero-fills the entire
destination buffer instead of providing just one trailing zero.  For
all but a tiny number of callers, that's just overhead rather than
being desirable.

Remove StrNCpy() as it is now unused.

In some cases, namestrcpy() is the more appropriate function to use.
While we're here, simplify the API of namestrcpy(): Remove the return
value, don't check for NULL input.  Nothing was using that anyway.
Also, remove a few unused name-related functions.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com
2020-08-10 23:20:37 +02:00
Noah Misch 11da97024a Empty search_path in logical replication apply worker and walsender.
This is like CVE-2018-1058 commit
582edc369c.  Today, a malicious user of a
publisher or subscriber database can invoke arbitrary SQL functions
under an identity running replication, often a superuser.  This fix may
cause "does not exist" or "no schema has been selected to create in"
errors in a replication process.  After upgrading, consider watching
server logs for these errors.  Objects accruing schema qualification in
the wake of the earlier commit are unlikely to need further correction.
Back-patch to v10, which introduced logical replication.

Security: CVE-2020-14349
2020-08-10 09:22:54 -07:00
Noah Misch e078fb5d4e Move connect.h from fe_utils to src/include/common.
Any libpq client can use the header.  Clients include backend components
postgres_fdw, dblink, and logical replication apply worker.  Back-patch
to v10, because another fix needs this.  In released branches, just copy
the header and keep the original.
2020-08-10 09:22:54 -07:00
Tom Lane 7eeb1d9861 Make contrib modules' installation scripts more secure.
Hostile objects located within the installation-time search_path could
capture references in an extension's installation or upgrade script.
If the extension is being installed with superuser privileges, this
opens the door to privilege escalation.  While such hazards have existed
all along, their urgency increases with the v13 "trusted extensions"
feature, because that lets a non-superuser control the installation path
for a superuser-privileged script.  Therefore, make a number of changes
to make such situations more secure:

* Tweak the construction of the installation-time search_path to ensure
that references to objects in pg_catalog can't be subverted; and
explicitly add pg_temp to the end of the path to prevent attacks using
temporary objects.

* Disable check_function_bodies within installation/upgrade scripts,
so that any security gaps in SQL-language or PL-language function bodies
cannot create a risk of unwanted installation-time code execution.

* Adjust lookup of type input/receive functions and join estimator
functions to complain if there are multiple candidate functions.  This
prevents capture of references to functions whose signature is not the
first one checked; and it's arguably more user-friendly anyway.

* Modify various contrib upgrade scripts to ensure that catalog
modification queries are executed with secure search paths.  (These
are in-place modifications with no extension version changes, since
it is the update process itself that is at issue, not the end result.)

Extensions that depend on other extensions cannot be made fully secure
by these methods alone; therefore, revert the "trusted" marking that
commit eb67623c9 applied to earthdistance and hstore_plperl, pending
some better solution to that set of issues.

Also add documentation around these issues, to help extension authors
write secure installation scripts.

Patch by me, following an observation by Andres Freund; thanks
to Noah Misch for review.

Security: CVE-2020-14350
2020-08-10 10:44:42 -04:00
Peter Geoghegan d129c07499 Correct nbtree page split lock coupling comment.
There is no reason to distinguish between readers and writers here.
2020-08-09 12:01:15 -07:00
Tom Lane 1b9cde5124 Check for fseeko() failure in pg_dump's _tarAddFile().
Coverity pointed out, not unreasonably, that we checked fseeko's
result at every other call site but these.  Failure to seek in the
temp file (note this is NOT pg_dump's output file) seems quite
unlikely, and even if it did happen the file length cross-check
further down would probably detect the problem.  Still, that's a
poor excuse for not checking the result of a system call.
2020-08-09 12:39:07 -04:00
Tom Lane 1c164ef3d2 Remove useless Assert.
Testing that an unsigned variable is >= 0 is pretty pointless,
as noted by Coverity and numerous buildfarm members.

In passing, add comment about new uses of "volatile" --- Coverity
doesn't much like that either, but it seems probably necessary.
2020-08-09 11:32:31 -04:00
Alvaro Herrera 470687b4a5
walsnd: Don't set waiting_for_ping_response spuriously
Ashutosh Bapat noticed that when logical walsender needs to wait for
WAL, and it realizes that it must send a keepalive message to
walreceiver to update the sent-LSN, which *does not* request a reply
from walreceiver, it wrongly sets the flag that it's going to wait for
that reply.  That means that any future would-be sender of feedback
messages ends up not sending a feedback message, because they all
believe that a reply is expected.

With built-in logical replication there's not much harm in this, because
WalReceiverMain will send a ping-back every wal_receiver_timeout/2
anyway; but with other logical replication systems (e.g. pglogical) it
can cause significant pain.

This problem was introduced in commit 41d5f8ad73, where the
request-reply flag was changed from true to false to WalSndKeepalive,
without at the same time removing the line that sets
waiting_for_ping_response.

Just removing that line would be a sufficient fix, but it seems better
to shift the responsibility of setting the flag to WalSndKeepalive
itself instead of requiring caller to do it; this is clearly less
error-prone.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
Backpatch: 9.5 and up
Discussion: https://postgr.es/m/20200806225558.GA22401@alvherre.pgsql
2020-08-08 12:31:55 -04:00
Peter Eisentraut a13421c96c Add some const decorations 2020-08-08 07:31:52 +02:00
Amit Kapila 7259736a6e Implement streaming mode in ReorderBuffer.
Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke stream API methods added by commit 45fdc9738b.
However, sometimes if we have incomplete toast or speculative insert we
spill to the disk because we can't generate the complete tuple and stream.
And, as soon as we get the complete tuple we stream the transaction
including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages at each command end. These
features are added by commits 0bead9af48 and c55040ccd0 respectively.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such a sqlerrcode aborts the decoding of the current
transaction and continue with the decoding of other transactions.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-08-08 07:47:06 +05:30
Peter Geoghegan 0a7d771f0f Make nbtree split REDO locking match original execution.
Make the nbtree page split REDO routine consistent with original
execution in its approach to acquiring and releasing buffer locks (at
least for pages on the tree level of the page being split).  This brings
btree_xlog_split() in line with btree_xlog_unlink_page(), which was
taught to couple buffer locks by commit 9a9db08a.

Note that the precise order in which we both acquire and release sibling
buffer locks in btree_xlog_split() now matches original execution
exactly (the precise order in which the locks are released probably
doesn't matter much, but we might as well be consistent about it).

The rule for nbtree REDO routines from here on is that same-level locks
should be acquired in an order that's consistent with original
execution.  It's not practical to have a similar rule for cross-level
page locks, since for the most part original execution holds those locks
for a period that spans multiple atomic actions/WAL records.  It's also
not necessary, because clearly the cross-level lock coupling is only
truly needed during original execution because of the presence of
concurrent inserters.

This is not a bug fix (unlike the similar aforementioned commit, commit
9a9db08a).  The immediate reason to tighten things up in this area is to
enable an upcoming enhancement to contrib/amcheck that allows it to
verify that sibling links are in agreement with only an AccessShareLock
(this check produced false positives when run on a replica server on
account of the inconsistency fixed by this commit).  But that's not the
only reason to be stricter here.

It is generally useful to make locking on replicas be as close to what
happens during original execution as practically possible.  It makes it
less likely that hard to catch bugs will slip in in the future.  The
previous state of affairs seems to be a holdover from before the
introduction of Hot Standby, when buffer lock acquisitions during
recovery were totally unnecessary.  See also: commit 3bbf668d, which
tightened things up in this area a few years after the introduction of
Hot Standby.

Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com
2020-08-07 15:27:56 -07:00
Alvaro Herrera cea3d55898
Remove PROC_IN_ANALYZE and derived flags
These flags are unused and always have been.

Discussion: https://postgr.es/m/20200805235549.GA8118@alvherre.pgsql
2020-08-07 17:24:40 -04:00
Tom Lane 6f0b632f08 Support testing of cases where table schemas change after planning.
We have various cases where we allow DDL on tables to be performed with
less than full AccessExclusiveLock.  This requires concurrent queries
to be able to cope with the DDL change mid-flight, but up to now we had
no repeatable way to test such cases.  To improve that, invent a test
module that allows halting a backend after planning and then resuming
execution once we've done desired actions in another session.  (The same
approach could be used to inject delays in other places, if there's a
suitable hook available.)

This commit includes a single test case, which is meant to exercise the
previously-untestable ExecCreatePartitionPruneState code repaired by
commit 7a980dfc6.  We'd probably not bother with this if that were the
only foreseen benefit, but I expect additional test cases will use this
infrastructure in the future.

Test module by Andy Fan, partition-addition test case by me.

Discussion: https://postgr.es/m/20200802181131.GA27754@telsasoft.com
2020-08-07 14:30:47 -04:00
Peter Geoghegan 3df92bbd1d Rename nbtree split REDO routine variables.
Make the nbtree page split REDO routine variable names consistent with
_bt_split() (which handles the original execution of page splits).
These names make the code easier to follow by making the distinction
between the original page and the left half of the split clear.  (The
left half of the split page is a temp page that REDO creates to replace
the origpage contents.)

Also reduce the elevel used when adding a new high key to the temp page
from PANIC to ERROR to be consistent.  We already only raise an ERROR
when data item PageAddItem() temp page calls fail.
2020-08-07 09:53:27 -07:00
Etsuro Fujita 199cec9779 Fix yet another issue with step generation in partition pruning.
Commit 13838740f fixed some issues with step generation in partition
pruning, but there was yet another one: get_steps_using_prefix() assumes
that clauses in the passed-in prefix list are sorted in ascending order
of their partition key numbers, but the caller failed to ensure this for
range partitioning, which led to an assertion failure in debug builds.
Adjust the caller function to arrange the clauses in the prefix list in
the required order for range partitioning.

Back-patch to v11, like the previous commit.

Patch by me, reviewed by Amit Langote.

Discussion: https://postgr.es/m/CAPmGK16jkXiFG0YqMbU66wte-oJTfW6D1HaNvQf%3D%2B5o9%3Dm55wQ%40mail.gmail.com
2020-08-07 14:45:00 +09:00
David Rowley d5e96520ff Fix bogus EXPLAIN output for Hash Aggregate
9bdb300de modified the EXPLAIN output for Hash Aggregate to show details
from parallel workers. However, it neglected to consider that a given
parallel worker may not have assisted with the given Hash Aggregate. This
can occur when workers fail to start or during Parallel Append with
enable_partitionwise_join enabled when only a single worker is working on
a non-parallel aware sub-plan. It could also happen if a worker simply
wasn't fast enough to get any work done before other processes went and
finished all the work.

The bogus output came from the fact that ExplainOpenWorker() skipped
showing any details for non-initialized workers but show_hashagg_info()
did show details from the worker.  This meant that the worker properties
that were shown were not properly attributed to the worker that they
belong to.

In passing, we also now don't show Hash Aggregate properties for the
leader process when it did not contribute any work to the Hash Aggregate.
This can occur either during Parallel Append when only a parallel worker
worked on a given sub plan or with parallel_leader_participation set to
off.  This aims to make the behavior of Hash Aggregate's EXPLAIN output
more similar to Sort's.

Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/20200805012105.GZ28072%40telsasoft.com
Backpatch-through: 13, where the original breakage was introduced
2020-08-07 10:22:18 +12:00
Robert Haas bab150045b Register llvm_shutdown using on_proc_exit, not before_shmem_exit.
This seems more correct, because other before_shmem_exit calls may
expect the infrastructure that is needed to run queries and access the
database to be working, and also because this cleanup has nothing to
do with shared memory.

There are no known user-visible consequences to this, though, apart
from what was previous fixed by commit
303640199d and back-patched as commit
bcbc27251d and commit
f7013683d9, so for now, no back-patch.

Bharath Rupireddy

Discussion: http://postgr.es/m/CALj2ACWk7j4F2v2fxxYfrroOF=AdFNPr1WsV+AGtHAFQOqm_pw@mail.gmail.com
2020-08-06 14:13:03 -04:00
Tom Lane 7a980dfc6c Fix matching of sub-partitions when a partitioned plan is stale.
Since we no longer require AccessExclusiveLock to add a partition,
the executor may see that a partitioned table has more partitions
than the planner saw.  ExecCreatePartitionPruneState's code for
matching up the partition lists in such cases was faulty, and would
misbehave if the planner had successfully pruned any partitions from
the query.  (Thus, trouble would occur only if a partition addition
happens concurrently with a query that uses both static and dynamic
partition pruning.)  This led to an Assert failure in debug builds,
and probably to crashes or query misbehavior in production builds.

To repair the bug, just explicitly skip zeroes in the plan's
relid_map[] list.  I also made some cosmetic changes to make the code
more readable (IMO anyway).  Also, convert the cross-checking Assert
to a regular test-and-elog, since it's now apparent that this logic
is more fragile than one would like.

Currently, there's no way to repeatably exercise this code, except
with manual use of a debugger to stop the backend between planning
and execution.  Hence, no test case in this patch.  We oughta do
something about that testability gap, but that's for another day.

Amit Langote and Tom Lane, per report from Justin Pryzby.  Oversight
in commit 898e5e329; backpatch to v12 where that appeared.

Discussion: https://postgr.es/m/20200802181131.GA27754@telsasoft.com
2020-08-05 15:38:55 -04:00
Alexander Korotkov f47b5e1395 Remove btree page items after page unlink
Currently, page unlink leaves remaining items "as is", but replay of
corresponding WAL-record re-initializes page leaving it with no items.
For the sake of consistency, this commit makes primary delete all the items
during page unlink as well.

Thanks to this change, we now don't mask contents of deleted btree page for
WAL consistency checking.

Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
2020-08-05 02:16:13 +03:00
Tom Lane 0f76294260 Increase hard-wired timeout values in ecpg regression tests.
A couple of test cases had connect_timeout=14, a value that seems
to have been plucked from a hat.  While it's more than sufficient
for normal cases, slow/overloaded buildfarm machines can get a timeout
failure here, as per recent report from "sungazer".  Increase to 180
seconds, which is in line with our typical timeouts elsewhere in
the regression tests.

Back-patch to 9.6; the code looks different in 9.5, and this doesn't
seem to be quite worth the effort to adapt to that.

Report: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-08-04%2007%3A12%3A22
2020-08-04 15:20:31 -04:00
Michael Paquier dd877998d4 Make new SSL TAP test for channel_binding more robust
The test would fail in an environment including a certificate file in
~/.postgresql/.  bdd6e9b fixed a similar failure, and d6e612f introduced
the same problem again with a new test.

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20200804.120033.31225582282178001.horikyota.ntt@gmail.com
Backpatch-through: 13
2020-08-04 14:36:01 +09:00
Peter Geoghegan 9a9db08ae4 Fix replica backward scan race condition.
It was possible for the logic used by backward scans (which must reason
about concurrent page splits/deletions in its own peculiar way) to
become confused when running on a replica.  Concurrent replay of a WAL
record that describes the second phase of page deletion could cause
_bt_walk_left() to get confused.  btree_xlog_unlink_page() simply failed
to adhere to the same locking protocol that we use on the primary, which
is obviously wrong once you consider these two disparate functions
together.  This bug is present in all stable branches.

More concretely, the problem was that nothing stopped _bt_walk_left()
from observing inconsistencies between the deletion's target page and
its original sibling pages when running on a replica.  This is true even
though the second phase of page deletion is supposed to work as a single
atomic action.  Queries running on replicas raised "could not find left
sibling of block %u in index %s" can't-happen errors when they went back
to their scan's "original" page and observed that the page has not been
marked deleted (even though it really was concurrently deleted).

There is no evidence that this actually happened in the real world.  The
issue came to light during unrelated feature development work.  Note
that _bt_walk_left() is the only code that cares about the difference
between a half-dead page and a fully deleted page that isn't also
exclusively used by nbtree VACUUM (unless you include contrib/amcheck
code).  It seems very likely that backward scans are the only thing that
could become confused by the inconsistency.  Even amcheck's complex
bt_right_page_check_scankey() dance was unaffected.

To fix, teach btree_xlog_unlink_page() to lock the left sibling, target,
and right sibling pages in that order before releasing any locks (just
like _bt_unlink_halfdead_page()).  This is the simplest possible
approach.  There doesn't seem to be any opportunity to be more clever
about lock acquisition in the REDO routine, and it hardly seems worth
the trouble in any case.

This fix might enable contrib/amcheck verification of leaf page sibling
links with only an AccessShareLock on the relation.  An amcheck patch
from Andrey Borodin was rejected back in January because it clashed with
btree_xlog_unlink_page()'s lax approach to locking pages.  It now seems
likely that the real problem was with btree_xlog_unlink_page(), not the
patch.

This is a low severity, low likelihood bug, so no backpatch.

Author: Michail Nikolaev
Diagnosed-By: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
2020-08-03 15:54:38 -07:00
Peter Geoghegan a451b7d442 Add nbtree page deletion assertion.
Add a documenting assertion that's similar to the nearby assertion added
by commit cd8c73a3.  This conveys that the entire call to _bt_pagedel()
does no work if it isn't possible to get a descent stack for the initial
scanblkno page.
2020-08-03 13:04:42 -07:00
Tom Lane 9e496768b8 Remove unnecessary "DISTINCT" in psql's queries for \dAc and \dAf.
A moment's examination of these queries is sufficient to see that
they do not produce duplicate rows, unless perhaps there's
catalog corruption.  Using DISTINCT anyway is inefficient and
confusing; moreover it sets a poor example for anyone who
refers to psql -E output to see how to query the catalogs.
2020-08-03 14:02:35 -04:00
Tom Lane 5f28b21eb3 Fix behavior of ecpg's "EXEC SQL elif name".
This ought to work much like C's "#elif defined(name)"; but the code
implemented it in a way equivalent to endif followed by ifdef, so that
it didn't matter whether any previous branch of the IF construct had
succeeded.  Fix that; add some test cases covering elif and nested IFs;
and improve the documentation, which also seemed a bit confused.

AFAICS the code has been like this since the feature was added in 1999
(commit b57b0e044).  So while it's surely wrong, there might be code
out there relying on the current behavior.  Hence, don't back-patch
into stable branches.  It seems all right to fix it in v13 though.

Per report from Ashutosh Sharma.  Reviewed by Ashutosh Sharma and
Michael Meskes.

Discussion: https://postgr.es/m/CAE9k0P=dQk9X0cU2tN49S7a9tv733-e1pVdpB1P-pWJ5PdTktg@mail.gmail.com
2020-08-03 09:46:12 -04:00
Michael Paquier b8fdee7d0c Add %P to log_line_prefix for parallel group leader
This is useful for monitoring purposes with log parsing.  Similarly to
pg_stat_activity, the leader's PID is shown only for active parallel
workers, minimizing the log footprint for the leaders as the equivalent
shared memory field is set as long as a backend is alive.

Author: Justin Pryzby
Reviewed-by: Álvaro Herrera, Michael Paquier, Julien Rouhaud, Tom Lane
Discussion: https://postgr.es/m/20200315111831.GA21492@telsasoft.com
2020-08-03 13:38:48 +09:00
Thomas Munro f44b9b625b Fix rare failure in LDAP tests.
Instead of writing a query to psql's stdin, use -c.  This avoids a
failure where psql exits before we write, seen a few times on the build
farm.  Thanks to Tom Lane for the suggestion.

Back-patch to 11, where the LDAP tests arrived.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/CA%2BhUKGLFmW%2BHQYPeKiwSp5sdFFHtFViCpw4Mh6yAgEx74r5-Cw%40mail.gmail.com
2020-08-03 12:49:36 +12:00
Thomas Munro 63e9aa6879 Correct comment in simplehash.h.
Post-commit review for commit 84c0e4b9.

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvptBx_%2BUPAzY0uXzopbvPVGKPeZ6Hoy8rnPcWz20Cr0Bw%40mail.gmail.com
2020-08-03 12:23:05 +12:00
Tom Lane 533020d050 Fix minor issues in psql's new \dAc and related commands.
The type-name pattern in \dAc and \dAf was matched only to the actual
pg_type.typname string, which is fairly user-unfriendly in cases where
that is not what's shown to the user by format_type (compare "_int4"
and "integer[]").  Make this code match what \dT does, i.e. match the
pattern against either typname or format_type() output.  Also fix its
broken handling of schema-name restrictions.  (IOW, make these
processSQLNamePattern calls match \dT's.)  While here, adjust
whitespace to make the query a little prettier in -E output, too.

Also improve some inaccuracies and shaky grammar in the related
documentation.

Noted while working on a patch for intarray's opclasses; I wondered
why I couldn't get a match to "integer*" for the input type name.
2020-08-02 17:00:26 -04:00
David Rowley 6ee3b5fb99 Use int64 instead of long in incremental sort code
Windows 64bit has 4-byte long values which is not suitable for tracking
disk space usage in the incremental sort code. Let's just make all these
fields int64s.

Author: James Coleman
Discussion: https://postgr.es/m/CAApHDvpky%2BUhof8mryPf5i%3D6e6fib2dxHqBrhp0Qhu0NeBhLJw%40mail.gmail.com
Backpatch-through: 13, where the incremental sort code was added
2020-08-02 14:24:46 +12:00
Noah Misch cd5e82256d Change XID and mxact limits to warn at 40M and stop at 3M.
We have edge-case bugs when assigning values in the last few dozen pages
before the wrap limit.  We may introduce similar bugs in the future.  At
default BLCKSZ, this makes such bugs unreachable outside of single-user
mode.  Also, when VACUUM began to consume mxacts, multiStopLimit did not
change to compensate.

pg_upgrade may fail on a cluster that was already printing "must be
vacuumed" warnings.  Follow the warning's instructions to clear the
warning, then run pg_upgrade again.  One can still, peacefully consume
98% of XIDs or mxacts, so DBAs need not change routine VACUUM settings.

Discussion: https://postgr.es/m/20200621083513.GA3074645@rfd.leadboat.com
2020-08-01 15:31:01 -07:00
Tom Lane 9f9682783b Invent "amadjustmembers" AM method for validating opclass members.
This allows AM-specific knowledge to be applied during creation of
pg_amop and pg_amproc entries.  Specifically, the AM knows better than
core code which entries to consider as required or optional.  Giving
the latter entries the appropriate sort of dependency allows them to
be dropped without taking out the whole opclass or opfamily; which
is something we'd like to have to correct obsolescent entries in
extensions.

This callback also opens the door to performing AM-specific validity
checks during opclass creation, rather than hoping than an opclass
developer will remember to test with "amvalidate".  For the most part
I've not actually added any such checks yet; that can happen in a
follow-on patch.  (Note that we shouldn't remove any tests from
"amvalidate", as those are still needed to cross-check manually
constructed entries in the initdb data.  So adding tests to
"amadjustmembers" will be somewhat duplicative, but it seems like
a good idea anyway.)

Patch by me, reviewed by Alexander Korotkov, Hamid Akhtar, and
Anastasia Lubennikova.

Discussion: https://postgr.es/m/4578.1565195302@sss.pgh.pa.us
2020-08-01 17:12:47 -04:00
Thomas Munro e2b37d9e7c Use pg_pread() and pg_pwrite() in slru.c.
This avoids lseek() system calls at every SLRU I/O, as was
done for relation files in commit c24dcd0c.

Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2Biqke4uTRFj8D8uEUUgj%2BRokPSp%2BCWM6YYzaaamG9Wvg%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com
2020-08-02 00:23:35 +12:00
Michael Paquier 022350b849 Minimize slot creation for multi-inserts of pg_shdepend
When doing multiple insertions in pg_shdepend for the copy of
dependencies from a template database in CREATE DATABASE, the same
number of slots would have been created and used all the time.  As the
number of items to insert is not known in advance, this makes most of
the slots created for nothing.  This improves the slot handling so as
slot creation only happens when needed, minimizing the overhead of the
operation.

Author: Michael Paquier
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20200731024148.GB3317@paquier.xyz
2020-08-01 11:49:13 +09:00
Thomas Munro 84c0e4b9bc Improve programmer docs for simplehash and dynahash.
When reading the code it's not obvious when one should prefer dynahash
over simplehash and vice-versa, so, for programmer-friendliness, add
comments to inform that decision.

Show sample simplehash method signatures.

Author: James Coleman <jtc331@gmail.com>
Discussion: https://postgr.es/m/CAAaqYe_dOF39gAJ8rL-a3YO3Qo96MHMRQ2whFjK5ZcU6YvMQSA%40mail.gmail.com
2020-08-01 12:16:15 +12:00
Tom Lane 3d2376d55c Fix oversight in ALTER TYPE: typmodin/typmodout must propagate to arrays.
If a base type supports typmods, its array type does too, with the
same interpretation.  Hence changes in pg_type.typmodin/typmodout
must be propagated to the array type.

While here, improve AlterTypeRecurse to not recurse to domains if
there is nothing we'd need to change.

Oversight in fe30e7ebf.  Back-patch to v13 where that came in.
2020-07-31 17:11:28 -04:00
Tom Lane 78e73e8754 Fix recently-introduced performance problem in ts_headline().
The new hlCover() algorithm that I introduced in commit c9b0c678d
turns out to potentially take O(N^2) or worse time on long documents,
if there are many occurrences of individual query words but few or no
substrings that actually satisfy the query.  (One way to hit this
behavior is with a "common_word & rare_word" type of query.)  This
seems unavoidable given the original goal of checking every substring
of the document, so we have to back off that idea.  Fortunately, it
seems unlikely that anyone would really want headlines spanning all of
a long document, so we can avoid the worse-than-linear behavior by
imposing a maximum length of substring that we'll consider.

For now, just hard-wire that maximum length as a multiple of max_words
times max_fragments.  Perhaps at some point somebody will argue for
exposing it as a ts_headline parameter, but I'm hesitant to make such
a feature addition in a back-patched bug fix.

I also noted that the hlFirstIndex() function I'd added in that
commit was unnecessarily stupid: it really only needs to check whether
a HeadlineWordEntry's item pointer is null or not.  This wouldn't make
all that much difference in typical cases with queries having just
a few terms, but a cycle shaved is a cycle earned.

In addition, add a CHECK_FOR_INTERRUPTS call in TS_execute_recurse.
This ensures that hlCover's loop is cancellable if it manages to take
a long time, and it may protect some other TS_execute callers as well.

Back-patch to 9.6 as the previous commit was.  I also chose to add the
CHECK_FOR_INTERRUPTS call to 9.5.  The old hlCover() algorithm seems
to avoid the O(N^2) behavior, at least on the test case I tried, but
nonetheless it's not very quick on a long document.

Per report from Stephen Frost.

Discussion: https://postgr.es/m/20200724160535.GW12375@tamriel.snowman.net
2020-07-31 11:43:12 -04:00
Thomas Munro 7be04496a9 Fix compiler warning from Clang.
Per build farm.

Discussion: https://postgr.es/m/20200731062626.GD3317%40paquier.xyz
2020-07-31 19:08:09 +12:00
Thomas Munro 84b1c63ad4 Preallocate some DSM space at startup.
Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.  The size is
controlled by a new GUC min_dynamic_shared_memory, defaulting to 0.

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.

Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
2020-07-31 17:49:58 +12:00
Michael Paquier 7b1110d2fd Fix comment in instrument.h
local_blks_dirtied tracks the number of local blocks dirtied, not shared
ones.

Author: Kirk Jamison
Discussion: https://postgr.es/m/OSBPR01MB2341760686DC056DE89D2AB9EF710@OSBPR01MB2341.jpnprd01.prod.outlook.com
2020-07-31 14:17:29 +09:00
Thomas Munro c5315f4f44 Cache smgrnblocks() results in recovery.
Avoid repeatedly calling lseek(SEEK_END) during recovery by caching
the size of each fork.  For now, we can't use the same technique in
other processes, because we lack a shared invalidation mechanism.

Do this by generalizing the pre-existing caching used by FSM and VM
to support all forks.

Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
2020-07-31 14:29:52 +12:00
Michael Paquier e3931d01f3 Use multi-inserts for pg_attribute and pg_shdepend
For pg_attribute, this allows to insert at once a full set of attributes
for a relation (roughly 15% of WAL reduction in extreme cases).  For
pg_shdepend, this reduces the work done when creating new shared
dependencies from a database template.  The number of slots used for the
insertion is capped at 64kB of data inserted for both, depending on the
number of items to insert and the length of the rows involved.

More can be done for other catalogs, like pg_depend.  This part requires
a different approach as the number of slots to use depends also on the
number of entries discarded as pinned dependencies.  This is also
related to the rework or dependency handling for ALTER TABLE and CREATE
TABLE, mainly.

Author: Daniel Gustafsson
Reviewed-by: Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de
2020-07-31 10:54:26 +09:00
Jeff Davis fd734f387d Use pg_bitutils for HyperLogLog.
Using pg_leftmost_one_post32() yields substantial performance benefits.

Backpatching to version 13 because HLL is used for HashAgg
improvements in 9878b643, which was also backpatched to 13.

Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-WzkGvDKVDo+0YvfvZ+1CE=iCi88DCOGFF3i1hTGGaxcKPw@mail.gmail.com
Backpatch-through: 13
2020-07-30 09:14:23 -07:00
Michael Paquier f1af75c5f2 Include partitioned tables for tab completion of VACUUM in psql
The relkinds that support indexing are the same as the ones supporting
VACUUM, so the code gets refactored a bit with the completion query used
for CLUSTER, but there is no change for CLUSTER in this commit.

Author: Justin Pryzby
Reviewed-by: Fujii Masao, Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/20200728170408.GI20393@telsasoft.com
2020-07-30 16:57:37 +09:00
Thomas Munro e7591fd3ca Introduce a WaitEventSet for the stats collector.
This avoids avoids some epoll/kqueue system calls for every wait.

Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
2020-07-30 17:44:28 +12:00
Thomas Munro e2d394df5d Use WaitLatch() for condition variables.
Previously, condition_variable.c created a long lived WaitEventSet to
avoid extra system calls.  WaitLatch() now uses something similar
internally, so there is no point in wasting an extra kernel descriptor.

Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
2020-07-30 17:42:45 +12:00
Thomas Munro 3347c982ba Use a long lived WaitEventSet for WaitLatch().
Create LatchWaitSet at backend startup time, and use it to implement
WaitLatch().  This avoids repeated epoll/kqueue setup and teardown
system calls.

Reorder SubPostmasterMain() slightly so that we restore the postmaster
pipe and Windows signal emulation before we reach InitPostmasterChild(),
to make this work in EXEC_BACKEND builds.

Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
2020-07-30 17:40:00 +12:00
Peter Geoghegan d6c08e29e7 Add hash_mem_multiplier GUC.
Add a GUC that acts as a multiplier on work_mem.  It gets applied when
sizing executor node hash tables that were previously size constrained
using work_mem alone.

The new GUC can be used to preferentially give hash-based nodes more
memory than the generic work_mem limit.  It is intended to enable admin
tuning of the executor's memory usage.  Overall system throughput and
system responsiveness can be improved by giving hash-based executor
nodes more memory (especially over sort-based alternatives, which are
often much less sensitive to being memory constrained).

The default value for hash_mem_multiplier is 1.0, which is also the
minimum valid value.  This means that hash-based nodes continue to apply
work_mem in the traditional way by default.

hash_mem_multiplier is generally useful.  However, it is being added now
due to concerns about hash aggregate performance stability for users
that upgrade to Postgres 13 (which added disk-based hash aggregation in
commit 1f39bce0).  While the old hash aggregate behavior risked
out-of-memory errors, it is nevertheless likely that many users actually
benefited.  Hash agg's previous indifference to work_mem during query
execution was not just faster; it also accidentally made aggregation
resilient to grouping estimate problems (at least in cases where this
didn't create destabilizing memory pressure).

hash_mem_multiplier can provide a certain kind of continuity with the
behavior of Postgres 12 hash aggregates in cases where the planner
incorrectly estimates that all groups (plus related allocations) will
fit in work_mem/hash_mem.  This seems necessary because hash-based
aggregation is usually much slower when only a small fraction of all
groups can fit.  Even when it isn't possible to totally avoid hash
aggregates that spill, giving hash aggregation more memory will reliably
improve performance (the same cannot be said for external sort
operations, which appear to be almost unaffected by memory availability
provided it's at least possible to get a single merge pass).

The PostgreSQL 13 release notes should advise users that increasing
hash_mem_multiplier can help with performance regressions associated
with hash aggregation.  That can be taken care of by a later commit.

Author: Peter Geoghegan
Reviewed-By: Álvaro Herrera, Jeff Davis
Discussion: https://postgr.es/m/20200625203629.7m6yvut7eqblgmfo@alap3.anarazel.de
Discussion: https://postgr.es/m/CAH2-WzmD%2Bi1pG6rc1%2BCjc4V6EaFJ_qSuKCCHVnH%3DoruqD-zqow%40mail.gmail.com
Backpatch: 13-, where disk-based hash aggregation was introduced.
2020-07-29 14:14:58 -07:00
Fujii Masao b5310e4ff6 Remove non-fast promotion.
When fast promotion was supported in 9.3, non-fast promotion became
undocumented feature and it's basically not available for ordinary users.
However we decided not to remove non-fast promotion at that moment,
to leave it for a release or two for debugging purpose or as an emergency
method because fast promotion might have some issues, and then to
remove it later. Now, several versions were released since that decision
and there is no longer reason to keep supporting non-fast promotion.
Therefore this commit removes non-fast promotion.

Author: Fujii Masao
Reviewed-by: Hamid Akhtar, Kyotaro Horiguchi
Discussion: https://postgr.es/m/76066434-648f-f567-437b-54853b43398f@oss.nttdata.com
2020-07-29 21:24:26 +09:00
Jeff Davis 9878b643f3 HashAgg: use better cardinality estimate for recursive spilling.
Use HyperLogLog to estimate the group cardinality in a spilled
partition. This estimate is used to choose the number of partitions if
we recurse.

The previous behavior was to use the number of tuples in a spilled
partition as the estimate for the number of groups, which lead to
overpartitioning. That could cause the number of batches to be much
higher than expected (with each batch being very small), which made it
harder to interpret EXPLAIN ANALYZE results.

Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/a856635f9284bc36f7a77d02f47bbb6aaf7b59b3.camel@j-davis.com
Backpatch-through: 13
2020-07-28 23:16:28 -07:00
Michael Paquier f2130e77da Fix incorrect print format in json.c
Oid is unsigned, so %u needs to be used and not %d.  The code path
involved here is not normally reachable, so no backpatch is done.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20200728015523.GA27308@telsasoft.com
2020-07-29 14:44:32 +09:00
Thomas Munro cb04ad4985 Move syncscan.c to src/backend/access/common.
Since the tableam.c code needs to make use of the syncscan.c routines
itself, and since other block-oriented AMs might also want to use it one
day, it didn't make sense for it to live under src/backend/access/heap.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGLCnG%3DNEAByg6bk%2BCT9JZD97Y%3DAxKhh27Su9FeGWOKvDg%40mail.gmail.com
2020-07-29 16:59:33 +12:00
Peter Geoghegan c49c74d192 Rename another "hash_mem" local variable.
Missed by my commit 564ce621.

Backpatch: 13-, where disk-based hash aggregation was introduced.
2020-07-28 17:59:16 -07:00
Peter Geoghegan b1d79127ed Correct obsolete UNION hash aggs comment.
Oversight in commit 1f39bce0, which added disk-based hash aggregation.

Backpatch: 13-, where disk-based hash aggregation was introduced.
2020-07-28 17:14:07 -07:00
David Rowley 0e3e1c4e1c Make EXPLAIN ANALYZE of HashAgg more similar to Hash Join
There were various unnecessary differences between Hash Agg's EXPLAIN
ANALYZE output and Hash Join's.  Here we modify the Hash Agg output so
that it's better aligned to Hash Join's.

The following changes have been made:
1. Start batches counter at 1 instead of 0.
2. Always display the "Batches" property, even when we didn't spill to
   disk.
3. Use the text "Batches" instead of "HashAgg Batches" for text format.
4. Use the text "Memory Usage" instead of "Peak Memory Usage" for text
   format.
5. Include "Batches" before "Memory Usage" in both text and non-text
   formats.

In passing also modify the "Planned Partitions" property so that we show
it regardless of if the value is 0 or not for non-text EXPLAIN formats.
This was pointed out by Justin Pryzby and probably should have been part
of 40efbf870.

Reviewed-by: Justin Pryzby, Jeff Davis
Discussion: https://postgr.es/m/CAApHDvrshRnA6C0VFnu7Fb9TVvgGo80PUMm5+2DiaS1gEkPvtw@mail.gmail.com
Backpatch-through: 13, where HashAgg batching was introduced
2020-07-29 11:42:21 +12:00
Amit Kapila 45fdc9738b Extend the logical decoding output plugin API with stream methods.
This adds seven methods to the output plugin API, adding support for
streaming changes of large in-progress transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/start_stop are used to demarcate a chunk of changes
streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to "allow
the streaming mode in ReorderBuffer" will use these APIs.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-07-28 08:09:44 +05:30
Etsuro Fujita 13838740f6 Fix some issues with step generation in partition pruning.
In the case of range partitioning, get_steps_using_prefix() assumes that
the passed-in prefix list contains at least one clause for each of the
partition keys earlier than one specified in the passed-in
step_lastkeyno, but the caller (ie, gen_prune_steps_from_opexps())
didn't take it into account, which led to a server crash or incorrect
results when the list contained no clauses for such partition keys, as
reported in bug #16500 and #16501 from Kobayashi Hisanori.  Update the
caller to call that function only when the list created there contains
at least one clause for each of the earlier partition keys in the case
of range partitioning.

While at it, fix some other issues:

* The list to pass to get_steps_using_prefix() is allowed to contain
  multiple clauses for the same partition key, as described in the
  comment for that function, but that function actually assumed that the
  list contained just a single clause for each of middle partition keys,
  which led to an assertion failure when the list contained multiple
  clauses for such partition keys.  Update that function to match the
  comment.
* In the case of hash partitioning, partition keys are allowed to be
  NULL, in which case the list to pass to get_steps_using_prefix()
  contains no clauses for NULL partition keys, but that function treats
  that case as like the case of range partitioning, which led to the
  assertion failure.  Update the assertion test to take into account
  NULL partition keys in the case of hash partitioning.
* Fix a typo in a comment in get_steps_using_prefix_recurse().
* gen_partprune_steps() failed to detect self-contradiction from
  strict-qual clauses and an IS NULL clause for the same partition key
  in some cases, producing incorrect partition-pruning steps, which led
  to incorrect results of partition pruning, but didn't cause any
  user-visible problems fortunately, as the self-contradiction is
  detected later in the query planning.  Update that function to detect
  the self-contradiction.

Per bug #16500 and #16501 from Kobayashi Hisanori.  Patch by me, initial
diagnosis for the reported issue and review by Dmitry Dolgov.
Back-patch to v11, where partition pruning was introduced.

Discussion: https://postgr.es/m/16500-d1613f2a78e1e090%40postgresql.org
Discussion: https://postgr.es/m/16501-5234a9a0394f6754%40postgresql.org
2020-07-28 11:00:00 +09:00
Peter Geoghegan bcbf9446a2 Remove hashagg_avoid_disk_plan GUC.
Note: This GUC was originally named enable_hashagg_disk when it appeared
in commit 1f39bce0, which added disk-based hash aggregation.  It was
subsequently renamed in commit 92c58fd9.

Author: Peter Geoghegan
Reviewed-By: Jeff Davis, Álvaro Herrera
Discussion: https://postgr.es/m/9d9d1e1252a52ea1bad84ea40dbebfd54e672a0f.camel%40j-davis.com
Backpatch: 13-, where disk-based hash aggregation was introduced.
2020-07-27 17:53:19 -07:00
Michael Paquier e971357961 Fix handling of structure for bytea data type in ECPG
Some code paths dedicated to bytea used the structure for varchar.  This
did not lead to any actual bugs, as bytea and varchar have the same
definition, but it could become a trap if one of these definitions
changes for a new feature or a bug fix.

Issue introduced by 050710b.

Author: Shenhao Wang
Reviewed-by: Vignesh C, Michael Paquier
Discussion: https://postgr.es/m/07ac7dee1efc44f99d7f53a074420177@G08CNEXMBPEKD06.g08.fujitsu.local
Backpatch-through: 12
2020-07-27 10:28:06 +09:00
Jeff Davis 200f6100a9 Fix LookupTupleHashEntryHash() pipeline-stall issue.
Refactor hash lookups in nodeAgg.c to improve performance.

Author: Andres Freund and Jeff Davis
Discussion: https://postgr.es/m/20200612213715.op4ye4q7gktqvpuo%40alap3.anarazel.de
Backpatch-through: 13
2020-07-26 15:09:46 -07:00
David Rowley 56788d2156 Allocate consecutive blocks during parallel seqscans
Previously we would allocate blocks to parallel workers during a parallel
sequential scan 1 block at a time.  Since other workers were likely to
request a block before a worker returns for another block number to work
on, this could lead to non-sequential I/O patterns in each worker which
could cause the operating system's readahead to perform poorly or not at
all.

Here we change things so that we allocate consecutive "chunks" of blocks
to workers and have them work on those until they're done, at which time
we allocate another chunk for the worker.  The size of these chunks is
based on the size of the relation.

Initial patch here was by Thomas Munro which showed some good improvements
just having a fixed chunk size of 64 blocks with a simple ramp-down near
the end of the scan. The revisions of the patch to make the chunk size
based on the relation size and the adjusted ramp-down in powers of two was
done by me, along with quite extensive benchmarking to determine the
optimal chunk sizes.

For the most part, benchmarks have shown significant performance
improvements for large parallel sequential scans on Linux, FreeBSD and
Windows using SSDs.  It's less clear how this affects the performance of
cloud providers.  Tests done so far are unable to obtain stable enough
performance to provide meaningful benchmark results.  It is possible that
this could cause some performance regressions on more obscure filesystems,
so we may need to later provide users with some ability to get something
closer to the old behavior.  For now, let's leave that until we see that
it's really required.

Author: Thomas Munro, David Rowley
Reviewed-by: Ranier Vilela, Soumyadeep Chakraborty, Robert Haas
Reviewed-by: Amit Kapila, Kirk Jamison
Discussion: https://postgr.es/m/CA+hUKGJ_EErDv41YycXcbMbCBkztA34+z1ts9VQH+ACRuvpxig@mail.gmail.com
2020-07-26 21:02:45 +12:00
Michael Paquier 11a68e4b53 Tweak behavior of pg_stat_activity.leader_pid
The initial implementation of leader_pid in pg_stat_activity added by
b025f32 took the approach to strictly print what a PGPROC entry
includes.  In short, if a backend has been involved in parallel query at
least once, leader_pid would remain set as long as the backend is alive.
For a parallel group leader, this means that the field would always be
set after it participated at least once in parallel query, and after
more discussions this could be confusing if using for example a
connection pooler.

This commit changes the data printed so as leader_pid becomes always
NULL for a parallel group leader, showing up a non-NULL value only for
the parallel workers, and actually as long as a parallel query is
running as workers are shut down once the query has completed.

This does not change the definition of any catalog, so no catalog bump
is needed.  Per discussion with Justin Pryzby, Álvaro Herrera, Julien
Rouhaud and me.

Discussion: https://postgr.es/m/20200721035145.GB17300@paquier.xyz
Backpatch-through: 13
2020-07-26 16:32:11 +09:00
Noah Misch 15e4419722 Remove optimization for RAND_poll() failing.
The loop to generate seed data will exit on RAND_status(), so we don't
need to handle the case of RAND_poll() failing separately.  Failures
here are rare, so this a code cleanup, essentially.

Daniel Gustafsson, reviewed by David Steele and Michael Paquier.

Discussion: https://postgr.es/m/9B038FA5-23E8-40D0-B932-D515E1D8F66A@yesql.se
2020-07-25 14:50:59 -07:00
Noah Misch ce4939ff70 Use RAND_poll() for seeding randomness after fork().
OpenSSL deprecated RAND_cleanup(), and OpenSSL 1.1.0 made it into a
no-op.  Replace it with RAND_poll(), per an OpenSSL community
recommendation.  While this has no user-visible consequences under
OpenSSL defaults, it might help under non-default settings.

Daniel Gustafsson, reviewed by David Steele and Michael Paquier.

Discussion: https://postgr.es/m/9B038FA5-23E8-40D0-B932-D515E1D8F66A@yesql.se
2020-07-25 14:50:59 -07:00
Tom Lane 0a0727ccfc Improve performance of binary COPY FROM through better buffering.
At least on Linux and macOS, fread() turns out to have far higher
per-call overhead than one could wish.  Reading 64KB of data at a time
and then parceling it out with our own memcpy logic makes binary COPY
from a file significantly faster --- around 30% in simple testing for
cases with narrow text columns (on Linux ... even more on macOS).

In binary COPY from frontend, there's no per-call fread(), and this
patch introduces an extra layer of memcpy'ing, but it still manages
to eke out a small win.  Apparently, the control-logic overhead in
CopyGetData() is enough to be worth avoiding for small fetches.

Bharath Rupireddy and Amit Langote, reviewed by Vignesh C,
cosmetic tweaks by me

Discussion: https://postgr.es/m/CALj2ACU5Bz06HWLwqSzNMN=Gupoj6Rcn_QVC+k070V4em9wu=A@mail.gmail.com
2020-07-25 16:34:35 -04:00
Tom Lane 8a37951eeb Mark built-in coercion functions as leakproof where possible.
Making these leakproof seems helpful since (for example) if you have a
function f(int8) that is leakproof, you don't want it to effectively
become non-leakproof when you apply it to an int4 or int2 column.
But that's what happens today, since the implicit up-coercion will
not be leakproof.

Most of the coercion functions that visibly can't throw errors are
functions that convert numeric datatypes to other, wider ones.
Notable is that float4_numeric and float8_numeric can be marked
leakproof; before commit a57d312a7 they could not have been.
I also marked the functions that coerce strings to "name" as leakproof;
that's okay today because they truncate silently, but if we ever
reconsidered that behavior then they could no longer be leakproof.

I desisted from marking rtrim1() as leakproof; it appears so right now,
but the code seems a little too complex and perhaps subject to change,
since it's shared with other SQL functions.

Discussion: https://postgr.es/m/459322.1595607431@sss.pgh.pa.us
2020-07-25 12:54:58 -04:00
Amit Kapila 2a2494229a Fix buffer usage stats for nodes above Gather Merge.
Commit 85c9d347 addressed a similar problem for Gather and Gather
Merge nodes but forgot to account for nodes above parallel nodes.  This
still works for nodes above Gather node because we shut down the workers
for Gather node as soon as there are no more tuples.  We can do a similar
thing for Gather Merge as well but it seems better to account for stats
during nodes shutdown after completing the execution.

Reported-by: Stéphane Lorek, Jehan-Guillaume de Rorthais
Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Amit Kapila
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/20200718160206.584532a2@firost
2020-07-25 10:20:39 +05:30
Tom Lane 79d6d1a277 Replace TS_execute's TS_EXEC_CALC_NOT flag with TS_EXEC_SKIP_NOT.
It's fairly silly that ignoring NOT subexpressions is TS_execute's
default behavior.  It's wrong on its face and it encourages errors
of omission.  Moreover, the only two remaining callers that aren't
specifying CALC_NOT are in ts_headline calculations, and it's very
arguable that those are bugs: if you've specified "!foo" in your
query, why would you want to get a headline that includes "foo"?

Hence, rip that out and change the default behavior to be to calculate
NOT accurately.  As a concession to the slim chance that there is still
somebody somewhere who needs the incorrect behavior, provide a new
SKIP_NOT flag to explicitly request that.

Back-patch into v13, mainly because it seems better to change this
at the same time as the previous commit's rejiggering of TS_execute
related APIs.  Any outside callers affected by this change are
probably also affected by that one.

Discussion: https://postgr.es/m/CALT9ZEE-aLotzBg-pOp2GFTesGWVYzXA3=mZKzRDa_OKnLF7Mg@mail.gmail.com
2020-07-24 15:43:56 -04:00
Tom Lane 2f2007fbb2 Fix assorted bugs by changing TS_execute's callback API to ternary logic.
Text search sometimes failed to find valid matches, for instance
'!crew:A'::tsquery might fail to locate 'crew:1B'::tsvector during
an index search.  The root of the issue is that TS_execute's callback
functions were not changed to use ternary (yes/no/maybe) reporting
when we made the search logic itself do so.  It's somewhat annoying
to break that API, but on the other hand we now see that any code
using plain boolean logic is almost certainly broken since the
addition of phrase search.  There seem to be very few outside callers
of this code anyway, so we'll just break them intentionally to get
them to adapt.

This allows removal of tsginidx.c's private re-implementation of
TS_execute, since that's now entirely duplicative.  It's also no
longer necessary to avoid use of CALC_NOT in tsgistidx.c, since
the underlying callbacks can now do something reasonable.

Back-patch into v13.  We can't change this in stable branches,
but it seems not quite too late to fix it in v13.

Tom Lane and Pavel Borisov

Discussion: https://postgr.es/m/CALT9ZEE-aLotzBg-pOp2GFTesGWVYzXA3=mZKzRDa_OKnLF7Mg@mail.gmail.com
2020-07-24 15:26:51 -04:00
Peter Eisentraut 25244b8972 Rename configure.in to configure.ac
The new name has been preferred by Autoconf for a long time.  Future
versions of Autoconf will warn about the old name.

Discussion: https://www.postgresql.org/message-id/flat/e796c185-5ece-8569-248f-dd3799701be1%402ndquadrant.com
2020-07-24 10:42:08 +02:00
Thomas Munro 42dee8b8e3 Fix error message.
Remove extra space.  Back-patch to all releases, like commit 7897e3bb.

Author: Lu, Chenyang <lucy.fnst@cn.fujitsu.com>
Discussion: https://postgr.es/m/795d03c6129844d3803e7eea48f5af0d%40G08CNEXMBPEKD04.g08.fujitsu.local
2020-07-23 21:10:49 +12:00
Amit Kapila c55040ccd0 WAL Log invalidations at command end with wal_level=logical.
When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in top-transaction, and then
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

Bump XLOG_PAGE_MAGIC, since this introduces XLOG_XACT_INVALIDATIONS.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-07-23 08:34:48 +05:30
Tom Lane a57d312a77 Support infinity and -infinity in the numeric data type.
Add infinities that behave the same as they do in the floating-point
data types.  Aside from any intrinsic usefulness these may have,
this closes an important gap in our ability to convert floating
values to numeric and/or replace float-based APIs with numeric.

The new values are represented by bit patterns that were formerly
not used (although old code probably would take them for NaNs).
So there shouldn't be any pg_upgrade hazard.

Patch by me, reviewed by Dean Rasheed and Andrew Gierth

Discussion: https://postgr.es/m/606717.1591924582@sss.pgh.pa.us
2020-07-22 19:19:44 -04:00
Thomas Munro a5073871ea Fix conversion table generator scripts.
convutils.pm used implicit conversion of undefined value to integer
zero.  Some of conversion scripts are susceptible to regexp greediness.
Fix, avoiding whitespace changes in the output.  Also update ICU URLs
that moved.

No need to back-patch, because the output of these scripts is also in
the source tree so we shouldn't need to rerun them on back-branches.

Author: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ7SEGLbj%3D%3DTQCcyKRA9aqj8%2B6L%3DexSq1y25TA%3DWxLziQ%40mail.gmail.com
2020-07-22 16:50:03 +12:00
Michael Paquier e47c2602aa Fix comment in sha2.h
An incorrect reference to SHA-1 was present.

Author: Daniel Gustafsson
Discussion: https://postgr.es/m/FE26C953-FA87-4BB9-9105-AA1F8705B0D0@yesql.se
2020-07-22 10:16:21 +09:00
Tom Lane bd0d893aa7 neqjoinsel must now pass through collation to eqjoinsel.
Since commit 044c99bc5, eqjoinsel passes the passed-in collation
to any operators it invokes.  However, neqjoinsel failed to pass
on whatever collation it got, so that if we invoked a
collation-dependent operator via that code path, we'd get "could not
determine which collation to use for string comparison" or the like.

Per report from Justin Pryzby.  Back-patch to v12, like the previous
commit.

Discussion: https://postgr.es/m/20200721191606.GL5748@telsasoft.com
2020-07-21 19:41:03 -04:00
Peter Geoghegan 4a70f829d8 Add nbtree Valgrind buffer lock checks.
Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page
provides very weak guarantees, especially compared to heapam, where it's
often safe to read a page while only holding a buffer pin.  This commit
has Valgrind enforce the following rule: it is never okay to access an
nbtree buffer without holding both a pin and a lock on the buffer.

A draft version of this patch detected questionable code that was
cleaned up by commits fa7ff642 and 7154aa16.  The code in question used
to access an nbtree buffer page's special/opaque area with no buffer
lock (only a buffer pin).  This practice (which isn't obviously unsafe)
is hereby formally disallowed in nbtree.  There doesn't seem to be any
reason to allow it, and banning it keeps things simple for Valgrind.

The new checks are implemented by adding custom nbtree client requests
(located in LockBuffer() wrapper functions); these requests are
"superimposed" on top of the generic bufmgr.c Valgrind client requests
added by commit 1e0dfd16.  No custom resource management cleanup code is
needed to undo the effects of marking buffers as non-accessible under
this scheme.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-07-21 15:50:58 -07:00
Tom Lane 670c0a1d47 Weaken type-OID-matching checks in array_recv and record_recv.
Rather than always insisting on an exact match of the type OID in the
data to the element type or column type we expect, complain only when
both OIDs fall within the manually-assigned range.  This acknowledges
the reality that user-defined types don't have stable OIDs, while
still preserving some of the mistake-detection value of the old test.

(It's not entirely clear whether to error if one OID is manually
assigned and the other isn't.  But perhaps that case could arise in
cross-version cases where a former extension type has been imported
into core, so I let it pass.)

This change allows us to remove the prohibition on binary transfer
of user-defined arrays and composites in the recently-landed support
for binary logical replication (commit 9de77b545).  We can just
unconditionally drop that check, since if the client has asked for
binary transfer it must be >= v14 and must have this change.

Discussion: https://postgr.es/m/CADK3HH+R3xMn=8t3Ct+uD+qJ1KD=Hbif5NFMJ+d5DkoCzp6Vgw@mail.gmail.com
2020-07-21 15:19:46 -04:00
Tom Lane fc032bed2f Be more careful about marking catalog columns NOT NULL by default.
The bug fixed in commit 72eab84a5 would not have occurred if initdb
had a less surprising rule about which columns should be marked
NOT NULL by default.  Let's make that rule be strictly that the
column must be fixed-width and its predecessors must be fixed-width
and NOT NULL, removing the hacky and unsafe exceptions for oidvector
and int2vector.

Since we do still want all existing oidvector and int2vector columns
to be marked NOT NULL, we have to put BKI_FORCE_NOT_NULL labels on
them.  But making this less magic and more documented seems like a
good idea, even if it's a shade more verbose.

I didn't bump catversion since the initial catalog contents are
not actually changed by this patch.  Note however that the
contents of postgres.bki do change, and feeding an old copy of
that to a new backend will produce wrong results.

Discussion: https://postgr.es/m/204760.1595181800@sss.pgh.pa.us
2020-07-21 13:03:48 -04:00
Tom Lane 3e66019f15 Assert that we don't insert nulls into attnotnull catalog columns.
The executor checks for this error, and so does the bootstrap catalog
loader, but we never checked for it in retail catalog manipulations.
The folly of that has now been exposed, so let's add assertions
checking it.  Checking in CatalogTupleInsert[WithInfo] and
CatalogTupleUpdate[WithInfo] should be enough to cover this.

Back-patch to v10; the aforesaid functions didn't exist before that,
and it didn't seem worth adapting the patch to the oldest branches.
But given the risk of JIT crashes, I think we certainly need this
as far back as v11.

Pre-v13, we have to explicitly exclude pg_subscription.subslotname
and pg_subscription_rel.srsublsn from the checks, since they are
mismarked.  (Even if we change our mind about applying BKI_FORCE_NULL
in the branch tips, it doesn't seem wise to have assertions that
would fire in existing databases.)

Discussion: https://postgr.es/m/298837.1595196283@sss.pgh.pa.us
2020-07-21 12:38:08 -04:00