Commit Graph

8829 Commits

Author SHA1 Message Date
Thomas Munro d614aae02e Improve comment in sync.h.
Per off-list complaint from Andres Freund.
2019-04-10 12:49:49 +12:00
Noah Misch 617dc6d299 Avoid "could not reattach" by providing space for concurrent allocation.
We've long had reports of intermittent "could not reattach to shared
memory" errors on Windows.  Buildfarm member dory fails that way when
PGSharedMemoryReAttach() execution overlaps with creation of a thread
for the process's "default thread pool".  Fix that by providing a second
region to receive asynchronous allocations that would otherwise intrude
into UsedShmemSegAddr.  In pgwin32_ReserveSharedMemoryRegion(), stop
trying to free reservations landing at incorrect addresses; the caller's
next step has been to terminate the affected process.  Back-patch to 9.4
(all supported versions).

Reviewed by Tom Lane.  He also did much of the prerequisite research;
see commit bcbf2346d6.

Discussion: https://postgr.es/m/20190402135442.GA1173872@rfd.leadboat.com
2019-04-08 21:39:00 -07:00
Andres Freund 6421011ea2 tableam: comment and formatting fixes.
Author: Heikki Linnakangas
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f122@iki.fi
2019-04-08 16:24:36 -07:00
Fujii Masao 119dcfad98 Add vacuum_truncate reloption.
vacuum_truncate controls whether vacuum tries to truncate off
any empty pages at the end of the table. Previously vacuum always
tried to do the truncation. However, the truncation could cause
some problems; for example, an ACCESS EXCLUSIVE lock needs to
be taken on the table during the truncation, which can cause
query cancellations on the standby even if hot_standby_feedback
is true. Setting this reloption to false can be helpful to avoid
such problems.

Author: Tsunakawa Takayuki
Reviewed-By: Julien Rouhaud, Masahiko Sawada, Michael Paquier, Kirk Jamison and Fujii Masao
Discussion: https://postgr.es/m/CAHGQGwE5UqFqSq1=kV3QtTUtXphTdyHA-8rAj4A=Y+e4kyp3BQ@mail.gmail.com
2019-04-08 16:43:57 +09:00
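A minimal usage sketch (the table name is hypothetical):

```sql
-- Disable end-of-table truncation for one table.
ALTER TABLE measurements SET (vacuum_truncate = false);
-- Restore the default (truncation enabled).
ALTER TABLE measurements RESET (vacuum_truncate);
```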
Andres Freund 41f5e04aec Fix a number of issues around modifying a previously updated row.
This commit fixes three, unfortunately related, issues:

1) Since 5db6df0c01, the introduction of DML via tableam, it was
   possible to trigger "ERROR: unexpected table_lock_tuple status: 1"
   when updating a row that was previously updated in the same
   transaction - but only when the row had first been updated in a
   concurrent transaction (and READ COMMITTED was used). The reason
   was that this case simply wasn't expected. Fixing that led to:

2) Even before the above commit, there were error checks (introduced
   in 6868ed7491) preventing a row from being updated by different
   commands within the same statement (say in a function called by an
   UPDATE) - but that check wasn't performed when the row was first
   updated in a concurrent transaction; instead the second update was
   silently skipped in that case. After this change we throw the same
   error as we would without the concurrent transaction.

3) The error checks (introduced in 6868ed7491) preventing such
   updates emitted the same error message for both DELETE and
   UPDATE ("tuple to be updated was already modified by an operation
   triggered by the current command"). While that could be changed
   separately, it made it hard to write tests that verify the
   correct behavior of the code.

This commit changes heap's implementation of table_lock_tuple() to
return TM_SelfModified instead of TM_Invisible (previously loosely
modeled after EvalPlanQualFetch), and teaches nodeModifyTable.c to
handle that in response to table_lock_tuple() and not just in response
to table_(delete|update).

Additionally it fixes the wrong error message (see 3 above). The
comment for table_lock_tuple() is also adjusted to state that
TM_Deleted won't return information in TM_FailureData - it'll not
always be available.

This also adds tests to ensure that DELETE/UPDATE correctly error out
when affecting a row that concurrently was modified by another
transaction.

Author: Andres Freund
Reported-By: Tom Lane, when investigating a bug fix to another bug
    by Amit Langote
Discussion: https://postgr.es/m/19321.1554567786@sss.pgh.pa.us
2019-04-07 22:14:47 -07:00
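A hedged sketch of the statement-level self-update case that now errors out consistently; the table, function, and trigger names are hypothetical:

```sql
CREATE TABLE accounts (id int PRIMARY KEY, balance int);
INSERT INTO accounts VALUES (1, 100);

-- A BEFORE UPDATE trigger that updates the same row again; the guard on
-- pg_trigger_depth() keeps the trigger from recursing indefinitely.
CREATE FUNCTION bump() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  IF pg_trigger_depth() = 1 THEN
    UPDATE accounts SET balance = balance + 1 WHERE id = NEW.id;
  END IF;
  RETURN NEW;
END $$;
CREATE TRIGGER accounts_bump BEFORE UPDATE ON accounts
  FOR EACH ROW EXECUTE FUNCTION bump();

UPDATE accounts SET balance = 0 WHERE id = 1;
-- ERROR:  tuple to be updated was already modified by an operation
-- triggered by the current command
```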
Peter Eisentraut 03f9e5cba0 Report progress of REINDEX operations
This uses the same infrastructure that the CREATE INDEX progress
reporting uses.  Add a column to pg_stat_progress_create_index to
report the OID of the index being worked on.  This was not necessary
for CREATE INDEX, but it's useful for REINDEX.

Also edit the phase descriptions a bit to be more consistent with the
source code comments.

Discussion: https://www.postgresql.org/message-id/ef6a6757-c36a-9e81-123f-13b19e36b7d7%402ndquadrant.com
2019-04-07 12:35:29 +02:00
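A monitoring sketch; run from a second session while an index build or rebuild is in progress:

```sql
SELECT pid, datname, index_relid::regclass AS index, command, phase
FROM pg_stat_progress_create_index;
```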
Peter Eisentraut 106f2eb664 Cast pg_stat_progress_cluster.cluster_index_relid to oid
It's tracked internally as bigint, but when presented to the user it
should be oid.
2019-04-07 10:31:32 +02:00
Michael Paquier 249d649996 Add support for TCP user timeout in libpq and the backend server
Similarly to the set of parameters for keepalive, a connection parameter
for libpq is added as well as a backend GUC, called tcp_user_timeout.

Increasing the TCP user timeout is useful to allow a connection to
survive extended periods without end-to-end connectivity, and
decreasing it allows applications to fail faster.  By default, the
parameter is 0, which makes the connection use the system default, and
its handling follows a logic close to that of the keepalive
parameters.  When connecting through a Unix-domain socket, the
parameters have no effect.

Author: Ryohei Nagaura
Reviewed-by: Fabien Coelho, Robert Haas, Kyotaro Horiguchi, Kirk
Jamison, Mikalai Keida, Takayuki Tsunakawa, Andrei Yahorau
Discussion: https://postgr.es/m/EDA4195584F5064680D8130B1CA91C45367328@G01JPEXMBYT04
2019-04-06 15:23:37 +09:00
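A usage sketch (host name hypothetical; the value is in milliseconds):

```sql
-- Server side: 0 means use the operating system default.
ALTER SYSTEM SET tcp_user_timeout = 10000;
SELECT pg_reload_conf();
-- Client side, as a libpq connection option:
--   psql "host=db.example.com tcp_user_timeout=10000"
```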
Tom Lane 959d00e9db Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results.  This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after starting up just
the first child node, not all of them.

However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child.  Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would).  This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.

David Rowley, reviewed by Julien Rouhaud and Antonin Houska

Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-05 19:20:43 -04:00
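A sketch of the plan shape this enables; whether the planner actually picks an ordered Append still depends on costs:

```sql
CREATE TABLE parted (a int) PARTITION BY RANGE (a);
CREATE TABLE parted_1 PARTITION OF parted FOR VALUES FROM (0) TO (100);
CREATE TABLE parted_2 PARTITION OF parted FOR VALUES FROM (100) TO (200);
CREATE INDEX ON parted_1 (a);
CREATE INDEX ON parted_2 (a);
-- Ordered output matching the partition order may now use a plain
-- Append over per-partition index scans instead of a MergeAppend.
EXPLAIN (COSTS OFF) SELECT * FROM parted ORDER BY a;
```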
Alvaro Herrera 9f06d79ef8 Add facility to copy replication slots
This allows the user to create duplicates of existing replication slots,
either logical or physical, and even changing properties such as whether
they are temporary or the output plugin used.

There are multiple uses for this, such as initializing multiple
replicas using the slot for one base backup, investigating logical
replication issues, and selecting a different output plugin.

Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Andres Freund, Petr Jelinek
Discussion: https://postgr.es/m/CAD21AoAm7XX8y_tOPP6j4Nzzch12FvA1wPqiO690RCk+uYVstg@mail.gmail.com
2019-04-05 18:05:18 -03:00
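A usage sketch; the slot names are hypothetical:

```sql
-- Copy a logical slot, keeping it permanent but switching the plugin.
SELECT pg_copy_logical_replication_slot('orig_slot', 'copied_slot',
                                        false, 'test_decoding');
-- Physical slots can be copied as well.
SELECT pg_copy_physical_replication_slot('orig_phys', 'copied_phys');
```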
Noah Misch 82150a05be Revert "Consistently test for in-use shared memory."
This reverts commits 2f932f71d9,
16ee6eaf80 and
6f0e190056.  The buildfarm has revealed
several bugs.  Back-patch like the original commits.

Discussion: https://postgr.es/m/20190404145319.GA1720877@rfd.leadboat.com
2019-04-05 00:00:52 -07:00
Andres Freund ea97e440b8 Harden tableam against nonexistent / wrong kind of AMs.
Previously it was allowed to set default_table_access_method to an
empty string. That makes sense for default_tablespace, where that was
copied from, as it signals falling back to the database's default
tablespace. As there is no equivalent for table AMs, forbid that.

Also make sure to throw a usable error when creating a table using an
index AM, by using get_am_type_oid() to implement get_table_am_oid()
instead of a separate copy. Previously we'd error out only later, in
GetTableAmRoutine().

Thirdly remove GetTableAmRoutineByAmId() - it was only used in an
earlier version of 8586bf7ed8.

Add tests for the above (some for index AMs as well).
2019-04-04 17:39:39 -07:00
Andres Freund 86b85044e8 tableam: Add table_multi_insert() and revamp/speed-up COPY FROM buffering.
This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.

A simple conversion of COPY FROM to use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and the cost of dropping / creating
slots when there are constant partition changes is measurable.

Thus instead revamp the COPY FROM buffering for partitioned tables to
allow buffering inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there have been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actually speed such
cases up significantly.

By using slots, COPY of very narrow rows into unlogged / temporary
tables might slow down very slightly (due to the indirect function
calls).

Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
2019-04-04 16:28:18 -07:00
Robert Haas a96c41feec Allow VACUUM to be run with index cleanup disabled.
This commit adds a new reloption, vacuum_index_cleanup, which
controls whether index cleanup is performed for a particular
relation by default.  It also adds a new option to the VACUUM
command, INDEX_CLEANUP, which can be used to override the
reloption.  If neither the reloption nor the VACUUM option is
used, the default is true, as before.

Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
The wording of the documentation is mostly due to me.

Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com
2019-04-04 15:04:43 -04:00
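A usage sketch (table name hypothetical):

```sql
-- Skip index cleanup for a single run:
VACUUM (INDEX_CLEANUP FALSE) bigtable;
-- Or change the per-table default via the reloption:
ALTER TABLE bigtable SET (vacuum_index_cleanup = false);
```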
Stephen Frost 87e16db5eb Move the be_gssapi_get_* prototypes
The be_gssapi_get_* prototypes were put close to similar ones for
SSL - but a bit too close, since that meant they ended up only being
included for SSL-enabled builds.  Move those to be under ENABLE_GSS
instead.

Pointed out by Tom.
2019-04-04 11:11:46 -04:00
Thomas Munro 3eb77eba5a Refactor the fsync queue for wider use.
Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off fsync work in the same way.

For now only md.c uses the new interface, but other users are being
proposed.  Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one.

Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key.  This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse or high
segment numbers).

Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
2019-04-04 23:38:38 +13:00
Andres Freund b73c3a1196 tableam: basic documentation.
This adds documentation about the user-oriented parts of table access
methods (i.e. the default_table_access_method GUC and the USING clause
for CREATE TABLE etc), adds a basic chapter about the table access
method interface, and adds a note to storage.sgml that its contents
don't necessarily apply for non-builtin AMs.

Author: Haribabu Kommi and Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-04-03 17:40:29 -07:00
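The user-facing pieces documented here, in brief:

```sql
-- Choose the AM explicitly at creation time...
CREATE TABLE t (a int) USING heap;
-- ...or set the default for subsequent CREATE TABLEs in the session.
SET default_table_access_method = 'heap';
```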
Noah Misch 2f932f71d9 Consistently test for in-use shared memory.
postmaster startup scrutinizes any shared memory segment recorded in
postmaster.pid, exiting if that segment matches the current data
directory and has an attached process.  When the postmaster.pid file was
missing, a starting postmaster used weaker checks.  Change to use the
same checks in both scenarios.  This increases the chance of a startup
failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1
postmaster.pid` && rm postmaster.pid && pg_ctl -w start".  A postmaster
will no longer recycle segments pertaining to other data directories.
That's good for production, but it's bad for integration tests that
crash a postmaster and immediately delete its data directory.  Such a
test now leaks a segment indefinitely.  No "make check-world" test does
that.  win32_shmem.c already avoided all these problems.  In 9.6 and
later, enhance PostgresNode to facilitate testing.  Back-patch to 9.4
(all supported versions).

Reviewed by Daniel Gustafsson and Kyotaro HORIGUCHI.

Discussion: https://postgr.es/m/20130911033341.GD225735@tornado.leadboat.com
2019-04-03 17:03:46 -07:00
Tomas Vondra ea569d64ac Add SETTINGS option to EXPLAIN, to print modified settings.
Query planning is affected by a number of configuration options, and it
may be crucial to know which of those options were set to non-default
values.  With this patch you can say EXPLAIN (SETTINGS ON) to include
that information in the query plan.  Only options affecting planning
whose values differ from the built-in default are printed.

This patch also adds an auto_explain.log_settings option, providing
the same capability in the auto_explain module.

Author: Tomas Vondra
Reviewed-by: Rafia Sabih, John Naylor
Discussion: https://postgr.es/m/e1791b4c-df9c-be02-edc5-7c8874944be0@2ndquadrant.com
2019-04-04 00:04:31 +02:00
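A usage sketch:

```sql
SET enable_seqscan = off;  -- any non-default planner setting
EXPLAIN (SETTINGS ON) SELECT * FROM pg_class;
-- The plan output gains a "Settings:" line listing the modified values.
```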
Alvaro Herrera 799e220346 Log all statements from a sample of transactions
This is useful to obtain a view of the different transaction types in an
application, regardless of the durations of the statements each runs.

Author: Adrien Nayrat
Reviewed-by: Masahiko Sawada, Hayato Kuroda, Andres Freund
2019-04-03 18:43:59 -03:00
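A usage sketch; the GUC this commit adds is log_transaction_sample_rate:

```sql
-- Log every statement of roughly 1% of transactions.
ALTER SYSTEM SET log_transaction_sample_rate = 0.01;
SELECT pg_reload_conf();
```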
Tomas Vondra c50b3158bf Reduce overhead of pg_mcv_list (de)serialization
Commit ea4e1c0e8f resolved issues with memory alignment in serialized
pg_mcv_list values, but it required copying data to/from the varlena
buffer during serialization and deserialization.  As the MCV lists may
be fairly large, the overhead (memory consumption, CPU usage) can get
rather significant too.

This change tweaks the serialization format so that the alignment is
correct with respect to the varlena value, and so the parts may be
accessed directly without copying the data.

Catversion bump, as it affects existing pg_statistic_ext data.
2019-04-03 21:23:40 +02:00
Stephen Frost b0b39f72b9 GSSAPI encryption support
On both the frontend and backend, prepare for GSSAPI encryption
support by moving common code for error handling into a separate file.
Fix a TODO for handling multiple status messages in the process.
Eliminate the OIDs, which have not been needed for some time.

Add frontend and backend encryption support functions.  Keep the
context initiation for authentication-only separate on both the
frontend and backend in order to avoid concerns about changing the
requested flags to include encryption support.

In postmaster, pull GSSAPI authorization checking into a shared
function.  Also share the initiator name between the encryption and
non-encryption codepaths.

For HBA, add "hostgssenc" and "hostnogssenc" entries that behave
similarly to their SSL counterparts.  "hostgssenc" requires either
"gss", "trust", or "reject" for its authentication.

Similarly, add a "gssencmode" parameter to libpq.  Supported values are
"disable", "require", and "prefer".  Notably, negotiation will only be
attempted if credentials can be acquired.  Move credential acquisition
into its own function to support this behavior.

Add a simple pg_stat_gssapi view similar to pg_stat_ssl, for monitoring
if GSSAPI authentication was used, what principal was used, and if
encryption is being used on the connection.

Finally, add documentation for everything new, and update existing
documentation on connection security.

Thanks to Michael Paquier for the Windows fixes.

Author: Robbie Harwood, with changes to the read/write functions by me.
Reviewed in various forms and at different times by: Michael Paquier,
   Andres Freund, David Steele.
Discussion: https://www.postgresql.org/message-id/flat/jlg1tgq1ktm.fsf@thriss.redhat.com
2019-04-03 15:02:33 -04:00
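A configuration sketch; host names and HBA lines are illustrative:

```sql
-- pg_hba.conf entries (configuration, shown here as comments):
--   hostgssenc    all  all  0.0.0.0/0  gss
--   hostnogssenc  all  all  0.0.0.0/0  reject
-- libpq client: psql "host=db.example.com gssencmode=require"

-- Monitoring from SQL:
SELECT pid, gss_authenticated, principal, encrypted FROM pg_stat_gssapi;
```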
Alvaro Herrera f56f8f8da6 Support foreign keys that reference partitioned tables
Previously, while primary keys could be made on partitioned tables, it
was not possible to define foreign keys that reference those primary
keys.  Now it is possible to do that.

Author: Álvaro Herrera
Reviewed-by: Amit Langote, Jesper Pedersen
Discussion: https://postgr.es/m/20181102234158.735b3fevta63msbj@alvherre.pgsql
2019-04-03 14:40:21 -03:00
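A minimal sketch of the newly allowed case:

```sql
CREATE TABLE customers (id int PRIMARY KEY) PARTITION BY RANGE (id);
CREATE TABLE customers_1 PARTITION OF customers
    FOR VALUES FROM (1) TO (1000);
-- Previously rejected; now works:
CREATE TABLE orders (
    id int PRIMARY KEY,
    customer_id int REFERENCES customers (id)
);
```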
Heikki Linnakangas 9155580fd5 Generate less WAL during GiST, GIN and SP-GiST index build.
Instead of WAL-logging every modification during the build separately,
first build the index without any WAL-logging, and make a separate pass
through the index at the end, to write all pages to the WAL. This
significantly reduces the amount of WAL generated, and is usually also
faster, despite the extra I/O needed for the extra scan through the index.
WAL generated this way is also faster to replay.

For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
be marked with a valid (i.e. non-zero) LSN, so that the parent-child
LSN-NSN interlock works correctly. We now use magic value 1 for that during
index build. Change the fake LSN counter to begin from 1000, so that 1 is
safely smaller than any real or fake LSN. 2 would've been enough for our
purposes, but let's reserve a bigger range, in case we need more special
values in the future.

Author: Anastasia Lubennikova, Andrey V. Lepikhov
Reviewed-by: Heikki Linnakangas, Dmitry Dolgov
2019-04-03 17:03:15 +03:00
Alvaro Herrera ab0dfc961b Report progress of CREATE INDEX operations
This uses the progress reporting infrastructure added by c16dc1aca5,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.

There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific.  The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.

The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan.  (The index validation index scan requires
patching each AM, which has not been included here.)

Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
2019-04-02 15:18:08 -03:00
Stephen Frost 4d0e994eed Add support for partial TOAST decompression
When asked for a slice of a TOAST entry, decompress enough to return the
slice instead of decompressing the entire object.

For use cases where the slice is at, or near, the beginning of the entry,
this avoids a lot of unnecessary decompression work.

This changes the signature of pglz_decompress() by adding a boolean to
indicate if it's ok for the call to finish before consuming all of the
source or destination buffers.

Author: Paul Ramsey
Reviewed-By: Rafia Sabih, Darafei Praliaskouski, Regina Obe
Discussion: https://postgr.es/m/CACowWR07EDm7Y4m2kbhN_jnys%3DBBf9A6768RyQdKm_%3DNpkcaWg%40mail.gmail.com
2019-04-02 12:35:32 -04:00
Etsuro Fujita d50d172e51 postgres_fdw: Perform the (FINAL, NULL) upperrel operations remotely.
The upper-planner pathification allows FDWs to arrange to push down
different types of upper-stage operations to the remote side.  This
commit teaches postgres_fdw to do it for the (FINAL, NULL) upperrel,
which is responsible for doing LockRows, LIMIT, and/or ModifyTable.
This provides the ability for postgres_fdw to handle SELECT commands
so that it 1) skips the LockRows step (if any) (note that this is
safe since it performs early locking) and 2) pushes down the LIMIT
and/or OFFSET restrictions (if any) to the remote side.  This doesn't
handle the INSERT/UPDATE/DELETE cases.

Author: Etsuro Fujita
Reviewed-By: Antonin Houska and Jeff Janes
Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk
2019-04-02 20:30:45 +09:00
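A sketch, assuming a postgres_fdw foreign table named remote_orders:

```sql
-- EXPLAIN VERBOSE should now show the LIMIT inside the Remote SQL
-- instead of a local Limit node on top of the foreign scan.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM remote_orders ORDER BY id LIMIT 10;
```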
Etsuro Fujita aef65db676 Refactor create_limit_path() to share cost adjustment code with FDWs.
This is in preparation for an upcoming commit.

Author: Etsuro Fujita
Reviewed-By: Antonin Houska and Jeff Janes
Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk
2019-04-02 19:55:12 +09:00
Thomas Munro 475861b261 Add wal_recycle and wal_init_zero GUCs.
On at least ZFS, it can be beneficial to create new WAL files every
time and not to bother zero-filling them.  Since it's not clear which
other filesystems might benefit from one or both of those things,
add individual GUCs to control those two behaviors independently and
make only very general statements in the docs.

Author: Jerry Jelinek, with some adjustments by Thomas Munro
Reviewed-by: Alvaro Herrera, Andres Freund, Tomas Vondra, Robert Haas and others
Discussion: https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com
2019-04-02 14:37:14 +13:00
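A configuration sketch for the copy-on-write filesystem case:

```sql
ALTER SYSTEM SET wal_recycle = off;
ALTER SYSTEM SET wal_init_zero = off;
SELECT pg_reload_conf();
```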
Andres Freund d45e401586 tableam: Add table_finish_bulk_insert().
This replaces the previous calls of heap_sync() in places using
bulk-insert. By passing in the flags used for bulk-insert the AM can
decide (first at insert time and then during the finish call) which of
the optimizations apply to it, and what operations are necessary to
finish a bulk insert operation.

Also change the HEAP_INSERT_* flags to TABLE_INSERT_*, and rename
hi_options to ti_options.

These changes are made even in copy.c, which hasn't yet been converted
to tableam. There's no harm in doing so.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-04-01 14:41:42 -07:00
Peter Eisentraut cc8d415117 Unified logging system for command-line programs
This unifies the various ad hoc logging (message printing, error
printing) systems used throughout the command-line programs.

Features:

- Program name is automatically prefixed.

- Message string does not end with newline.  This removes a common
  source of inconsistencies and omissions.

- Additionally, a final newline is automatically stripped, simplifying
  use of PQerrorMessage() etc., another common source of mistakes.

- I converted error message strings to use %m where possible.

- As a result of the above several points, more translatable message
  strings can be shared between different components and between
  frontends and backend, without gratuitous punctuation or whitespace
  differences.

- There is support for setting a "log level".  This is not meant to be
  user-facing, but can be used internally to implement debug or
  verbose modes.

- Lazy argument evaluation, so no significant overhead if logging at
  some level is disabled.

- Some color in the messages, similar to gcc and clang.  Set
  PG_COLOR=auto to try it out.  Some colors are predefined, but can be
  customized by setting PG_COLORS.

- Common files (common/, fe_utils/, etc.) can handle logging much more
  simply by just using one API without worrying too much about the
  context of the calling program, requiring callbacks, or having to
  pass "progname" around everywhere.

- Some programs called setvbuf() to make sure that stderr is
  unbuffered, even on Windows.  But not all programs did that.  This
  is now done centrally.

Soft goals:

- Reduces vertical space use and visual complexity of error reporting
  in the source code.

- Encourages more deliberate classification of messages.  For example,
  in some cases it wasn't clear without analyzing the surrounding code
  whether a message was meant as an error or just an info.

- Concepts and terms are vaguely aligned with popular logging
  frameworks such as log4j and Python logging.

This is all just about printing stuff out.  Nothing affects program
flow (e.g., fatal exits).  The uses are just too varied to do that.
Some existing code had wrappers that do some kind of print-and-exit,
and I adapted those.

I tried to keep the output mostly the same, but there is a lot of
historical baggage to unwind and special cases to consider, and I
might not always have succeeded.  One significant change is that
pg_rewind used to write all error messages to stdout.  That is now
changed to stderr.

Reviewed-by: Donald Dong <xdong@csumb.edu>
Reviewed-by: Arthur Zakirov <a.zakirov@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/6a609b43-4f57-7348-6480-bd022f924310@2ndquadrant.com
2019-04-01 20:01:35 +02:00
Alexander Korotkov 0a02e2ae02 GIN support for @@ and @? jsonpath operators
This commit makes the existing GIN operator classes jsonb_ops and
jsonb_path_ops support the "jsonb @@ jsonpath" and "jsonb @? jsonpath"
operators.  The basic idea is to extract statements of the following
form out of the jsonpath:

 key1.key2. ... .keyN = const

The rest of the jsonpath is rechecked from the heap.

Catversion is bumped.

Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Nikita Glukhov, Alexander Korotkov
Reviewed-by: Jonathan Katz, Pavel Stehule
2019-04-01 18:08:52 +03:00
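A usage sketch:

```sql
CREATE TABLE docs (j jsonb);
CREATE INDEX ON docs USING gin (j jsonb_path_ops);
-- Both operators can now use the GIN index:
SELECT * FROM docs WHERE j @? '$.tags[*] ? (@ == "pg")';
SELECT * FROM docs WHERE j @@ '$.a == 1';
```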
Andres Freund bfbcad478f tableam: bitmap table scan.
This moves bitmap heap scan support to below an optional tableam
callback. It's optional as the whole concept of bitmap heapscans is
fairly block specific.

This basically moves the work previously done in bitgetpage() into the
new scan_bitmap_next_block callback, and the direct poking into the
buffer done in BitmapHeapNext() into the new scan_bitmap_next_tuple()
callback.

The abstraction is currently somewhat leaky because
nodeBitmapHeapscan.c's prefetching and visibilitymap based logic
remains - it's likely that we'll later have to move more into the
AM. But it's not trivial to do so without introducing a significant
amount of code duplication between the AMs, so that's a project for
later.

Note that now nodeBitmapHeapscan.c and the associated node types are a
bit misnamed. But it's not clear whether renaming wouldn't be a cure
worse than the disease. Either way, that'd be best done in a separate
commit.

Author: Andres Freund
Reviewed-By: Robert Haas (in an older version)
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-31 18:37:57 -07:00
Andres Freund 73c954d248 tableam: sample scan.
This moves sample scan support to below tableam. It's not optional as
there is, in contrast to e.g. bitmap heap scans, no alternative way to
perform tablesample queries. If an AM can't deal with the block based
API, it will have to throw an ERROR.

The tableam callbacks for this are block based, but given the current
TsmRoutine interface, that seems to be required.

The new interface doesn't require TsmRoutines to perform visibility
checks anymore - that requires the TsmRoutine to know details about
the AM, which we want to avoid.  To continue to allow taking the
returned number of tuples into account, SampleScanState now has a
donetuples field (which previously e.g. existed in
SystemRowsSamplerData), which is only incremented after the visibility
check succeeds.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-31 18:37:57 -07:00
Andres Freund 4bb50236eb tableam: Formatting and other minor cleanups.
The superfluous heapam_xlog.h includes were reported by Peter
Geoghegan.
2019-03-31 18:16:53 -07:00
Andres Freund 696d78469f tableam: Move heap specific logic from estimate_rel_size below tableam.
This just moves the table/matview[/toast] determination of relation
size to a callback, and uses a copy of the existing logic to implement
that callback for heap.

It probably would make sense to also move the index specific logic
into a callback, so the metapage handling (and probably more) can be
index specific. But that's a separate task.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-30 19:26:36 -07:00
Andres Freund 737a292b5d tableam: VACUUM and ANALYZE support.
This is a relatively straightforward move of the current
implementation to sit below tableam. As the current analyze sampling
implementation is pretty inherently block based, the tableam analyze
interface is as well. It might make sense to generalize that at some
point, but that seems like a larger project that shouldn't be
undertaken at the same time as the introduction of tableam.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-30 19:25:58 -07:00
Tom Lane 428b260f87 Speed up planning when partitions can be pruned at plan time.
Previously, the planner created RangeTblEntry and RelOptInfo structs
for every partition of a partitioned table, even though many of them
might later be deemed uninteresting thanks to partition pruning logic.
This incurred significant overhead when there are many partitions.
Arrange to postpone creation of these data structures until after
we've processed the query enough to identify restriction quals for
the partitioned table, and then apply partition pruning before, not
after, creation of each partition's data structures.  In this way
we need not open the partition relations at all for partitions that
the planner has no real interest in.

For queries that can be proven at plan time to access only a small
number of partitions, this patch improves the practical maximum
number of partitions from under 100 to perhaps a few thousand.

Amit Langote, reviewed at various times by Dilip Kumar, Jesper Pedersen,
Yoshikazu Imai, and David Rowley

Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
2019-03-30 18:58:55 -04:00
Peter Eisentraut fc22b6623b Generated columns
This is an SQL-standard feature that allows creating columns that are
computed from expressions rather than assigned, similar to a view or
materialized view but on a column basis.

This implements one kind of generated column: stored (computed on
write).  Another kind, virtual (computed on read), is planned for the
future, and some room is left for it.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com
2019-03-30 08:15:57 +01:00
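A minimal example of the stored kind:

```sql
CREATE TABLE people (
    height_cm numeric,
    height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);
INSERT INTO people (height_cm) VALUES (180);  -- height_in is computed
```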
Peter Geoghegan 9c7fb7e6d8 Tweak some nbtree-related code comments. 2019-03-29 12:29:05 -07:00
Tomas Vondra d85e0f366a Fix memory alignment in pg_mcv_list serialization
Blind attempt at fixing ia64, hppa and sparc builds.

The serialized representation of MCV lists did not enforce proper memory
alignment for internal fields, resulting in deserialization issues on
platforms that are more sensitive to this (ia64, sparc and hppa).

This forces a catalog version bump, because the layout of serialized
pg_mcv_list changes.

Broken since 7300a699.
2019-03-29 19:06:38 +01:00
Andres Freund ffa8444ce4 tableam: Comment fixes.
Author: Haribabu Kommi
Discussion: CAJrrPGeeYOqP3hkZyohDx_8dot4zvPuPMDBmhJ=iC85cTBNeYw@mail.gmail.com
2019-03-29 08:17:26 -07:00
Peter Eisentraut 5dc92b844e REINDEX CONCURRENTLY
This adds the CONCURRENTLY option to the REINDEX command.  A REINDEX
CONCURRENTLY on a specific index creates a new index (like CREATE
INDEX CONCURRENTLY), then renames the old index away and puts the new
index in its place, adjusts the dependencies, and then drops the old
index (like DROP INDEX CONCURRENTLY).  The REINDEX command also has
the capability to run its other variants (TABLE, DATABASE) with the
CONCURRENTLY option (but not SYSTEM).

The reindexdb command gets the --concurrently option.

Author: Michael Paquier, Andreas Karlsson, Peter Eisentraut
Reviewed-by: Andres Freund, Fujii Masao, Jim Nasby, Sergei Kornilov
Discussion: https://www.postgresql.org/message-id/flat/60052986-956b-4478-45ed-8bd119e9b9cf%402ndquadrant.com#74948a1044c56c5e817a5050f554ddee
2019-03-29 08:26:33 +01:00
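A usage sketch (object names hypothetical):

```sql
REINDEX INDEX CONCURRENTLY my_index;
REINDEX TABLE CONCURRENTLY my_table;
-- Shell equivalent: reindexdb --concurrently --table=my_table mydb
```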
Andres Freund d25f519107 tableam: relation creation, VACUUM FULL/CLUSTER, SET TABLESPACE.
This moves the responsibility for:
- creating the storage necessary for a relation, including creating a
  new relfilenode for a relation with existing storage
- non-transactional truncation of a relation
- VACUUM FULL / CLUSTER's rewrite of a table
below tableam.

This is fairly straightforward, with a bit of complexity smattered in
to move the computation of xid / multixid horizons below the AM, as
they don't make sense for every table AM.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-28 20:01:43 -07:00
Andres Freund 46bcd2af18 Fix a few comment copy & pastos. 2019-03-28 13:42:37 -07:00
Tomas Vondra 62bf0fb35c Fix deserialization of pg_mcv_list values
There were multiple issues in deserialization of pg_mcv_list values.

Firstly, the data is loaded from syscache, but the deserialization was
performed after ReleaseSysCache(), at which point the data might have
already disappeared.  Fixed by moving the calls in statext_mcv_load,
and using the same NULL-handling code as existing stats.

Secondly, the deserialized representation used pointers into the
serialized representation.  But that is also unsafe, because the data
may disappear at any time.  Fixed by reworking and simplifying the
deserialization code to always copy all the data.

And thirdly, when deserializing values for types passed by value, the
code simply did memcpy(d,s,typlen) which however does not work on
bigendian machines.  Fixed by using fetch_att/store_att_byval.
2019-03-28 20:03:14 +01:00
Thomas Munro ad308058cc Use FullTransactionId for the transaction stack.
Provide GetTopFullTransactionId() and GetCurrentFullTransactionId().
The intended users of these interfaces are access methods that use
xids for visibility checks but don't want to have to go back and
"freeze" existing references some time later before the 32 bit xid
counter wraps around.

Use a new struct to serialize the transaction state for parallel
query, because FullTransactionId doesn't fit into the previous
serialization scheme very well.

Author: Thomas Munro
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:24:43 +13:00
Thomas Munro 2fc7af5e96 Add basic infrastructure for 64 bit transaction IDs.
Instead of inferring epoch progress from xids and checkpoints,
introduce a 64 bit FullTransactionId type and use it to track xid
generation.  This fixes an unlikely bug where the epoch is reported
incorrectly if the range of active xids wraps around more than once
between checkpoints.

The only user-visible effect of this commit is to correct the epoch
used by txid_current() and txid_status(), also visible with
pg_controldata, in those rare circumstances.  It also creates some
basic infrastructure so that later patches can use 64 bit
transaction IDs in more places.

The new type is a struct that we pass by value, as a form of strong
typedef.  This prevents the sort of accidental confusion between
TransactionId and FullTransactionId that would be possible if we
were to use a plain old uint64.

Author: Thomas Munro
Reported-by: Amit Kapila
Reviewed-by: Andres Freund, Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:12:20 +13:00
Andres Freund 2a96909a4a tableam: Support for an index build's initial table scan(s).
To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM.  While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.

Currently the range-based interface wouldn't make much sense for
non-block-based table AMs. But that seems acceptable for now.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-27 19:59:06 -07:00
Tomas Vondra a63b29a1de Minor improvements for the multivariate MCV lists
The MCV build should always call get_mincount_for_mcv_list(), as
there is no other logic to decide whether the MCV list represents all
the data. So just remove the (ngroups > nitems) condition.

Also, when building MCV lists, the number of items was limited by the
statistics target (i.e. up to 10000). But when deserializing the MCV
list, a different value (8192) was used to check the input, causing
an error.  Simply ensure that the same value is used in both places.

This should have been included in 7300a69950, but I forgot to include it
in that commit.
2019-03-27 20:07:41 +01:00
Tomas Vondra 7300a69950 Add support for multivariate MCV lists
Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.

Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.

Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com
2019-03-27 18:32:18 +01:00
Tom Lane 333ed246c6 Avoid passing query tlist around separately from root->processed_tlist.
In the dim past, the planner kept the fully-processed version of the query
targetlist (the result of preprocess_targetlist) in grouping_planner's
local variable "tlist", and only grudgingly passed it to individual other
routines as needed.  Later we discovered a need to still have it available
after grouping_planner finishes, and invented the root->processed_tlist
field for that purpose, but it wasn't used internally to grouping_planner;
the tlist was still being passed around separately in the same places as
before.

Now comes a proposed patch to allow appendrel expansion to add entries
to the processed tlist, well after preprocess_targetlist has finished
its work.  To avoid having to pass around the tlist explicitly, it's
proposed to allow appendrel expansion to modify root->processed_tlist.
That makes aliasing the tlist with assorted parameters and local
variables really scary.  It would accidentally work as long as the
tlist is initially nonempty, because then the List header won't move
around, but it's not exactly hard to think of ways for that to break.
Aliased values are poor programming practice anyway.

Hence, get rid of local variables and parameters that can be identified
with root->processed_tlist, in favor of just using that field directly.
And adjust comments to match.  (Some of the new comments speak as though
it's already possible for appendrel expansion to modify the tlist; that's
not true yet, but will happen in a later patch.)

Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
2019-03-27 12:57:49 -04:00
Michael Paquier 5bde1651bb Switch function current_schema[s]() to be parallel-unsafe
When invoked for the first time in a session, current_schema() and
current_schemas() can finish by creating a temporary schema.  Currently
those functions are parallel-safe; however, if for one reason or
another they get launched across multiple parallel workers, they would
fail when attempting to create a temporary schema, as temporary
contexts are not supported in this case.

The original issue has been spotted by buildfarm members crake and
lapwing, after commit c5660e0 has introduced the first regression tests
based on current_schema() in the tree.  After that, 396676b has
introduced a workaround to avoid parallel plans but that was not
completely right either.

Catversion is bumped.

Author: Michael Paquier
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20190118024618.GF1883@paquier.xyz
2019-03-27 11:35:12 +09:00
Tomas Vondra 6ca015f9f0 Track unowned relations in doubly-linked list
Relations dropped in a single transaction are tracked in a list of
unowned relations.  With a large number of dropped relations this
resulted in poor performance at the end of a transaction, when the
relations are removed from the singly linked list one by one.

Commit b4166911 attempted to address this issue (particularly when it
happens during recovery) by removing the relations in a reverse order,
resulting in O(1) lookups in the list of unowned relations.  This did
not work reliably, though, and it was possible to trigger the O(N^2)
behavior in various ways.

Instead of trying to remove the relations in a specific order with
respect to the linked list, which seems rather fragile, switch to a
regular doubly linked list.  That allows us to remove relations
cheaply no matter where in the list they are.

As b4166911 was a bugfix, backpatched to all supported versions, do the
same thing here.

Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/80c27103-99e4-1d0c-642c-d9f3b94aaa0a%402ndquadrant.com
Backpatch-through: 9.4
2019-03-27 02:39:39 +01:00
Andres Freund 558a9165e0 Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to by the index looks
   like. That was easy enough before the introduction of tableam (we
   knew it had to be heap, although some trickery around logging the
   heap relfilenodes was required). But to properly handle table AMs
   we need per-database catalog access to look up the AM handler,
   which recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
   decoding on standbys. When on a catalog table, we need to be able
   to conflict with slots that have an xid horizon that's too old.
   Computing the horizon by visiting the heap only works once
   consistency is reached, but we always need to be able to detect
   conflicts.

There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there's no connected backends).

Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.

To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers.  As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency even when not globally enabled.  That's
possibly not the perfect formula, but seems good enough for now.

Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.

Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
    https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
    https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-26 16:52:54 -07:00
Tom Lane 53bcf5e3db Build "other rels" of appendrel baserels in a separate step.
Up to now, otherrel RelOptInfos were built at the same time as baserel
RelOptInfos, thanks to recursion in build_simple_rel().  However,
nothing in query_planner's preprocessing cares at all about otherrels,
only baserels, so we don't really need to build them until just before
we enter make_one_rel.  This has two benefits:

* create_lateral_join_info did a lot of extra work to propagate
lateral-reference information from parents to the correct children.
But if we delay creation of the children till after that, it's
trivial (and much harder to break, too).

* Since we have all the restriction quals correctly assigned to
parent appendrels by this point, it'll be possible to do plan-time
pruning and never make child RelOptInfos at all for partitions that
can be pruned away.  That's not done here, but will be later on.

Amit Langote, reviewed at various times by Dilip Kumar, Jesper Pedersen,
Yoshikazu Imai, and David Rowley

Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
2019-03-26 18:21:10 -04:00
Andres Freund 2ac1b2b175 Remove heap_hot_search().
After 71bdc99d0d, "tableam: Add helper for indexes to check if a
corresponding table tuples exist." there's no in-core user left. As
there's unlikely to be an external user, and such an external user
could easily be adjusted to use table_index_fetch_tuple_check(),
remove heap_hot_search().

Per complaint from Peter Geoghegan

Author: Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzn0Oq4ftJrTqRAsWy2WGjv0QrJcwoZ+yqWsF_Z5vjUBFw@mail.gmail.com
2019-03-25 19:04:41 -07:00
Andres Freund 2e3da03e9e tableam: Add table_get_latest_tid, to wrap heap_get_latest_tid.
This primarily is to allow WHERE CURRENT OF to continue to work as it
currently does. It's not clear to me that these semantics make sense
for every AM, but it works for the in-core heap, and the out of core
zheap. We can refine it further at a later point if necessary.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 17:14:48 -07:00
Andres Freund 71bdc99d0d tableam: Add helper for indexes to check if a corresponding table tuples exist.
This is, likely exclusively, useful to verify that conflicts detected
in a unique index are with live tuples, rather than dead ones.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 16:52:55 -07:00
Alvaro Herrera 25ee70511e pgbench: Remove \cset
Partial revert of commit 6260cc550b, "pgbench: add \cset and \gset
commands".

While \gset is widely considered a useful and necessary tool for user-
defined benchmarks, \cset does not have as much value, and its
implementation was considered "not to be up to project standards"
(though I, Álvaro, can't quite understand exactly how).  Therefore,
remove \cset.

Author: Fabien Coelho
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903230716030.18811@lancre
Discussion: https://postgr.es/m/201901101900.mv7zduch6sad@alvherre.pgsql
2019-03-25 12:16:07 -03:00
Robert Haas 6f97457e0d Add progress reporting for CLUSTER and VACUUM FULL.
This uses the same progress reporting infrastructure added in commit
c16dc1aca5 and extends it to these
additional cases.  We lack the ability to track the internal progress
of sorts and index builds so the information reported is
coarse-grained for some parts of the operation, but it still seems
like a significant improvement over having nothing at all.

Tatsuro Yamada, reviewed by Thomas Munro, Masahiko Sawada, Michael
Paquier, Jeff Janes, Alvaro Herrera, Rafia Sabih, and by me.  A fair
amount of polishing also by me.

Discussion: http://postgr.es/m/59A77072.3090401@lab.ntt.co.jp
2019-03-25 10:59:04 -04:00
Peter Eisentraut 481018f280 Add macro to cast away volatile without allowing changes to underlying type
This adds unvolatize(), which works just like unconstify() but for volatile.

Discussion: https://www.postgresql.org/message-id/flat/7a5cbea7-b8df-e910-0f10-04014bcad701%402ndquadrant.com
2019-03-25 09:37:03 +01:00
Andres Freund 9a8ee1dc65 tableam: Add and use table_fetch_row_version().
This is essentially the tableam version of heapam_fetch(),
i.e. fetching a tuple identified by a tid, performing visibility
checks.

Note that this is different from table_index_fetch_tuple(), which is for
index lookups. It therefore has to handle a tid pointing to an earlier
version of a tuple if the AM uses an optimization like heap's HOT. Add
comments to that end.

This commit removes the stats_relation argument from heap_fetch, as
it's been unused for a long time.

Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 00:17:59 -07:00
Peter Eisentraut 280a408b48 Transaction chaining
Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which
start new transactions with the same transaction characteristics as the
just finished one, per SQL standard.

Support for transaction chaining in PL/pgSQL is also added.  This
functionality is especially useful when running COMMIT in a loop in
PL/pgSQL.

Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645a5a@2ndquadrant.com
2019-03-24 11:33:02 +01:00
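A usage sketch:

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- ... work ...
COMMIT AND CHAIN;   -- the new transaction keeps REPEATABLE READ
-- ... more work ...
COMMIT;
```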
Andres Freund b2db277057 Remove spurious return.
Per buildfarm member anole.

Author: Andres Freund
2019-03-23 21:09:39 -07:00
Andres Freund 5db6df0c01 tableam: Add tuple_{insert, delete, update, lock} and use.
This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.

Previously both delete/update/lock call-sites and the EPQ mechanism had
to have awareness of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that finds the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one.  It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch().  Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split into two,
to avoid callers needing AM specific knowledge to differentiate.

The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possibly slightly less helpful for users, it seems
like an acceptable trade-off.

As part of this commit HTSU_Result has been renamed to TM_Result, and
its members have been expanded to differentiate between updating and
deleting. HeapUpdateFailureData has been renamed to TM_FailureData.

The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
table_complete_speculative(succeeded = false).

Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough to better be done separately.

Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.

Author: Andres Freund and Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-23 19:55:57 -07:00
Peter Geoghegan 29b64d1de7 Add nbtree high key "continuescan" optimization.
Teach nbtree forward index scans to check the high key before moving to
the right sibling page in the hope of finding that it isn't actually
necessary to do so.  The new check may indicate that the scan definitely
cannot find matching tuples to the right, ending the scan immediately.
We already opportunistically force a similar "continuescan orientated"
key check of the final non-pivot tuple when it's clear that it cannot be
returned to the scan due to being dead-to-all.  The new high key check
is complementary.

The new approach for forward scans is more effective than checking the
final non-pivot tuple, especially with composite indexes and non-unique
indexes.  The improvements to the logic for picking a split point added
by commit fab25024 make it likely that relatively dissimilar high keys
will appear on a page.  A distinguishing key value that can only appear
on non-pivot tuples on the right sibling page will often be present in
leaf page high keys.

Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans.  Backward scans continue to
opportunistically check the final non-pivot tuple, which is actually the
first non-pivot tuple on the page (not the last).

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
2019-03-23 11:01:53 -07:00
Peter Eisentraut 7b084b3831 Revert "Add gitignore entries for jsonpath_gram.h"
This reverts commit 4e274a043f.

These files aren't actually built anymore since 550b9d26f.
2019-03-23 00:19:34 +01:00
Peter Eisentraut 4e274a043f Add gitignore entries for jsonpath_gram.h 2019-03-22 23:19:30 +01:00
Heikki Linnakangas 7df159a620 Delete empty pages during GiST VACUUM.
To do this, we scan the GiST index twice. In the first pass we make
note of empty leaf pages and internal pages. In the second pass we
scan through the internal pages, looking for downlinks to the empty
pages.

Deleting internal pages is still not supported; as in nbtree, the last
child of an internal page is never deleted. That means that if you
have a workload where new keys are always inserted into a different
area than where old keys are removed, the index will still grow
without bound. But the rate of growth will be an order of magnitude
slower than before.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
2019-03-22 13:21:45 +02:00
Heikki Linnakangas df816f6ad5 Add IntegerSet, to hold large sets of 64-bit ints efficiently.
The set is implemented as a B-tree, with a compact representation for
leaf items using the Simple-8b algorithm, so that clusters of nearby
values use less memory.

The IntegerSet isn't used for anything yet, aside from the test code, but
we have two patches in the works that would benefit from this: A patch to
allow GiST vacuum to delete empty pages, and a patch to reduce heap
VACUUM's memory usage, by storing the list of dead TIDs more efficiently
and lifting the 1 GB limit on its size.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness as a regression test, but if you run
it manually, it can also print the memory usage and execution time of some
of the tests.

Author: Heikki Linnakangas, Andrey Borodin
Reviewed-by: Julien Rouhaud
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb729f@iki.fi
2019-03-22 13:21:45 +02:00
Peter Eisentraut 5e1963fb76 Collations with nondeterministic comparison
This adds a flag "deterministic" to collations.  If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal.  That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider.  At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.

The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support
  this new flag.

- Many executor nodes and auxiliary code are extended to track
  collations.  Previously, this code would just throw away collation
  information, because the eventually-called user-defined functions
  didn't use it since they only cared about equality, which didn't
  need collation information.

- String data type functions that do equality comparisons and hashing
  are changed to take the (non-)deterministic flag into account.  For
  comparison, this just means skipping various shortcuts and tie
  breakers that use byte-wise comparison.  For hashing, we first need
  to convert the input string to a canonical "sort key" using the ICU
  analogue of strxfrm().
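
As a hedged sketch of the new syntax (collation name illustrative, following
the documented ICU usage):

    CREATE COLLATION case_insensitive (
        provider = icu,
        locale = 'und-u-ks-level2',
        deterministic = false
    );
    -- byte-wise different strings can now compare as equal
    SELECT 'ABC' = 'abc' COLLATE case_insensitive;  -- true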

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
2019-03-22 12:12:43 +01:00
Alvaro Herrera 03ae9d59bd Catversion bump announced in previous commit but forgotten 2019-03-21 18:43:41 -03:00
Tom Lane bfb456c1b9 Improve error reporting for DROP FUNCTION/PROCEDURE/AGGREGATE/ROUTINE.
These commands allow the argument type list to be omitted if there is
just one object that matches by name.  However, if that syntax was
used with DROP IF EXISTS and there was more than one match, you got
a "function ... does not exist, skipping" notice message rather than a
truthful complaint about the ambiguity.  This was basically due to
poor factorization and a rat's nest of logic, so refactor the relevant
lookup code to make it cleaner.

Note that this amounts to narrowing the scope of which sorts of
error conditions IF EXISTS will bypass.  Per discussion, we only
intend it to skip no-such-object cases, not multiple-possible-matches
cases.
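
As a hedged illustration (the function name f is hypothetical):

    CREATE FUNCTION f(int) RETURNS int LANGUAGE sql AS 'SELECT $1';
    CREATE FUNCTION f(text) RETURNS text LANGUAGE sql AS 'SELECT $1';
    -- with the argument list omitted, both forms now complain about
    -- the ambiguity rather than reporting "does not exist, skipping"
    DROP FUNCTION f;
    DROP FUNCTION IF EXISTS f;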

Per bug #15572 from Ash Marath.  Although this definitely seems like
a bug, it's not clear that people would thank us for changing the
behavior in minor releases, so no back-patch.

David Rowley, reviewed by Julien Rouhaud and Pavel Stehule

Discussion: https://postgr.es/m/15572-ed1b9ed09503de8a@postgresql.org
2019-03-21 11:52:08 -04:00
Tom Lane 8aa9dd74b3 Sort the dependent objects before deletion in DROP OWNED BY.
This finishes a task we left undone in commit f1ad067fc, by extending
the delete-in-descending-OID-order rule to deletions triggered by
DROP OWNED BY.  We've coped with machine-dependent deletion orders
one time too many, and the new issues caused by Peter G's recent
nbtree hacking seem like the last straw.

Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org
2019-03-20 18:06:29 -04:00
Alvaro Herrera a6da004715 Add index_get_partition convenience function
This new function simplifies some existing coding, as well as supports
future patches.

Discussion: https://postgr.es/m/201901222145.t6wws6t6vrcu@alvherre.pgsql
Reviewed-by: Amit Langote, Jesper Pedersen
2019-03-20 18:18:50 -03:00
Peter Geoghegan fab2502433 Consider secondary factors during nbtree splits.
Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes.  Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
2019-03-20 10:12:19 -07:00
Peter Geoghegan dd299df818 Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 10:04:01 -07:00
Peter Geoghegan e5adcb789d Refactor nbtree insertion scankeys.
Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  This is based on a suggestion by Andrey
Lepikhov.

Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner.  This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.

The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc().  This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column.  Unique key binary
searches need to restore lower and upper bounds.  They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-03-20 09:30:57 -07:00
Alexander Korotkov 550b9d26f8 Get rid of jsonpath_gram.h and jsonpath_scanner.h
The jsonpath grammar and scanner are both quite small.  It isn't worth the
complexity of compiling them separately.  This commit makes the grammar and
scanner be compiled at once.  Therefore, jsonpath_gram.h and
jsonpath_scanner.h are no longer needed.  This commit also does some
reorganization of the code in jsonpath_gram.y.

Discussion: https://postgr.es/m/d47b2023-3ecb-5f04-d253-d557547cf74f%402ndQuadrant.com
2019-03-20 11:13:34 +03:00
Alexander Korotkov 641fde2523 Remove ambiguity for jsonb_path_match() and jsonb_path_exists()
There are 2-argument and 4-argument versions of jsonb_path_match() and
jsonb_path_exists().  But the 4-argument versions have optional 3rd and 4th
arguments, which leads to ambiguity.  At the same time, the 2-argument
versions are needed only for the @@ and @? operators.  So, rename the
2-argument versions to remove the ambiguity.
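
A sketch of the problem (values illustrative): with a 2-argument version
alongside a 4-argument version whose last two arguments have defaults, a
two-argument call could match either signature:

    SELECT jsonb_path_match('{"a": 1}'::jsonb, '$.a == 1'::jsonpath);
    -- ERROR:  function jsonb_path_match(jsonb, jsonpath) is not unique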

Catversion is bumped.
2019-03-20 10:30:56 +03:00
Alexander Korotkov 5e28b778bf Rename typedef in jsonpath_gram.y from "string" to "JsonPathString"
Reason is the same as in 75c57058b0.
2019-03-19 21:01:10 +03:00
Tom Lane 0dfe3d0ef5 Make checkpoint requests more robust.
Commit 6f6a6d8b1 introduced a delay of up to 2 seconds if we're trying
to request a checkpoint but the checkpointer hasn't started yet (or,
much less likely, our kill() call fails).  However, buildfarm experience
shows that that's not quite enough for slow or heavily-loaded machines.
There's no good reason to assume that the checkpointer won't start
eventually, so we may as well make the timeout much longer, say 60 sec.

However, if the caller didn't say CHECKPOINT_WAIT, it seems like a bad
idea to be waiting at all, much less for as long as 60 sec.  We can
remove the need for that, and make this whole thing more robust, by
adjusting the code so that the existence of a pending checkpoint
request is clear from the contents of shared memory, and making sure
that the checkpointer process will notice it at startup even if it did
not get a signal.  In this way there's no need for a non-CHECKPOINT_WAIT
call to wait at all; if it can't send the signal, it can nonetheless
assume that the checkpointer will eventually service the request.

A potential downside of this change is that "kill -INT" on the checkpointer
process is no longer enough to trigger a checkpoint, should anyone be
relying on something so hacky.  But there's no obvious reason to do it
like that rather than issuing a plain old CHECKPOINT command, so we'll
assume that nobody is.  There doesn't seem to be a way to preserve this
undocumented quasi-feature without introducing race conditions.

Since a principal reason for messing with this is to prevent intermittent
buildfarm failures, back-patch to all supported branches.

Discussion: https://postgr.es/m/27830.1552752475@sss.pgh.pa.us
2019-03-19 12:49:27 -04:00
Peter Eisentraut 28988a84cf Reorder LOCALLOCK structure members to compact the size
Save 8 bytes (on x86-64) by filling up padding holes.

Author: Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>
Discussion: https://www.postgresql.org/message-id/20190219001639.ft7kxir2iz644alf@alap3.anarazel.de
2019-03-19 14:07:08 +01:00
Andrew Gierth 01bde4fa4c Implement OR REPLACE option for CREATE AGGREGATE.
Aggregates have acquired a dozen or so optional attributes in recent
years for things like parallel query and moving-aggregate mode; the
lack of an OR REPLACE option to add or change these for an existing
agg makes extension upgrades gratuitously hard. Rectify.
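
A minimal sketch (aggregate definition illustrative):

    CREATE AGGREGATE my_sum(int4) (sfunc = int4pl, stype = int4);
    -- add an optional attribute later without dropping the aggregate
    CREATE OR REPLACE AGGREGATE my_sum(int4)
        (sfunc = int4pl, stype = int4, parallel = safe);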
2019-03-19 01:16:50 +00:00
Robert Haas 6776142a07 Revise parse tree representation for VACUUM and ANALYZE.
Like commit f41551f61f, this aims
to make it easier to add non-Boolean options to VACUUM (or, in
this case, to ANALYZE).  Instead of building up a bitmap of
options directly in the parser, build up a list of DefElem
objects and let ExecVacuum() sort it out; right now, we make
no use of the fact that a DefElem can carry an associated value,
but it will be easy to make that change in the future.

Masahiko Sawada

Discussion: http://postgr.es/m/CAD21AoATE4sn0jFFH3NcfUZXkU2BMbjBWB_kDj-XWYA-LXDcQA@mail.gmail.com
2019-03-18 15:14:52 -04:00
Robert Haas f41551f61f Fold vacuum's 'int options' parameter into VacuumParams.
Many places need both, so this allows a few functions to take one
fewer parameter.  More importantly, as soon as we add a VACUUM
option that takes a non-Boolean parameter, we need to replace
'int options' with a struct, and it seems better to think
of adding more fields to VacuumParams rather than passing around
both VacuumParams and a separate struct as well.

Patch by me, reviewed by Masahiko Sawada

Discussion: http://postgr.es/m/CA+Tgmob6g6-s50fyv8E8he7APfwCYYJ4z0wbZC2yZeSz=26CYQ@mail.gmail.com
2019-03-18 13:57:33 -04:00
Peter Eisentraut 1ffa59a85c Fix optimization of foreign-key on update actions
In RI_FKey_pk_upd_check_required(), we check among other things
whether the old and new key are equal, so that we don't need to run
cascade actions when nothing has actually changed.  This was using the
equality operator.  But the effect of this is that if a value in the
primary key is changed to one that "looks" different but compares as
equal, the update is not propagated.  (Examples are float -0 and 0 and
case-insensitive text.)  This appears to violate the SQL standard, and
it also behaves inconsistently if, in a multicolumn key, another column
is also updated in a way that causes the row to compare as not equal.

To fix, if we are looking at the PK table in ri_KeysEqual(), then do a
bytewise comparison similar to record_image_eq() instead of using the
equality operators.  This only makes a difference for ON UPDATE
CASCADE, but for consistency we treat all changes to the PK the same.  For
the FK table, we continue to use the equality operators.
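
A hedged example of the user-visible change (schema illustrative), using
float zero and minus zero, which compare as equal but differ byte-wise:

    CREATE TABLE pk (f float8 PRIMARY KEY);
    CREATE TABLE fk (f float8 REFERENCES pk ON UPDATE CASCADE);
    INSERT INTO pk VALUES (0);
    INSERT INTO fk VALUES (0);
    UPDATE pk SET f = '-0';  -- equal to 0, but a different binary image
    SELECT f FROM fk;        -- the cascade now fires: shows -0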

Discussion: https://www.postgresql.org/message-id/flat/3326fc2e-bc02-d4c5-e3e5-e54da466e89a@2ndquadrant.com
2019-03-18 17:19:21 +01:00
Michael Paquier 8b938d36f7 Refactor more code logic to update the control file
ce6afc6 has begun the refactoring work by plugging pg_rewind into a
central routine to update the control file, and left around two extra
copies, with one in xlog.c for the backend and one in pg_resetwal.c.  By
adding an extra option to the central routine in controldata_utils.c to
control whether a flush of the control file needs to be done, it proves
straightforward to make xlog.c and pg_resetwal.c use the central code
path, provided that the wait event tracking is moved there.  Hence, there
is now only one central code path to update the control file, removing
the duplicated code.

This refactoring actually fixes a problem in pg_resetwal.  Previously,
the control file was first removed before being recreated.  So if a crash
happened between the removal and the re-creation of the file, it would
have been possible to end up with no control file in the data folder.

Author: Fabien Coelho
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903170935210.2506@lancre
2019-03-18 12:59:35 +09:00
Alexander Korotkov 142c400d72 Fix make rules for jsonpath grammar making them similar to SQL grammar
Reported-by: Jeff Janes, Tom Lane
Discussion: https://postgr.es/m/CAMkU%3D1w1qBvoW82ZTFpAKae027R-2OHw-m6ALe0VQRNAFueBVA%40mail.gmail.com
2019-03-17 10:55:52 +03:00
Amit Kapila f27314ff9a Update copyright year in files added by 1bb5e78218. 2019-03-16 16:00:38 +05:30
Alexander Korotkov 16d489b0fe Numeric error suppression in jsonpath
Add support for numeric error suppression to jsonpath, as required by the
standard.  This commit doesn't use PG_TRY()/PG_CATCH() to implement that.
Instead, it provides internal versions of the numeric functions used, which
support error suppression.
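
For instance, under the standard's semantics an error inside a filter
expression yields Unknown rather than aborting the query, so the offending
item is simply filtered out (values illustrative):

    SELECT jsonb_path_query('[1, 0]'::jsonb, '$[*] ? (1 / @ > 0)');
    -- returns only 1; the division by zero for the 0 element is suppressed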

Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Alexander Korotkov, Nikita Glukhov
Reviewed-by: Tomas Vondra
2019-03-16 12:21:19 +03:00
Alexander Korotkov 72b6460336 Partial implementation of SQL/JSON path language
The SQL:2016 standard, among other things, contains a set of SQL/JSON
features for JSON processing inside a relational database.  The core of
SQL/JSON is the JSON path language, which allows accessing parts of JSON
documents and making computations over them.  This commit implements
partial support for the JSON path language as a separate datatype called
"jsonpath".  The implementation is partial because it lacks datetime
support and suppression of numeric errors.  The missing features will be
added later by separate commits.

Support of the SQL/JSON features requires the implementation of separate
nodes, and it will be considered in subsequent patches.  This commit
includes the following set of plain functions, allowing jsonpath to be
executed over jsonb values:

 * jsonb_path_exists(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_match(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_query(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_query_array(jsonb, jsonpath[, jsonb, bool]),
 * jsonb_path_query_first(jsonb, jsonpath[, jsonb, bool]).

This commit also implements "jsonb @? jsonpath" and "jsonb @@ jsonpath",
which are wrappers over jsonpath_exists(jsonb, jsonpath) and
jsonpath_predicate(jsonb, jsonpath), respectively.  These operators will
have index support (implemented in subsequent patches).
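
For illustration, a few hedged examples over a sample document:

    SELECT jsonb_path_exists('{"a": [1, 2, 3]}', '$.a[*] ? (@ > 2)');  -- true
    SELECT jsonb_path_query('{"a": [1, 2, 3]}', '$.a[*] ? (@ > 1)');   -- 2, 3
    SELECT '{"a": [1, 2, 3]}'::jsonb @? '$.a[*] ? (@ > 2)';            -- true
    SELECT '{"a": [1, 2, 3]}'::jsonb @@ '$.a[*] > 2';                  -- true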

Catversion bumped, to add new functions and operators.

Code was written by Nikita Glukhov and Teodor Sigaev, revised by me.
Documentation was written by Oleg Bartunov and Liudmila Mantrova.  The work
was inspired by Oleg Bartunov.

Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Nikita Glukhov, Teodor Sigaev, Alexander Korotkov, Oleg Bartunov, Liudmila Mantrova
Reviewed-by: Tomas Vondra, Andrew Dunstan, Pavel Stehule, Alexander Korotkov
2019-03-16 12:16:48 +03:00
Peter Eisentraut 893d6f8a1f Avoid casting away a const 2019-03-16 10:13:03 +01:00
Peter Eisentraut 69039fda83 Add walreceiver API to get remote server version
Add a separate walreceiver API function walrcv_server_version() to get
the version of the remote server, instead of doing it as part of
walrcv_identify_system().  This allows the server version to be
available even for uses that don't call IDENTIFY_SYSTEM, and it seems
cleaner anyway.

This is for an upcoming patch, not currently used.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/20190115071359.GF1433@paquier.xyz
2019-03-15 10:16:26 +01:00
Thomas Munro bb16aba50c Enable parallel query with SERIALIZABLE isolation.
Previously, the SERIALIZABLE isolation level prevented parallel query
from being used.  Allow the two features to be used together by
sharing the leader's SERIALIZABLEXACT with parallel workers.

An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to
share, and new logic is introduced to coordinate the early release
of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE
optimization, as follows:

The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by
some other transaction) will 'partially release' the SERIALIZABLEXACT,
meaning that the conflicts and locks it holds are released, but the
SERIALIZABLEXACT itself will remain active because other backends
might still have a pointer to it.

Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears
its own MySerializableXact variable and frees local resources so that
it can skip SSI checks for the rest of the transaction.  In the
special case of the leader process, it transfers the SERIALIZABLEXACT
to a new variable SavedSerializableXact, so that it can be completely
released at the end of the transaction after all workers have exited.

Remove the serializable_okay flag added to CreateParallelContext() by
commit 9da0cc35, because it's now redundant.
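
A small sketch of the user-visible effect (table name illustrative):

    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SET LOCAL max_parallel_workers_per_gather = 2;
    EXPLAIN SELECT count(*) FROM big_table;  -- may now use a parallel plan
    COMMIT;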

Author: Thomas Munro
Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner
Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com
2019-03-15 17:47:04 +13:00
Peter Eisentraut b13a913607 Add BKI_DEFAULT to pg_class.relrewrite
This column is always 0 on disk, so it doesn't have to be tracked
separately for each entry.
2019-03-14 21:25:39 +01:00
Peter Eisentraut c6ff0b892c Refactor ParamListInfo initialization
There were six copies of identical nontrivial code.  Put it into a
function.
2019-03-14 13:30:09 +01:00
Tom Lane 401b87a24f Sync commentary in transam.h and bki.sgml.
Commit a6417078c missed updating some comments in transam.h about
reservation of high OIDs for development purposes.  Also tamp down
an over-optimistic comment there about how easy it'd be to change
FirstNormalObjectId.

Earlier, commit 09568ec3d failed to update bki.sgml for the split
between genbki.pl-assigned OIDs and those assigned during initdb.

Also fix genbki.pl so that it will complain if it overruns
that split.  It's possible that doing so would have no very bad
consequences, but that's no excuse for not detecting it.
2019-03-14 00:23:40 -04:00
Thomas Munro c6c9474aaf Use condition variables to wait for checkpoints.
Previously we used a polling/sleeping loop to wait for checkpoints
to begin and end, which leads to up to a couple hundred milliseconds
of needless thumb-twiddling.  Use condition variables instead.

Author: Thomas Munro
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CA%2BhUKGLY7sDe%2Bbg1K%3DbnEzOofGoo4bJHYh9%2BcDCXJepb6DQmLw%40mail.gmail.com
2019-03-14 10:59:33 +13:00