Commit Graph

427 Commits

Author SHA1 Message Date
John Naylor 6974924347 Specialize tuplesort routines for different kinds of abbreviated keys
Previously, the specialized tuplesort routine inlined handling for
reverse-sort and NULLs-ordering but called the datum comparator via a
pointer in the SortSupport struct parameter. Testing has showed that we
can get a useful performance gain by specializing datum comparison for
the different representations of abbreviated keys -- signed and unsigned
64-bit integers and signed 32-bit integers. Almost all abbreviatable data
types will benefit -- the only exception for now is numeric, since the
datum comparison is more complex. The performance gain depends on data
type and input distribution, but often falls in the range of 10-20% faster.

Thomas Munro

Reviewed by Peter Geoghegan, review and performance testing by me

Discussion:
https://www.postgresql.org/message-id/CA%2BhUKGKKYttZZk-JMRQSVak%3DCXSJ5fiwtirFf%3Dn%3DPAbumvn1Ww%40mail.gmail.com
2022-04-02 15:22:25 +07:00
Peter Eisentraut 94aa7cc5f7 Add UNIQUE null treatment option
The SQL standard has been ambiguous about whether null values in
unique constraints should be considered equal or not.  Different
implementations have different behaviors.  In the SQL:202x draft, this
has been formalized by making this implementation-defined and adding
an option on unique constraint definitions UNIQUE [ NULLS [NOT]
DISTINCT ] to choose a behavior explicitly.

This patch adds this option to PostgreSQL.  The default behavior
remains UNIQUE NULLS DISTINCT.  Making this happen in the btree code
is pretty easy; most of the patch is just to carry the flag around to
all the places that need it.

The CREATE UNIQUE INDEX syntax extension is not from the standard,
it's my own invention.

I named all the internal flags, catalog columns, etc. in the negative
("nulls not distinct") so that the default PostgreSQL behavior is the
default if the flag is false.

Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/84e5ee1b-387e-9a54-c326-9082674bde78@enterprisedb.com
2022-02-03 11:48:21 +01:00
Bruce Momjian 27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Tom Lane a2ff18e89f Improve sift up/down code in binaryheap.c and logtape.c.
Borrow the logic that's long been used in tuplesort.c: instead
of physically swapping the data in two heap entries, keep the
value that's being sifted up or down in a local variable, and
just move the other values as necessary.  This makes the code
shorter as well as faster.  It's not clear that any current
callers are really time-critical enough to notice, but we
might as well code heap maintenance the same way everywhere.

Ma Liangzhu and Tom Lane

Discussion: https://postgr.es/m/17336-fc4e522d26a750fd@postgresql.org
2021-12-14 13:35:22 -05:00
Tom Lane 2de3c1015c Fix datatype confusion in logtape.c's right_offset().
This could only matter if (a) long is wider than int, and (b) the heap
of free blocks exceeds UINT_MAX entries, which seems pretty unlikely.
Still, it's a theoretical bug, so backpatch to v13 where the typo came
in (in commit c02fdc922).

In passing, also make swap_nodes() use consistent datatypes.

Ma Liangzhu

Discussion: https://postgr.es/m/17336-fc4e522d26a750fd@postgresql.org
2021-12-14 11:46:36 -05:00
Heikki Linnakangas 166f94377c Clarify the logic in a few places in the new balanced merge code.
In selectnewtape(), use 'nOutputTapes' rather than 'nOutputRuns' in the
check for whether to start a new tape or to append a new run to an
existing tape. Until 'maxTapes' is reached, nOutputTapes is always equal
to nOutputRuns, so it doesn't change the logic, but it seems more logical
to compare # of tapes with # of tapes. Also, currently maxTapes is never
modified after the merging begins, but written this way, the code would
still work if it was. (Although the nOutputRuns == nOutputTapes assertion
would need to be removed and using nOutputRuns % nOutputTapes to
distribute the runs evenly across the tapes wouldn't do a good job
anymore).

Similarly in mergeruns(), change to USEMEM(state->tape_buffer_mem) to
account for the memory used for tape buffers. It's equal to availMem
currently, but tape_buffer_mem is more direct and future-proof. For
example, if we changed the logic to only allocate half of the remaining
memory to tape buffers, USEMEM(state->tape_buffer_mem) would still be
correct.

Coverity complained about these. Hopefully this patch helps it to
understand the logic better. Thanks to Tom Lane for initial analysis.
2021-10-25 09:30:49 +03:00
Heikki Linnakangas fc0f3b4cb0 Fix parallel sort, broken by the balanced merge patch.
The code for initializing the tapes on each merge iteration was skipped
in a parallel worker. I put the !WORKER(state) check in wrong place while
rebasing the patch.

That caused failures in the index build in 'multiple-row-versions'
isolation test, in multiple buildfarm members. On my laptop it was easier
to reproduce by building an index on a larger table, so that you got a
parallel sort more reliably.
2021-10-18 20:42:10 +03:00
Heikki Linnakangas aa3ac6453b Fix duplicate typedef LogicalTape.
To make buildfarm member locust happy.
2021-10-18 17:02:01 +03:00
Heikki Linnakangas 0bd65a3905 Fix format modifier used in elog.
The previous commit 65014000b3 changed the variable passed to elog
from an int64 to a size_t variable, but neglected to change the modifier
in the format string accordingly.

Per failure on buildfarm member lapwing.
2021-10-18 16:15:44 +03:00
Heikki Linnakangas 65014000b3 Replace polyphase merge algorithm with a simple balanced k-way merge.
The advantage of polyphase merge is that it can reuse the input tapes as
output tapes efficiently, but that is irrelevant on modern hardware, when
we can easily emulate any number of tape drives. The number of input tapes
we can/should use during merging is limited by work_mem, but output tapes
that we are not currently writing to only cost a little bit of memory, so
there is no need to skimp on them.

This makes sorts that need multiple merge passes faster.

Discussion: https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83%40iki.fi
Reviewed-by: Peter Geoghegan, Zhihong Yu, John Naylor
2021-10-18 14:46:01 +03:00
Heikki Linnakangas c4649cce39 Refactor LogicalTapeSet/LogicalTape interface.
All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now
take a LogicalTape as argument, instead of LogicalTapeSet+tape number.
You can create any number of LogicalTapes in a single LogicalTapeSet, and
you don't need to decide the number upfront, when you create the tape set.

This makes the tape management in hash agg spilling in nodeAgg.c simpler.

Discussion: https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83%40iki.fi
Reviewed-by: Peter Geoghegan, Zhihong Yu, John Naylor
2021-10-18 14:46:01 +03:00
Amit Kapila 31c389d8de Optimize fileset usage in apply worker.
Use one fileset for the entire worker lifetime instead of using
separate filesets for each streaming transaction. Now, the
changes/subxacts files for every streaming transaction will be
created under the same fileset and the files will be deleted
after the transaction is completed.

This patch extends the BufFileOpenFileSet and BufFileDeleteFileSet
APIs to allow users to specify whether to give an error on missing
files.

Author: Dilip Kumar, based on suggestion by Thomas Munro
Reviewed-by: Hou Zhijie, Masahiko Sawada, Amit Kapila
Discussion: https://postgr.es/m/E1mCC6U-0004Ik-Fs@gemulon.postgresql.org
2021-09-02 08:13:46 +05:30
Amit Kapila dcac5e7ac1 Refactor sharedfileset.c to separate out fileset implementation.
Move fileset related implementation out of sharedfileset.c to allow its
usage by backends that don't want to share filesets among different
processes. After this split, fileset infrastructure is used by both
sharedfileset.c and worker.c for the named temporary files that survive
across transactions.

Author: Dilip Kumar, based on suggestion by Andres Freund
Reviewed-by: Hou Zhijie, Masahiko Sawada, Amit Kapila
Discussion: https://postgr.es/m/E1mCC6U-0004Ik-Fs@gemulon.postgresql.org
2021-08-30 08:48:15 +05:30
David Rowley 5bd38d2f28 Robustify tuplesort's free_sort_tuple function
41469253e went to the trouble of removing a theoretical bug from
free_sort_tuple by checking if the tuple was NULL before freeing it. Let's
make this a little more robust by also setting the tuple to NULL so that
should we be called again we won't end up doing a pfree on the already
pfree'd tuple. Per advice from Tom Lane.

Discussion: https://postgr.es/m/3188192.1626136953@sss.pgh.pa.us
Backpatch-through: 9.6, same as 41469253e
2021-07-13 13:27:05 +12:00
David Rowley 41469253e9 Fix theoretical bug in tuplesort
This fixes a theoretical bug in tuplesort.c which, if a bounded sort was
used in combination with a byval Datum sort (tuplesort_begin_datum), when
switching the sort to a bounded heap in make_bounded_heap(), we'd call
free_sort_tuple().  The problem was that when sorting Datums of a byval
type, the tuple is NULL and free_sort_tuple() would free the memory for it
regardless of that.  This would result in a crash.

Here we fix that simply by adding a check to see if the tuple is NULL
before trying to disassociate and free any memory belonging to it.

The reason this bug is only theoretical is that nowhere in the current
code base do we do tuplesort_set_bound() when performing a Datum sort.
However, let's backpatch a fix for this as if any extension uses the code
in this way then it's likely to cause problems.

Author: Ronan Dunklau
Discussion: https://postgr.es/m/CAApHDvpdoqNC5FjDb3KUTSMs5dg6f+XxH4Bg_dVcLi8UYAG3EQ@mail.gmail.com
Backpatch-through: 9.6, oldest supported version
2021-07-13 12:40:16 +12:00
Tom Lane def5b065ff Initial pgindent and pgperltidy run for v14.
Also "make reformat-dat-files".

The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
2021-05-12 13:14:10 -04:00
Thomas Munro 8eda3eba30 Use sort_template.h for qsort_tuple() and qsort_ssup().
Replace the Perl code previously used to generate specialized sort
functions with sort_template.h.

Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CA%2BhUKGJ2-eaDqAum5bxhpMNhvuJmRDZxB_Tow0n-gse%2BHG0Yig%40mail.gmail.com
2021-03-03 17:02:32 +13:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Tom Lane f859c2ffa0 Fix a few more generator scripts to produce pgindent-clean output.
This completes the project of making all our derived files be
pgindent-clean (or else explicitly excluded from indentation),
so that no surprises result when running pgindent in a built-out
development tree.

Discussion: https://postgr.es/m/79ed5348-be7a-b647-dd40-742207186a22@2ndquadrant.com
2020-09-21 13:58:26 -04:00
Heikki Linnakangas 16fa9b2b30 Add support for building GiST index by sorting.
This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by
one. The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.

The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.

As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
2020-09-17 11:33:40 +03:00
Jeff Davis c8aeaf3ab3 Change LogicalTapeSetBlocks() to use nBlocksWritten.
Previously, it was based on nBlocksAllocated to account for tapes with
open write buffers that may not have made it to the BufFile yet.

That was unnecessary, because callers do not need to get the number of
blocks while a tape has an open write buffer; and it also conflicted
with the preallocation logic added for HashAgg.

Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/ce5af05900fdbd0e9185747825a7423c48501964.camel@j-davis.com
Backpatch-through: 13
2020-09-15 21:42:25 -07:00
Peter Eisentraut 3e0242b24c Message fixes and style improvements 2020-09-14 06:42:30 +02:00
Jeff Davis 0758964963 logtape.c: do not preallocate for tapes when sorting
The preallocation logic is only useful for HashAgg, so disable it when
sorting.

Also, adjust an out-of-date comment.

Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzn_o7tE2+hRVvwSFghRb75AJ5g-nqGzDUqLYMexjOAe=g@mail.gmail.com
Backpatch-through: 13
2020-09-11 17:10:02 -07:00
Jeff Davis 0852006a94 Fix bogus MaxAllocSize check in logtape.c.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wz=NZPZc3-fkdmvu=w2itx0PiB-G6QpxHXZOjuvFAzPdZw@mail.gmail.com
Backpatch-through: 13
2020-09-04 12:09:52 -07:00
Amit Kapila 808e13b282 Extend the BufFile interface.
Allow BufFile to support temporary files that can be used by the single
backend when the corresponding files need to be survived across the
transaction and need to be opened and closed multiple times. Such files
need to be created as a member of a SharedFileSet.

Additionally, this commit implements the interface for BufFileTruncate to
allow files to be truncated up to a particular offset and extends the
BufFileSeek API to support the SEEK_END case. This also adds an option to
provide a mode while opening the shared BufFiles instead of always opening
in read-only mode.

These enhancements in BufFile interface are required for the upcoming
patch to allow the replication apply worker, to handle streamed
in-progress transactions.

Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-08-26 07:36:43 +05:30
Thomas Munro 7897e3bb90 Fix buffile.c error handling.
Convert buffile.c error handling to use ereport.  This fixes cases where
I/O errors were indistinguishable from EOF or not reported.  Also remove
"%m" from error messages where errno would be bogus.  While we're
modifying those strings, add block numbers and short read byte counts
where appropriate.

Back-patch to all supported releases.

Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
2020-06-16 16:59:07 +12:00
Tom Lane b5d69b7c22 pgindent run prior to branching v13.
pgperltidy and reformat-dat-files too, though those didn't
find anything to change.
2020-06-07 16:57:08 -04:00
Jeff Davis 1fbb6c93df Fix platform-specific performance regression in logtape.c.
Commit 24d85952 made a change that indirectly caused a performance
regression by triggering a change in the way GCC optimizes memcpy() on
some platforms.

The behavior seemed to contradict a GCC document, so I filed a report:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

This patch implements a narrow workaround which eliminates the
regression I observed. The workaround is benign enough that it seems
unlikely to cause a different regression on another platform.

Discussion: https://postgr.es/m/99b2eab335c1592c925d8143979c8e9e81e1575f.camel@j-davis.com
2020-06-07 09:25:55 -07:00
Jeff Davis 896ddf9b3c Avoid fragmentation of logical tapes when writing concurrently.
Disk-based HashAgg relies on writing to multiple tapes
concurrently. Avoid fragmentation of the tapes' blocks by
preallocating many blocks for a tape at once. No file operations are
performed during preallocation; only the block numbers are reserved.

Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/20200519151202.u2p2gpiawoaznsv2%40development
2020-05-26 16:49:43 -07:00
Tom Lane 5cbfce562f Initial pgindent and pgperltidy run for v13.
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.

Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
2020-05-14 13:06:50 -04:00
Tomas Vondra 1a40d37a9f Fix typos and improve incremental sort comments
Author: Justin Pryzby, James Coleman
Discussion: https://postgr.es/m/20200419023625.GP26953@telsasoft.com
2020-05-12 19:37:13 +02:00
Tom Lane 0da06d9faf Get rid of trailing semicolons in C macro definitions.
Writing a trailing semicolon in a macro is almost never the right thing,
because you almost always want to write a semicolon after each macro
call instead.  (Even if there was some reason to prefer not to, pgindent
would probably make a hash of code formatted that way; so within PG the
rule should basically be "don't do it".)  Thus, if we have a semi inside
the macro, the compiler sees "something;;".  Much of the time the extra
empty statement is harmless, but it could lead to mysterious syntax
errors at call sites.  In perhaps an overabundance of neatnik-ism, let's
run around and get rid of the excess semicolons whereever possible.

The only thing worse than a mysterious syntax error is a mysterious
syntax error that only happens in the back branches; therefore,
backpatch these changes where relevant, which is most of them because
most of these mistakes are old.  (The lack of reported problems shows
that this is largely a hypothetical issue, but still, it could bite
us in some future patch.)

John Naylor and Tom Lane

Discussion: https://postgr.es/m/CACPNZCs0qWTqJ2QUSGJ07B7uvAvzMb-KbG2q+oo+J3tsWN5cqw@mail.gmail.com
2020-05-01 17:28:00 -04:00
Jeff Davis 0cacb2b79d Fix missing pfree() in logtape.c, missed by 24d85952. 2020-04-19 10:33:06 -07:00
Michael Paquier 8128b0c152 Fix collection of typos and grammar mistakes in the tree, volume 2
This fixes some comments and documentation new as of Postgres 13, and is
a follow-up of the work done in dd0f37e.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20200408165653.GF2228@telsasoft.com
2020-04-14 14:45:43 +09:00
Andrew Dunstan 7be5d8df1f Use perl warnings pragma consistently
We've had a mixture of the warnings pragma, the -w switch on the shebang
line, and no warnings at all. This patch removes the -w swicth and add
the warnings pragma to all perl sources missing it. It raises the
severity of the TestingAndDebugging::RequireUseWarnings  perlcritic
policy to level 5, so that we catch any future violations.

Discussion: https://postgr.es/m/20200412074245.GB623763@rfd.leadboat.com
2020-04-13 11:55:45 -04:00
Tomas Vondra d2d8a229bc Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.

This has a number of benefits:

- Reduced memory consumption, because only a single group (determined by
  values in the sorted prefix) needs to be kept in memory. This may also
  eliminate the need to spill to disk.

- Lower startup cost, because Incremental Sort produce results after each
  prefix group, which is beneficial for plans where startup cost matters
  (like for example queries with LIMIT clause).

We consider both Sort and Incremental Sort, and decide based on costing.

The implemented algorithm operates in two different modes:

- Fetching a minimum number of tuples without check of equality on the
  prefix keys, and sorting on all columns when safe.

- Fetching all tuples for a single prefix group and then sorting by
  comparing only the remaining (non-prefix) keys.

We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.

This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.

There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.

Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:35:10 +02:00
Jeff Davis 24d85952a5 Introduce LogicalTapeSetExtend().
Increases the number of tapes in a logical tape set. This will be
important for disk-based hash aggregation, because the maximum number
of tapes is not known ahead of time.

While discussing this change, it was observed to regress the
performance of Sort for at least one test case. The performance
regression was because some versions of GCC switch to an inlined
version of memcpy() in LogicalTapeWrite() after this change. No
performance regression for clang was observed.

Because the regression is due to an arbitrary decision by the
compiler, I decided it shouldn't hold up this change. If it needs to
be fixed, we can find a workaround.

Author: Adam Lee, Jeff Davis
Discussion: https://postgr.es/m/e54bfec11c59689890f277722aaaabd05f78e22c.camel%40j-davis.com
2020-03-09 10:40:02 -07:00
Jeff Davis 8021985d79 logtape.c: allocate read buffer even for an empty tape.
Prior to this commit, the read buffer was allocated at the time the tape
was rewound; but as an optimization, would not be allocated at all if
the tape was empty.

That optimization meant that it was valid to have a rewound tape with
the buffer set to NULL, but only if a number of conditions were met
and only if the API was used properly. After 7fdd919a refactored the
code to support lazily-allocating the buffer, Coverity started
complaining.

The optimization for empty tapes doesn't seem important, so just
allocate the buffer whether the tape has any data or not.

Discussion: https://postgr.es/m/20351.1581868306%40sss.pgh.pa.us
2020-02-19 10:04:17 -08:00
Jeff Davis 7fdd919ae7 Logical Tape Set: lazily allocate read buffer.
The write buffer was already lazily-allocated, so this is more
symmetric. It also means that a freshly-rewound tape (whether for
reading or writing) is not consuming memory for the buffer.

Discussion: https://postgr.es/m/97c46a59c27f3c38e486ca170fcbc618d97ab049.camel%40j-davis.com
2020-02-13 10:44:25 -08:00
Jeff Davis c02fdc9223 Logical Tape Set: use min heap for freelist.
Previously, the freelist of blocks was tracked as an
occasionally-sorted array. A min heap is more resilient to larger
freelists or more frequent changes between reading and writing.

Discussion: https://postgr.es/m/97c46a59c27f3c38e486ca170fcbc618d97ab049.camel%40j-davis.com
2020-02-06 10:09:45 -08:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Andres Freund 7d962eaf50 Remove unused code from tuplesort.
copytup_index() is unused, as tuplesort_putindextuplevalues() doesn't
use COPYTUP(). Replace function body with an elog(ERROR), as already
done e.g. for copytup_datum().

Author: Andres Freund
Discussion: https://postgr.es/m/20191013144153.ooxrfglvnaocsrx2@alap3.anarazel.de
2019-11-13 15:57:01 -08:00
Andres Freund 01368e5d9d Split all OBJS style lines in makefiles into one-line-per-entry style.
When maintaining or merging patches, one of the most common sources
for conflicts are the list of objects in makefiles. Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempting to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.

By splitting, and alphabetically sorting, OBJS style lines into one
object per line, conflicts should be less frequent, and easier to
resolve when they still occur.

Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
2019-11-05 14:41:07 -08:00
Andres Freund cef82eda14 Fix CLUSTER on expression indexes.
Since the introduction of different slot types, in 1a0586de36, we
create a virtual slot in tuplesort_begin_cluster(). While that looks
right, it unfortunately doesn't actually work, as ExecStoreHeapTuple()
is used to store tuples in the slot. Unfortunately no regression tests
for CLUSTER on expression indexes existed so far.

Fix the slot type, and add bare bones tests for CLUSTER on expression
indexes.

Reported-By: Justin Pryzby
Author: Andres Freund
Discussion: https://postgr.es/m/20191011210320.GS10470@telsasoft.com
Backpatch: 12, like 1a0586de36
2019-10-15 10:40:13 -07:00
Tom Lane b360e0fcd7 Make tuplesort_set_bound() assertions more comprehensible, hopefully.
Add the comments that I griped were missing.  Also re-order tests
so that parallelism-related tests aren't randomly separated from
each other.

Discussion: https://postgr.es/m/CAAaqYe9GD__4Crm=ddz+-XXcNhfY_V5gFYdLdmkFNq=2VHO56Q@mail.gmail.com
2019-09-13 16:57:07 -04:00
Alvaro Herrera bc98e1ea64 Merge two assertions to make comment clearer
Authored by Tom Lane, after a gripe from James Coleman.

Discussion: https://postgr.es/m/CAAaqYe9GD__4Crm=ddz+-XXcNhfY_V5gFYdLdmkFNq=2VHO56Q@mail.gmail.com
2019-09-12 10:37:04 -03:00
Michael Paquier 66bde49d96 Fix inconsistencies and typos in the tree, take 10
This addresses some issues with unnecessary code comments, fixes various
typos in docs and comments, and removes some orphaned structures and
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com
2019-08-13 13:53:41 +09:00
Peter Geoghegan d8cd68c8d4 Rename tuplesort.c's SortTuple.tupindex field.
Rename the "tupindex" field from tuplesort.c's SortTuple struct to
"srctape", since it can only ever be used to store a source/input tape
number when merging external sort runs.  This has been the case since
commit 8b304b8b72, which removed replacement selection sort from
tuplesort.c.
2019-08-09 17:06:45 -07:00
Peter Geoghegan 28b901f73a Update obsolete tuplesort READTUP() comment.
READTUP() routines do not and cannot use the resettable "tuplecontext"
memory context, since it is deleted when merging begins.  Update an
obsolete comment that claimed otherwise.  This was an oversight in
commit e94568ecc1.

In passing, fix an unrelated tuplesort typo.
2019-08-08 13:20:44 -07:00
Alvaro Herrera e1f4c481b9 Remove unnecessary #include <limits.h>
This include was probably copied from tuplestore.c, but it's not needed.

Extracted from a larger patch submitted by vignesh C <vignesh21@gmail.com>

Discussion: https://postgr.es/m/CALDaNm1B9naPDTm3ox1m_yZvOm3KA5S4kZQSWWAeLHAQ=3gV1Q@mail.gmail.com
2019-08-07 16:55:31 -04:00
Michael Paquier 3412030205 Fix more typos and inconsistencies in the tree
Author: Alexander Lakhin
Discussion: https://postgr.es/m/0a5419ea-1452-a4e6-72ff-545b1a5a8076@gmail.com
2019-06-17 16:13:16 +09:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Peter Geoghegan dd299df818 Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 10:04:01 -07:00
Peter Geoghegan e5adcb789d Refactor nbtree insertion scankeys.
Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  This is based on a suggestion by Andrey
Lepikhov.

Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner.  This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.

The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc().  This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column.  Unique key binary
searches need to restore lower and upper bounds.  They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-03-20 09:30:57 -07:00
Alvaro Herrera af38498d4c Move hash_any prototype from access/hash.h to utils/hashutils.h
... as well as its implementation from backend/access/hash/hashfunc.c to
backend/utils/hash/hashfn.c.

access/hash is the place for the hash index AM, not really appropriate
for generic facilities, which is what hash_any is; having things the old
way meant that anything using hash_any had to include the AM's include
file, pointlessly polluting its namespace with unrelated, unnecessary
cruft.

Also move the HTEqual strategy number to access/stratnum.h from
access/hash.h.

To avoid breaking third-party extension code, add an #include
"utils/hashutils.h" to access/hash.h.  (An easily removed line by
committers who enjoy their asbestos suits to protect them from angry
extension authors.)

Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql
2019-03-11 13:17:50 -03:00
Andres Freund a9c35cf85c Change function call information to be variable length.
Before this change FunctionCallInfoData, the struct arguments etc for
V1 function calls are stored in, always had space for
FUNC_MAX_ARGS/100 arguments, storing datums and their nullness in two
arrays.  For nearly every function call 100 arguments is far more than
needed, therefore wasting memory. Arg and argnull being two separate
arrays also guarantees that to access a single argument, two
cachelines have to be touched.

Change the layout so there's a single variable-length array with pairs
of value / isnull. That drastically reduces memory consumption for
most function calls (on x86-64 a two argument function now uses
64bytes, previously 936 bytes), and makes it very likely that argument
value and its nullness are on the same cacheline.

Arguments are stored in a new NullableDatum struct, which, due to
padding, needs more memory per argument than before. But as usually
far fewer arguments are stored, and individual arguments are cheaper
to access, that's still a clear win.  It's likely that there's other
places where conversion to NullableDatum arrays would make sense,
e.g. TupleTableSlots, but that's for another commit.

Because the function call information is now variable-length
allocations have to take the number of arguments into account. For
heap allocations that can be done with SizeForFunctionCallInfoData(),
for on-stack allocations there's a new LOCAL_FCINFO(name, nargs) macro
that helps to allocate an appropriately sized and aligned variable.

Some places with stack allocation function call information don't know
the number of arguments at compile time, and currently variably sized
stack allocations aren't allowed in postgres. Therefore allow for
FUNC_MAX_ARGS space in these cases. They're not that common, so for
now that seems acceptable.

Because of the need to allocate FunctionCallInfo of the appropriate
size, older extensions may need to update their code. To avoid subtle
breakages, the FunctionCallInfoData struct has been renamed to
FunctionCallInfoBaseData. Most code only references FunctionCallInfo,
so that shouldn't cause much collateral damage.

This change is also a prerequisite for more efficient expression JIT
compilation (by allocating the function call information on the stack,
allowing LLVM to optimize it away); previously the size of the call
information caused problems inside LLVM's optimizer.

Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20180605172952.x34m5uz6ju6enaem@alap3.anarazel.de
2019-01-26 14:17:52 -08:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Peter Geoghegan 1a990b207b Have BufFileSize() ereport() on FileSize() failure.
Move the responsibility for checking for and reporting a failure from
the only current BufFileSize() caller, logtape.c, to BufFileSize()
itself.  Code within buffile.c is generally responsible for interfacing
with fd.c to report irrecoverable failures.  This seems like a
convention that's worth sticking to.

Reorganizing things this way makes it easy to make the error message
raised in the event of BufFileSize() failure descriptive of the
underlying problem.  We're now clear on the distinction between
temporary file name and BufFile name, and can show errno, confident that
its value actually relates to the error being reported.  In passing, an
existing, similar buffile.c ereport() + errcode_for_file_access() site
is changed to follow the same conventions.

The API of the function BufFileSize() is changed by this commit, despite
already being in a stable release (Postgres 11).  This seems acceptable,
since the BufFileSize() ABI was changed by commit aa55183042, which
hasn't made it into a point release yet.  Besides, it's difficult to
imagine a third party BufFileSize() caller not just raising an error
anyway, since BufFile state should be considered corrupt when
BufFileSize() fails.

Per complaint from Tom Lane.

Discussion: https://postgr.es/m/26974.1540826748@sss.pgh.pa.us
Backpatch: 11-, where shared BufFiles were introduced.
2018-11-28 14:42:54 -08:00
Andres Freund 1a0586de36 Introduce notion of different types of slots (without implementing them).
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.

Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots.  Thus different types of
slots are needed, which requires adapting creators of slots.

The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.

Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.

But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.

Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().

The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.

This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit.  The different types of slots introduced will, for now, still
use the same backing implementation.

While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.

Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 22:00:30 -08:00
Thomas Munro aa55183042 Use 64 bit type for BufFileSize().
BufFileSize() can't use off_t, because it's only 32 bits wide on
some systems.  BufFile objects can have many 1GB segments so the
total size can exceed 2^31.  The only known client of the function
is parallel CREATE INDEX, which was reported to fail when building
large indexes on Windows.

Though this is technically an ABI break on platforms with a 32 bit
off_t and we might normally avoid back-patching it, the function is
brand new and thus unlikely to have been discovered by extension
authors yet, and it's fairly thoroughly broken on those platforms
anyway, so just fix it.

Defect in 9da0cc35.  Bug #15460.  Back-patch to 11, where this
function landed.

Author: Thomas Munro
Reported-by: Paul van der Linden, Pavel Oskin
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/15460-b6db80de822fa0ad%40postgresql.org
Discussion: https://postgr.es/m/CAHDGBJP_GsESbTt4P3FZA8kMUKuYxjg57XHF7NRBoKnR%3DCAR-g%40mail.gmail.com
2018-11-15 13:13:57 +13:00
Peter Geoghegan cb6f8a9a72 Adjust trace_sort log messages.
The project message style guide dictates: "When citing the name of an
object, state what kind of object it is".  The parallel CREATE INDEX
patch added a worker number to most of the trace_sort messages within
tuplesort.c without specifying the object type.  Bring these messages
into compliance with the style guide.

We're still treating a leader or serial Tuplesortstate as having worker
number -1.  trace_sort is a developer option, and these two cases are
highly comparable, so this seems appropriate.

Per complaint from Tom Lane.

Discussion: https://postgr.es/m/8330.1540831863@sss.pgh.pa.us
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-11-01 09:18:57 -07:00
Andres Freund 29c94e03c7 Split ExecStoreTuple into ExecStoreHeapTuple and ExecStoreBufferHeapTuple.
Upcoming changes introduce further types of tuple table slots, in
preparation of making table storage pluggable. New storage methods
will have different representation of tuples, therefore the slot
accessor should refer explicitly to heap tuples.

Instead of just renaming the functions, split it into one function
that accepts heap tuples not residing in buffers, and one accepting
ones in buffers.  Previously one function was used for both, but that
was a bit awkward already, and splitting will allow us to represent
slot types for tuples in buffers and normal memory separately.

This is split out from the patch introducing abstract slots, as this
largely consists out of mechanical changes.

Author: Ashutosh Bapat
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
2018-09-25 16:27:48 -07:00
Tom Lane 44cac93464 Avoid using potentially-under-aligned page buffers.
There's a project policy against using plain "char buf[BLCKSZ]" local
or static variables as page buffers; preferred style is to palloc or
malloc each buffer to ensure it is MAXALIGN'd.  However, that policy's
been ignored in an increasing number of places.  We've apparently got
away with it so far, probably because (a) relatively few people use
platforms on which misalignment causes core dumps and/or (b) the
variables chance to be sufficiently aligned anyway.  But this is not
something to rely on.  Moreover, even if we don't get a core dump,
we might be paying a lot of cycles for misaligned accesses.

To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock
that the compiler must allocate with sufficient alignment, and use
those in place of plain char arrays.

I used these types even for variables where there's no risk of a
misaligned access, since ensuring proper alignment should make
kernel data transfers faster.  I also changed some places where
we had been palloc'ing short-lived buffers, for coding style
uniformity and to save palloc/pfree overhead.

Since this seems to be a live portability hazard (despite the lack
of field reports), back-patch to all supported versions.

Patch by me; thanks to Michael Paquier for review.

Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
2018-09-01 15:27:17 -04:00
Alexander Korotkov edf59c40dd Fix more wrong paths in header comments
It appears that there are more files, whose header comment paths are
wrong.  So, fix those paths.  No backpatching per proposal of Tom Lane.

Discussion: https://postgr.es/m/CAPpHfdsJyYbOj59MOQL%2B4XxdcomLSLfLqBtAvwR%2BpsCqj3ELdQ%40mail.gmail.com
2018-07-11 17:57:04 +03:00
Peter Geoghegan aefb0a382c Correct a comment on logtape.c's leader tape.
randomAccess parallel tuplesorts are disallowed because the leader would
try to write to its own leader tape, not because the leader would try to
write to a worker tape directly.

Cleanup from commit 9da0cc3528.
2018-06-26 11:16:20 -07:00
Andrew Dunstan 3a7cc727c7 Don't fall off the end of perl functions
This complies with the perlcritic policy
Subroutines::RequireFinalReturn, which is a severity 4 policy. Since we
only currently check at severity level 5, the policy is raised to that
level until we move to level 4 or lower, so that any new infringements
will be caught.

A small cosmetic piece of tidying of the pgperlcritic script is
included.

Mike Blackwell

Discussion: https://postgr.es/m/CAESHdJpfFm_9wQnQ3koY3c91FoRQsO-fh02za9R3OEMndOn84A@mail.gmail.com
2018-05-27 09:08:42 -04:00
Heikki Linnakangas 445e31bdc7 Fix some sloppiness in the new BufFileSize() and BufFileAppend() functions.
There were three related issues:

* BufFileAppend() incorrectly reset the seek position on the 'source' file.
  As a result, if you had called BufFileRead() on the file before calling
  BufFileAppend(), it got confused, and subsequent calls would read/write
  at wrong position.

* BufFileSize() did not work with files opened with BufFileOpenShared().

* FileGetSize() only worked on temporary files.

To fix, change the way BufFileSize() works so that it works on shared
files. Remove FileGetSize() altogether, as it's no longer needed. Remove
buffilesize from TapeShare struct, as the leader process can simply call
BufFileSize() to get the tape's size, there's no need to pass it through
shared memory anymore.

Discussion: https://www.postgresql.org/message-id/CAH2-WznEDYe_NZXxmnOfsoV54oFkTdMy7YLE2NPBLuttO96vTQ@mail.gmail.com
2018-05-02 17:23:13 +03:00
Tom Lane 41c912cad1 Clean up warnings from -Wimplicit-fallthrough.
Recent gcc can warn about switch-case fall throughs that are not
explicitly labeled as intentional.  This seems like a good thing,
so clean up the warnings exposed thereby by labeling all such
cases with comments that gcc will recognize.

In files that already had one or more suitable comments, I generally
matched the existing style of those.  Otherwise I went with
/* FALLTHROUGH */, which is one of the spellings approved at the
more-restrictive-than-default level -Wimplicit-fallthrough=4.
(At the default level you can also spell it /* FALL ?THRU */,
and it's not picky about case.  What you can't do is include
additional text in the same comment, so some existing comments
containing versions of this aren't good enough.)

Testing with gcc 8.0.1 (Fedora 28's current version), I found that
I also had to put explicit "break"s after elog(ERROR) or ereport(ERROR);
apparently, for this purpose gcc doesn't recognize that those don't
return.  That seems like possibly a gcc bug, but it's fine because
in most places we did that anyway; so this amounts to a visit from the
style police.

Discussion: https://postgr.es/m/15083.1525207729@sss.pgh.pa.us
2018-05-01 19:35:08 -04:00
Heikki Linnakangas 45f87b7710 Remove outdated comment on how to set logtape's read buffer size.
Commit b75f467b6e removed the LogicalTapeAssignReadBufferSize() function,
but forgot to update this comment. The read buffer size is an argument to
LogicalTapeRewindForRead() now. Doesn't seem worth going into the details
in the file header comment, so remove the outdated sentence altogether.
2018-04-27 09:31:43 +03:00
Tom Lane bdf46af748 Post-feature-freeze pgindent run.
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-26 14:47:16 -04:00
Teodor Sigaev c9c875a28f Rename IndexInfo.ii_KeyAttrNumbers array
Rename ii_KeyAttrNumbers to ii_IndexAttrNumbers to prevent confusion with
ii_NumIndexAttrs/ii_NumIndexKeyAttrs. ii_IndexAttrNumbers contains
all attributes including "including" columns, not only key attribute.

Discussion: https://www.postgresql.org/message-id/13123421-1d52-d0e4-c95c-6d69011e0595%40sigaev.ru
2018-04-12 13:02:45 +03:00
Teodor Sigaev 8224de4f42 Indexes with INCLUDE columns and their support in B-tree
This patch introduces INCLUDE clause to index definition.  This clause
specifies a list of columns which will be included as a non-key part in
the index.  The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans.  Also, such columns don't need to have
appropriate operator classes.  Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.

Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine.  For now, only B-tree indexes support INCLUDE clause.

In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys).  Therefore, B-tree indexes
now might have variable number of attributes.  This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid.  Free 13th bit of t_info is used for indicating
that.  This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.

Bump catalog version

Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
			 David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
2018-04-07 23:00:39 +03:00
Tom Lane 0b11a674fb Fix a boatload of typos in C comments.
Justin Pryzby

Discussion: https://postgr.es/m/20180331105640.GK28454@telsasoft.com
2018-04-01 15:01:28 -04:00
Tom Lane c2d4eb1b1f Fix actual and potential double-frees around tuplesort usage.
tuplesort_gettupleslot() passed back tuples allocated in the tuplesort's
own memory context, even when the caller was responsible to free them.
This created a double-free hazard, because some callers might destroy
the tuplesort object (via tuplesort_end) before trying to clean up the
last returned tuple.  To avoid this, change the API to specify that the
tuple is allocated in the caller's memory context.  v10 and HEAD already
did things that way, but in 9.5 and 9.6 this is a live bug that can
demonstrably cause crashes with some grouping-set usages.

In 9.5 and 9.6, this requires doing an extra tuple copy in some cases,
which is unfortunate.  But the amount of refactoring needed to avoid it
seems excessive for a back-patched change, especially since the cases
where an extra copy happens are less performance-critical.

Likewise change tuplesort_getdatum() to return pass-by-reference Datums
in the caller's context not the tuplesort's context.  There seem to be
no live bugs among its callers, but clearly the same sort of situation
could happen in future.

For other tuplesort fetch routines, continue to allocate the memory in
the tuplesort's context.  This is a little inconsistent with what we now
do for tuplesort_gettupleslot() and tuplesort_getdatum(), but that's
preferable to adding new copy overhead in the back branches where it's
clearly unnecessary.  These other fetch routines provide the weakest
possible guarantees about tuple memory lifespan from v10 on, anyway,
so this actually seems more consistent overall.

Adjust relevant comments to reflect these API redefinitions.

Arguably, we should change the pre-9.5 branches as well, but since
there are no known failure cases there, it seems not worth the risk.

Peter Geoghegan, per report from Bernd Helmle.  Reviewed by Kyotaro
Horiguchi; thanks also to Andreas Seltenreich for extracting a
self-contained test case.

Discussion: https://postgr.es/m/1512661638.9720.34.camel@oopsware.de
2018-03-28 13:26:57 -04:00
Robert Haas de6428afe1 Avoid another valgrind complaint about write() of uninitalized bytes.
Peter Geoghegan, per buildfarm member skink and Andres Freund

Discussion: http://postgr.es/m/20180221053426.gp72lw67yfpzkw7a@alap3.anarazel.de
2018-02-22 09:28:12 -05:00
Tom Lane 9a725f7b5c Silence assorted "variable may be used uninitialized" warnings.
All of these are false positives, but in each case a fair amount of
analysis is needed to see that, and it's not too surprising that not all
compilers are smart enough.  (In particular, in the logtape.c case, a
compiler lacking the knowledge provided by the Assert would almost surely
complain, so that this warning will be seen in any non-assert build.)

Some of these are of long standing while others are pretty recent,
but it only seems worth fixing them in HEAD.

Jaime Casanova, tweaked a bit by me

Discussion: https://postgr.es/m/CAJGNTeMcYAMJdPAom52dppLMtF-UnEZi0dooj==75OEv1EoBZA@mail.gmail.com
2018-02-14 16:06:49 -05:00
Robert Haas 9fafa413ac Avoid valgrind complaint about write() of uninitalized bytes.
LogicalTapeFreeze() may write out its first block when it is dirty but
not full, and then immediately read the first block back in from its
BufFile as a BLCKSZ-width block.  This can only occur in rare cases
where very few tuples were written out, which is currently only
possible with parallel external tuplesorts.  To avoid valgrind
complaints, tell it to treat the tail of logtape.c's buffer as
defined.

Commit 9da0cc3528 exposed this problem
but did not create it.  LogicalTapeFreeze() has always tended to write
out some amount of garbage bytes, but previously never wrote less than
one block of data in total, so the problem was masked.

Per buildfarm members lousyjack and skink.

Peter Geoghegan, based on a suggestion from Tom Lane and me.  Some
comment revisions by me.
2018-02-06 14:24:57 -05:00
Robert Haas 9da0cc3528 Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds.  Testing
to date shows that this can often be 2-3x faster than a serial
index build.

The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature.  We can
refine it as we get more experience.

Peter Geoghegan with some help from Rushabh Lathia.  While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature.  Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.

Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 13:32:44 -05:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Andres Freund ab9e0e718a Add shared tuplestores.
SharedTuplestore allows multiple participants to write into it and
then read the tuples back from it in parallel.  Each reader receives
partial results.

For now it always uses disk files, but other buffering policies and
other kinds of scans (ie each reader receives complete results) may be
useful in future.

The upcoming parallel hash join feature will use this facility.

Author: Thomas Munro
Reviewed-By: Peter Geoghegan, Andres Freund, Robert Haas
Discussion: https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
2017-12-18 14:23:19 -08:00
Robert Haas 95b52351fe Remove obsolete comment.
Commit 8b304b8b72 removed replacement
selection, but left behind this comment text.  The optimization to
which the comment refers is not relevant without replacement
selection, because if we had so few tuples as to require only one
tape, we would have just completed the sort in memory.

Peter Geoghegan

Discussion: http://postgr.es/m/CAH2-WznqupLA8CMjp+vqzoe0yXu0DYYbQSNZxmgN76tLnAOZ_w@mail.gmail.com
2017-12-12 19:33:50 -05:00
Robert Haas eaedf0df71 Update typedefs.list and re-run pgindent
Discussion: http://postgr.es/m/CA+TgmoaA9=1RWKtBWpDaj+sF3Stgc8sHgf5z=KGtbjwPLQVDMA@mail.gmail.com
2017-11-29 09:24:24 -05:00
Peter Eisentraut 2eb4a831e5 Change TRUE/FALSE to true/false
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources.  The upper case spellings are only used
in some files/modules.  So standardize on the standard spellings.

The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.

In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
2017-11-08 11:37:28 -05:00
Robert Haas 8b304b8b72 Remove replacement selection sort.
At the time replacement_sort_tuples was introduced, there were still
cases where replacement selection sort noticeably outperformed using
quicksort even for the first run.  However, those cases seem to have
evaporated as a result of further improvements made since that time
(and perhaps also advances in CPU technology).  So remove replacement
selection and the controlling GUC entirely.  This makes tuplesort.c
noticeably simpler and probably paves the way for further
optimizations someone might want to do later.

Peter Geoghegan, with review and testing by Tomas Vondra and me.

Discussion: https://postgr.es/m/CAH2-WzmmNjG_K0R9nqYwMq3zjyJJK+hCbiZYNGhAy-Zyjs64GQ@mail.gmail.com
2017-09-29 10:25:44 -04:00
Robert Haas bf11e7ee2e Propagate sort instrumentation from workers back to leader.
Up until now, when parallel query was used, no details about the
sort method or space used by the workers were available; details
were shown only for any sorting done by the leader.  Fix that.

Commit 1177ab1dab forced the test case
added by commit 1f6d515a67 to run
without parallelism; now that we have this infrastructure, allow
that again, with a little tweaking to make it pass with and without
force_parallel_mode.

Robert Haas and Tom Lane

Discussion: http://postgr.es/m/CA+Tgmoa2VBZW6S8AAXfhpHczb=Rf6RqQ2br+zJvEgwJ0uoD_tQ@mail.gmail.com
2017-08-29 13:26:33 -04:00
Alvaro Herrera f5d54ef97a Fix typo in comment 2017-08-08 18:34:25 -04:00
Tom Lane 512f67c8d0 Avoid integer overflow while sifting-up a heap in tuplesort.c.
If the number of tuples in the heap exceeds approximately INT_MAX/2,
this loop's calculation "2*i+1" could overflow, resulting in a crash.
Fix it by using unsigned int rather than int for the relevant local
variables; that shouldn't cost anything extra on any popular hardware.
Per bug #14722 from Sergey Koposov.

Original patch by Sergey Koposov, modified by me per a suggestion
from Heikki Linnakangas to use unsigned int not int64.

Back-patch to 9.4, where tuplesort.c grew the ability to sort as many
as INT_MAX tuples in-memory (commit 263865a48).

Discussion: https://postgr.es/m/20170629161637.1478.93109@wrigleys.postgresql.org
2017-07-12 13:24:16 -04:00
Tom Lane 382ceffdf7 Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.

By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis.  However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent.  That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.

This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:35:54 -04:00
Tom Lane c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
Tom Lane e3860ffa4d Initial pgindent run with pg_bsd_indent version 2.0.
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:

* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
  sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
  well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
  with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
  than the expected column 33.

On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list.  This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.

There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses.  I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 14:39:04 -04:00
Bruce Momjian a6fd7b7a5f Post-PG 10 beta1 pgindent run
perltidy run not included.
2017-05-17 16:31:56 -04:00
Andres Freund b58c433ef9 Fix duplicated words in comment.
Reported-By: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzn3rY2N0gTWndaApD113T+O8L6oz8cm7_F3P8y4awdoOg@mail.gmail.com
Backpatch: no, only present in master
2017-05-06 17:03:45 -07:00
Andres Freund fa117ee403 Allow avoiding tuple copy within tuplesort_gettupleslot().
Add a "copy" argument to make it optional to receive a copy of caller
tuple that is safe to use following a subsequent manipulating of
tuplesort's state.  This is a performance optimization.  Most existing
tuplesort_gettupleslot() callers are made to opt out of copying.
Existing callers that happen to rely on the validity of tuple memory
beyond subsequent manipulations of the tuplesort request their own
copy.

This brings tuplesort_gettupleslot() in line with
tuplestore_gettupleslot().  In the future, a "copy"
tuplesort_getdatum() argument may be added, that similarly allows
callers to opt out of receiving their own copy of tuple.

In passing, clarify assumptions that callers of other tuplesort fetch
routines may make about tuple memory validity, per gripe from Tom
Lane.

Author: Peter Geoghegan
Discussion: CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
2017-04-06 14:48:59 -07:00
Robert Haas ea69a0dead Expand hash indexes more gradually.
Since hash indexes typically have very few overflow pages, adding a
new splitpoint essentially doubles the on-disk size of the index,
which can lead to large and abrupt increases in disk usage (and
perhaps long delays on occasion).  To mitigate this problem to some
degree, divide larger splitpoints into four equal phases.  This means
that, for example, instead of growing from 4GB to 8GB all at once, a
hash index will now grow from 4GB to 5GB to 6GB to 7GB to 8GB, which
is perhaps still not as smooth as we'd like but certainly an
improvement.

This changes the on-disk format of the metapage, so bump HASH_VERSION
from 2 to 3.  This will force a REINDEX of all existing hash indexes,
but that's probably a good idea anyway.  First, hash indexes from
pre-10 versions of PostgreSQL could easily be corrupted, and we don't
want to confuse corruption carried over from an older release with any
corruption caused despite the new write-ahead logging in v10.  Second,
it will let us remove some backward-compatibility code added by commit
293e24e507.

Mithun Cy, reviewed by Amit Kapila, Jesper Pedersen and me.  Regression
test outputs updated by me.

Discussion: http://postgr.es/m/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYty0jCf-pa+m+vYUJ716+AxM7nv_syvyanyf5O-L_i2A@mail.gmail.com
2017-04-03 23:46:33 -04:00
Kevin Grittner 18ce3a4ab2 Add infrastructure to support EphemeralNamedRelation references.
A QueryEnvironment concept is added, which allows new types of
objects to be passed into queries from parsing on through
execution.  At this point, the only thing implemented is a
collection of EphemeralNamedRelation objects -- relations which
can be referenced by name in queries, but do not exist in the
catalogs.  The only type of ENR implemented is NamedTuplestore, but
provision is made to add more types fairly easily.

An ENR can carry its own TupleDesc or reference a relation in the
catalogs by relid.

Although these features can be used without SPI, convenience
functions are added to SPI so that ENRs can easily be used by code
run through SPI.

The initial use of all this is going to be transition tables in
AFTER triggers, but that will be added to each PL as a separate
commit.

An incidental effect of this patch is to produce a more informative
error message if an attempt is made to modify the contents of a CTE
from a referencing DML statement.  No tests previously covered that
possibility, so one is added.

Kevin Grittner and Thomas Munro
Reviewed by Heikki Linnakangas, David Fetter, and Thomas Munro
with valuable comments and suggestions from many others
2017-03-31 23:17:18 -05:00
Heikki Linnakangas 7ac4a389a7 Don't create "holes" in BufFiles, in the new logtape code.
The "Simplify tape block format" commit ignored the rule that blocks
returned by ltsGetFreeBlock() must be written out in the same order, at
least in the first write pass. To fix, relax that requirement, by making
ltsWriteBlock() to detect if it's about to create a "hole" in the
underlying BufFile, and fill it with zeros instead.

Reported, analysed, and reviewed by Peter Geoghegan.

Discussion: https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com
2017-02-01 12:17:38 +02:00
Heikki Linnakangas dbeca61c35 Remove leftover reference to "indirect blocks" in comment.
Peter Geoghegan
2017-01-30 10:52:50 +02:00
Tom Lane 4e46c97fde Fix NULL pointer dereference in tuplesort.c.
Oversight in commit e94568ecc.  This could cause a crash when an external
datum tuplesort of a pass-by-value type required multiple passes.
Per report from Mithun Cy.

Peter Geoghegan

Discussion: https://postgr.es/m/CAD__OujuhfWFULGFSt1fyHqUb8N-XafjJhudwt88V0Qs2o84qg@mail.gmail.com
2017-01-16 13:53:40 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Heikki Linnakangas 01ec25631f Simplify tape block format.
No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.

Reviewed by Peter Geoghegan and Robert Haas.

Discussion: https://www.postgresql.org/message-id/34678beb-938e-646e-db9f-a7def5c44ada%40iki.fi
2016-12-22 18:45:00 +02:00