Commit Graph

160 Commits

Peter Geoghegan 3c3b8a4b26 Truncate line pointer array during VACUUM.
Teach VACUUM to truncate the line pointer array of each heap page when a
contiguous group of LP_UNUSED line pointers appears at the end of the
array -- these unused and unreferenced items are excluded from the
truncated array.  This process
occurs during VACUUM's second pass over the heap, right after LP_DEAD
line pointers on the page (those encountered/pruned during the first
pass) are marked LP_UNUSED.
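
A minimal sketch of the idea (illustrative only; truncate_unused_tail is a
made-up name -- the committed code is PageTruncateLinePointerArray() in
bufpage.c, which includes additional sanity checks):

    #include "postgres.h"
    #include "storage/bufpage.h"

    static void
    truncate_unused_tail(Page page)
    {
        PageHeader   phdr = (PageHeader) page;
        OffsetNumber off = PageGetMaxOffsetNumber(page);

        /* Walk backwards past the trailing run of LP_UNUSED items. */
        while (off >= FirstOffsetNumber &&
               !ItemIdIsUsed(PageGetItemId(page, off)))
            off--;

        /* 'off' is now the last used offset (0 if the page holds no items). */
        phdr->pd_lower = SizeOfPageHeaderData + off * sizeof(ItemIdData);
    }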

Truncation avoids line pointer bloat with certain workloads,
particularly those involving continual range DELETEs and bulk INSERTs
against the same table.

Also harden heapam code to check for an out-of-range page offset number
in places where we weren't already doing so.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
2021-04-07 08:47:15 -07:00
Peter Geoghegan 30aaab26e5 PageAddItemExtended(): Add LP_UNUSED assertion.
Assert that LP_UNUSED items have no storage.  If it's worth having
defensive code in non-assert builds then it's worth having an assertion
as well.
2021-03-28 20:10:02 -07:00
David Rowley bb437f995d Add TID Range Scans to support efficient scanning ranges of TIDs
This adds a new executor node named TID Range Scan.  The query planner
will generate paths for TID Range scans when quals are discovered on base
relations which search for ranges on the table's ctid column.  These
ranges may be open at either end. For example, WHERE ctid >= '(10,0)';
will return all tuples on page 10 and over.

To support this, two new optional callback functions have been added to
table AM.  scan_set_tidrange is used to set the scan range to just the
given range of TIDs.  scan_getnextslot_tidrange fetches the next tuple
in the given range.

For AMs where scanning ranges of TIDs would not make sense, these functions
can be set to NULL in the TableAmRoutine.  The query planner won't
generate TID Range Scan Paths in that case.
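
As a sketch, a hypothetical AM that opts out can simply leave the two new
members unset (my_simple_am_methods is an invented name; all other required
callbacks are elided here):

    #include "postgres.h"
    #include "access/tableam.h"

    /* Hypothetical table AM that does not support TID range scans. */
    static const TableAmRoutine my_simple_am_methods = {
        .type = T_TableAmRoutine,
        /* ... the usual required callbacks would be filled in here ... */
        .scan_set_tidrange = NULL,          /* planner skips TID Range Scans */
        .scan_getnextslot_tidrange = NULL,
    };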

Author: Edmund Horner, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu
Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
2021-02-27 22:59:36 +13:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Michael Paquier d401c5769e Extend PageIsVerified() to handle more custom options
This is useful for checking relation pages without having to load them
into shared buffers, and two cases can make use of that: page
verification in base backups and its online, lock-safe flavor.

Compatibility is kept with past versions using a macro that calls the
new extended routine with the set of options compatible with the
original version.
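
Roughly, the compatibility shim looks like this (paraphrased; names as
added to bufpage.h):

    /* New entry point takes a bitmask of options ... */
    extern bool PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags);

    /* ... and the old name keeps the behavior of the original version. */
    #define PageIsVerified(page, blkno) \
        PageIsVerifiedExtended(page, blkno, PIV_LOG_WARNING | PIV_REPORT_STAT)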

Extracted from a larger patch by the same author.

Author: Anastasia Lubennikova
Reviewed-by: Michael Paquier, Julien Rouhaud
Discussion: https://postgr.es/m/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru
2020-10-26 09:55:28 +09:00
David Rowley 19c60ad69a Optimize compactify_tuples function
This function often showed up in profiles of vacuum and could be a
significant bottleneck during recovery. The problem was that a qsort
was performed in order to sort an array of item pointers in reverse offset
order so that we could use that to safely move tuples up to the end of the
page without overwriting the memory of yet-to-be-moved tuples; i.e. we
used to compact the page starting at the back of the page and move towards
the front. The qsort that this required could be expensive for pages with
a large number of tuples.

In this commit, we take another approach to tuple compactification.

Now, instead of sorting the remaining item pointers array we first check
if the array is presorted and only memmove() the tuples that need to be
moved. This presorted check can be done very cheaply in the calling
functions when the array is being populated. This presorted case is very
fast.
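
A self-contained illustration of that cheap check (hypothetical names and
types, not the PostgreSQL code): while populating the array the caller only
has to remember the previous offset and verify that offsets keep decreasing.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct
    {
        uint16_t itemoff;       /* offset of the tuple within the page */
        uint16_t alignedlen;    /* aligned length of the tuple */
    } compact_item;

    static bool
    collect_items(const uint16_t *offsets, const uint16_t *lens, int n,
                  compact_item *out)
    {
        bool     presorted = true;
        uint16_t lastoff = UINT16_MAX;

        for (int i = 0; i < n; i++)
        {
            out[i].itemoff = offsets[i];
            out[i].alignedlen = lens[i];
            if (offsets[i] > lastoff)
                presorted = false;  /* not in reverse offset order */
            lastoff = offsets[i];
        }
        return presorted;           /* true => memmove-only compaction suffices */
    }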

When the item pointer array is not presorted we must copy tuples that need
to be moved into a temp buffer before copying them back into the page
again. This differs from what we used to do here as we're now copying the
tuples back into the page in reverse line pointer order. Previously we
left the existing order alone.  Reordering the tuples results in an
increased likelihood of hitting the pre-sorted case the next time around.
Any newly added tuple which consumes a new line pointer will also maintain
the correct sort order of tuples in the page, which likewise results in the
presorted case being hit the next time.  Only reusing an unused line
pointer can put the tuples out of order again, but that will be
corrected the next time the function is called for the page.

Benchmarks have shown that the non-presorted case is at least as fast as
the original qsort method even when the page has just a few tuples. As
the number of tuples grows, the new method maintains its performance,
whereas the original qsort method becomes much slower as the number of
tuples on the page becomes large.

Author: David Rowley
Reviewed-by: Thomas Munro
Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw@mail.gmail.com
2020-09-16 13:22:20 +12:00
Tom Lane 695de5d1ed Split Makefile symbol CFLAGS_VECTOR into two symbols.
Replace CFLAGS_VECTOR with CFLAGS_UNROLL_LOOPS and CFLAGS_VECTORIZE,
allowing us to distinguish whether we want to apply -funroll-loops,
-ftree-vectorize, or both to a particular source file.  Up to now
the only consumer of the symbol has been checksum.c which wants
both, so that there was no need to distinguish; but that's about
to change.

Amit Khandekar, reviewed and edited a little by me

Discussion: https://postgr.es/m/CAJ3gD9evtA_vBo+WMYMyT-u=keHX7-r8p2w7OSRfXf42LTwCZQ@mail.gmail.com
2020-09-06 21:28:16 -04:00
Andres Freund 5e7bbb5286 code: replace 'master' with 'primary' where appropriate.
Also changed "in the primary" to "on the primary", and added a few
"the" before "primary".

Author: Andres Freund
Reviewed-By: David Steele
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
2020-07-08 12:57:23 -07:00
Magnus Hagander eff5b245df Document that pg_checksums exists in checksums README
Author: Daniel Gustafsson <daniel@yesql.se>
2020-03-26 15:05:54 +01:00
Amit Kapila b4f140869f Add missing errcode() in a few ereport calls.
This allows specifying an SQLSTATE error code for the errors in the
places that were missing one.

Reported-by: Sawada Masahiko
Author: Sawada Masahiko
Backpatch-through: 9.5
Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com
2020-03-18 09:27:14 +05:30
Peter Geoghegan 0d861bbb70 Add deduplication to nbtree.
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26 13:05:30 -08:00
Michael Paquier f1f10a1ba9 Add declaration-level assertions for compile-time checks
These new assertions can be used at file scope, outside of any function,
for compile-time checks.  This commit provides implementations for C and
C++, and fallback implementations.
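
A hypothetical usage sketch (the checked condition and message are made up
for illustration):

    #include "postgres.h"
    #include "storage/itemid.h"

    /* File scope, outside any function: evaluated entirely at compile time. */
    StaticAssertDecl(sizeof(ItemIdData) == 4,
                     "ItemIdData is expected to be exactly 4 bytes");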

Author: Peter Smith
Reviewed-by: Andres Freund, Kyotaro Horiguchi, Dagfinn Ilmari Mannsåker,
Michael Paquier
Discussion: https://postgr.es/m/201DD0641B056142AC8C6645EC1B5F62014B8E8030@SYD1217
2020-02-03 14:48:42 +09:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Amit Kapila 14aec03502 Make the order of the header file includes consistent in backend modules.
Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order
of header file inclusion consistent for backend modules.

In passing, removed a couple of duplicate inclusions.

Author: Vignesh C
Reviewed-by: Kuntal Ghosh and Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
2019-11-12 08:30:16 +05:30
Andres Freund 01368e5d9d Split all OBJS style lines in makefiles into one-line-per-entry style.
When maintaining or merging patches, one of the most common sources
of conflicts is the list of objects in makefiles. Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempting to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.

By splitting, and alphabetically sorting, OBJS style lines into one
object per line, conflicts should be less frequent, and easier to
resolve when they still occur.

Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
2019-11-05 14:41:07 -08:00
Alvaro Herrera d890fa812d Make one message just like all its siblings. 2019-05-28 23:44:22 -04:00
Peter Geoghegan ae7291acbc Standardize ItemIdData terminology.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous.  Only
ItemPointerData/ItemPointer variables should be called item pointers.

To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers".  The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.
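
For reference, the two structs that the terminology distinguishes
(paraphrased from storage/itemid.h and storage/itemptr.h):

    /* Line pointer / item identifier: lives in a page's line pointer array. */
    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                    lp_flags:2,     /* state of line pointer */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;

    /* Item pointer: a (block, offset) pair naming a tuple in a relation. */
    typedef struct ItemPointerData
    {
        BlockIdData  ip_blkid;
        OffsetNumber ip_posid;
    } ItemPointerData;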

Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
2019-05-13 15:53:39 -07:00
Magnus Hagander 6b9e875f72 Track block level checksum failures in pg_stat_database
This adds a column that counts how many checksum failures have occurred
on files belonging to a specific database. Both checksum failures
during normal backend processing and those created when a base backup
detects a checksum failure are counted.

Author: Magnus Hagander
Reviewed by: Julien Rouhaud
2019-03-09 10:47:30 -08:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Magnus Hagander a228cc13ae Revert "Allow on-line enabling and disabling of data checksums"
This reverts the backend sides of commit 1fde38beaa.
I have, at least for now, left the pg_verify_checksums tool in place, as
this tool can be very valuable without the rest of the patch as well,
and since it's a read-only tool that only runs when the cluster is down
it should be a lot safer.
2018-04-09 19:03:42 +02:00
Magnus Hagander 1fde38beaa Allow on-line enabling and disabling of data checksums
This makes it possible to turn checksums on in a live cluster, without
the previous need for dump/reload or logical replication (and to turn it
off).

Enabling checksums starts a background process in the form of a
launcher/worker combination that goes through the entire database and
recalculates checksums on each and every page. Only when all pages have
been checksummed are they fully enabled in the cluster. Any failure of
the process will revert to checksums off, and the process has to be
restarted.

This adds a new WAL record that indicates the state of checksums, so
the process works across replicated clusters.

Authors: Magnus Hagander and Daniel Gustafsson
Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin
2018-04-05 22:04:48 +02:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Tom Lane a9169f0200 Avoid looping through line pointers twice in PageRepairFragmentation().
There doesn't seem to be any good reason to do the filling of the
itemidbase[] array separately from the first traversal of the pointers.
It's certainly not a win if there are any line pointers with storage,
and even if there aren't, this change doesn't insert code into the part
of the first loop that will be traversed in that case.  So let's just
merge the two loops.

Yura Sokolov, reviewed by Claudio Freire

Discussion: https://postgr.es/m/e49befcc6f1d7099834c6fdf5c675a60@postgrespro.ru
2017-11-03 17:21:59 -04:00
Tom Lane 382ceffdf7 Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.

By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis.  However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent.  That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.

This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:35:54 -04:00
Tom Lane c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
Tom Lane 651902deb1 Re-run pgindent.
This is just to have a clean base state for testing of Piotr Stefaniak's
latest version of FreeBSD indent.  I fixed up a couple of places where
pgindent would have changed format not-nicely.  perltidy not included.

Discussion: https://postgr.es/m/VI1PR03MB119959F4B65F000CA7CD9F6BF2CC0@VI1PR03MB1199.eurprd03.prod.outlook.com
2017-06-13 13:05:59 -04:00
Alvaro Herrera 55a70a023c Assorted translatable string fixes
Mark our rusage reportage string translatable; remove quotes from type
names; unify formatting of very similar messages.
2017-06-04 11:41:16 -04:00
Alvaro Herrera ce96ce60ca Remove direct uses of ItemPointer.{ip_blkid,ip_posid}
There are no functional changes here; this simply encapsulates knowledge
of the ItemPointerData struct so that a future patch can change things
without more breakage.

All direct users of ip_blkid and ip_posid are changed to use existing
macros ItemPointerGetBlockNumber and ItemPointerGetOffsetNumber
respectively.  For callers where that's inappropriate (because they
Assert that the itempointer is valid-looking), add
ItemPointerGetBlockNumberNoCheck and ItemPointerGetOffsetNumberNoCheck,
which lack the assertion but are otherwise identical.
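
Illustrative before/after at a hypothetical call site (report_tid is an
invented function):

    #include "postgres.h"
    #include "storage/itemptr.h"

    static void
    report_tid(ItemPointer tid)
    {
        BlockNumber  block;
        OffsetNumber offset;

        /* before: direct access to the struct fields */
        block  = BlockIdGetBlockNumber(&tid->ip_blkid);
        offset = tid->ip_posid;

        /* after: through the accessors, which Assert the pointer looks valid */
        block  = ItemPointerGetBlockNumber(tid);
        offset = ItemPointerGetOffsetNumber(tid);

        elog(DEBUG1, "tid = (%u,%u)", block, offset);
    }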

Author: Pavan Deolasee
Discussion: https://postgr.es/m/CABOikdNnFon4cJiL=h1mZH3bgUeU+sWHuU4Yr8AB=j3A2p1GiA@mail.gmail.com
2017-03-28 19:02:23 -03:00
Robert Haas b0f18cb77f hash: Refactor bucket squeeze code.
In preparation for adding write-ahead logging to hash indexes,
refactor _hash_freeovflpage and _hash_squeezebucket so that all
related page modifications happen in a single section of code.  The
previous coding assumed that it would be fine to move tuples one at a
time, and also that the various operations involved in freeing an
overflow page didn't necessarily all need to be done together, all
of which is true if you don't care about write-ahead logging.

Amit Kapila, with slight changes by me.
2017-02-27 22:34:21 +05:30
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane 24992c6db9 Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
The full generality of deleting an arbitrary number of tuples is no longer
needed, so let's save some code and cycles by replacing the original coding
with an implementation based on PageIndexTupleDelete.

We can always get back the old code from git if we need it again for new
callers (though I don't care for its willingness to mess with line pointers
it wasn't told to mess with).

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 19:00:59 -04:00
Tom Lane 1a4be103a5 Convert PageAddItem into a macro to save a few cycles.
Nowadays this is just a backwards-compatibility wrapper around
PageAddItemExtended, so let's avoid the extra level of function call.
In addition, because pretty much all callers are passing constants
for the two bool arguments, compilers will be able to constant-fold
the conversion to a flags bitmask.
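
The wrapper ends up looking roughly like this (paraphrased from bufpage.h):

    #define PageAddItem(page, item, size, offsetNumber, overwrite, is_heap) \
        PageAddItemExtended(page, item, size, offsetNumber, \
                            ((overwrite) ? PAI_OVERWRITE : 0) | \
                            ((is_heap) ? PAI_IS_HEAP : 0))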

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 18:17:07 -04:00
Tom Lane b1328d78f8 Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.
PageIndexTupleOverwrite performs approximately the same function as
PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem
targeting the same item pointer offset.  But in the case where the new
tuple is the same size as the old, it avoids shuffling other data around on
the page, because the new tuple is placed where the old one was rather than
being appended to the end of the page.  This has been shown to provide a
substantial speedup for some GiST use-cases.
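
A sketch of the interface and a hypothetical GiST-style call site (the error
message wording is invented):

    extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
                                        Item newtup, Size newsize);

    /* replace the tuple at 'offnum' in place, without shuffling the page */
    if (!PageIndexTupleOverwrite(page, offnum, (Item) newtup,
                                 IndexTupleSize(newtup)))
        elog(ERROR, "failed to overwrite index tuple");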

Also, this change allows some API simplifications: we can get rid of
the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for
PageAddItemExtended, since that was used only to cover a corner case
for BRIN that's better expressed by using PageIndexTupleOverwrite.

Note that this patch causes a rather subtle WAL incompatibility: the
physical page content change represented by certain WAL records is now
different than it was before, because while the tuples have the same
itempointer line numbers, the tuples themselves are in different places.
I have not bumped the WAL version number because I think it doesn't matter
unless you are trying to do bitwise comparisons of original and replayed
pages, and in any case we're early in a devel cycle and there will probably
be more WAL changes before v10 gets out the door.

There is probably room to make use of PageIndexTupleOverwrite in SP-GiST
and GIN too, but that is left for a future patch.

Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit
by me

Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>
2016-09-09 18:02:36 -04:00
Tom Lane 984d0a14e8 In PageIndexTupleDelete, don't assume stored item lengths are MAXALIGNed.
PageAddItem stores the item length as-is.  It MAXALIGN's the amount of
space actually allocated for each tuple, but not the stored length.
PageRepairFragmentation, PageIndexMultiDelete, and PageIndexDeleteNoCompact
are all on board with this and MAXALIGN item lengths after fetching them.
But PageIndexTupleDelete expects the stored length to be a MAXALIGN
multiple already.  This accidentally works for existing index AMs because
they all maxalign their tuple sizes internally; but we don't do that for
heap tuples, and it shouldn't be a requirement for index tuples either.

So, sync PageIndexTupleDelete with the rest of bufpage.c by having it
maxalign the item size after fetching.

Also add a check that pd_special is maxaligned, to ensure that the test
"(offset + size) > phdr->pd_special" is still doing the right thing.
(If offset and pd_special are aligned, it doesn't matter whether size is.)
Again, this is in sync with the rest of the routines here, except for
PageAddItem which doesn't test because it doesn't actually do anything
for which pd_special alignment matters.

This shouldn't have any immediate functional impact; it just adds the
flexibility to use PageIndexTupleDelete on index tuples with non-aligned
lengths.

Discussion: <3814.1473366762@sss.pgh.pa.us>
2016-09-09 12:21:09 -04:00
Andres Freund 417fefaf08 Faster PageIsVerified() for the all zeroes case.
That's primarily useful for testing very large relations, using sparse
files.

Discussion: <20140331101001.GE13135@alap3.anarazel.de>
Reviewed-By: Peter Geoghegan
2016-09-08 17:02:43 -07:00
Alvaro Herrera 975ad4e602 Fix PageAddItem BRIN bug
BRIN was relying on the ability to remove a tuple from an index page,
then putting another tuple in the same line pointer.  But PageAddItem
refuses to add a tuple beyond the first free item past the last used
item, and in particular, it rejects an attempt to add an item to an
empty page anywhere other than the first line pointer.  PageAddItem
issues a WARNING and indicates to the caller that it failed, which in
turn causes the BRIN calling code to issue a PANIC, so the whole
sequence looks like this:
	WARNING:  specified item offset is too large
	PANIC:  failed to add BRIN tuple

To fix, create a new function PageAddItemExtended which is like
PageAddItem except that the two boolean arguments become a flags bitmap;
the "overwrite" and "is_heap" boolean flags in PageAddItem become
PAI_OVERWRITE and PAI_IS_HEAP flags in the new function, and a new flag
PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN.
PageAddItem() retains its original signature, for compatibility with
third-party modules (other callers in core code are not modified,
either).
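
Paraphrasing the resulting interface (the flag bit values shown are
illustrative):

    #define PAI_OVERWRITE           (1 << 0)
    #define PAI_IS_HEAP             (1 << 1)
    #define PAI_ALLOW_FAR_OFFSET    (1 << 2)

    extern OffsetNumber PageAddItemExtended(Page page, Item item, Size size,
                                            OffsetNumber offsetNumber,
                                            int flags);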

Also, in the belt-and-suspenders spirit, I added a new sanity check in
brinGetTupleForHeapBlock to raise an error if a TID found in the revmap
is not marked as live by the page header.  This causes it to react with
"ERROR: corrupted BRIN index" to the bug at hand, rather than a hard
crash.

Backpatch to 9.5.

Bug reported by Andreas Seltenreich as detected by his handy sqlsmith
fuzzer.
Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
2016-05-30 14:47:22 -04:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Bruce Momjian 807b9e0dff pgindent run for 9.5 2015-05-23 21:35:49 -04:00
Tom Lane d4b538ea36 Improve packing/alignment annotation for ItemPointerData.
We want this struct to be exactly a series of 3 int16 words, no more
and no less.  Historically, at least, some ARM compilers preferred to
pad it to 8 bytes unless coerced.  Our old way of doing that was just
to use __attribute__((packed)), but as pointed out by Piotr Stefaniak,
that does too much: it also licenses the compiler to give the struct
only byte-alignment.  We don't want that because it adds access overhead,
possibly quite significant overhead.  According to the GCC manual, what
we want requires also specifying __attribute__((aligned(2))).  It's not
entirely clear if all the relevant compilers accept this pragma as well,
but we can hope the buildfarm will tell us if not.  We can also add a
static assertion that should fire if the compiler padded the struct.
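
The intent, roughly, in plain GCC syntax (the tree wraps the attributes in
portability macros, so the exact spelling differs):

    typedef struct ItemPointerData
    {
        BlockIdData  ip_blkid;
        OffsetNumber ip_posid;
    } __attribute__((packed)) __attribute__((aligned(2))) ItemPointerData;

    /* compile-time check that the compiler did not pad the struct */
    _Static_assert(sizeof(ItemPointerData) == 6,
                   "ItemPointerData must be exactly three int16 words");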

Since the combination of these pragmas should define exactly what we
want on any compiler that accepts them, let's try using them wherever
we think they exist, not only for __arm__.  (This is likely to expose
that the conditional definitions in c.h are inadequate, but finding
that out would be a good thing.)

The immediate motivation for this is that the current definition of
ExecRowMark allows its curCtid field to be misaligned.  It is not clear
whether there are any other uses of ItemPointerData with a similar hazard.
We could change the definition of ExecRowMark if this doesn't work, but
it would be far better to have a future-proof fix.

Piotr Stefaniak, some further hacking by me
2015-05-21 17:21:46 -04:00
Heikki Linnakangas b48437d11b Fix off-by-one error in Assertion.
The point of the assertion is to ensure that the arrays allocated in stack
are large enough, but the check was one item short.

This won't matter in practice because MaxIndexTuplesPerPage is an
overestimate, so you can't have that many items on a page in reality.
But let's be tidy.

Spotted by Anastasia Lubennikova. Backpatch to all supported versions, like
the patch that added the assertion.
2015-05-19 19:25:01 +03:00
Heikki Linnakangas 809d9a260b Refactor page compactifying code.
The logic to compact away removed tuples from a page was duplicated with
small differences in PageRepairFragmentation, PageIndexMultiDelete, and
PageIndexDeleteNoCompact. Put it into a common function.

Reviewed by Peter Geoghegan.
2015-02-03 14:09:29 +02:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Alvaro Herrera 7516f52594 BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing it with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.

As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.

For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range.  This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results.  In this commit I only include minmax.

Catalog version bumped due to new builtin catalog entries.

There's more that could be done here, but this is a good step forwards.

Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.

Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633.
2014-11-07 16:38:14 -03:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not affect backpatching.
2014-05-06 12:12:18 -04:00
Heikki Linnakangas 877b088785 Avoid allocations in critical sections.
If a palloc in a critical section fails, it becomes a PANIC.
2014-04-04 13:35:44 +03:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Noah Misch 19085116ee Cooperate with the Valgrind instrumentation framework.
Valgrind "client requests" in aset.c and mcxt.c teach Valgrind and its
Memcheck tool about the PostgreSQL allocator.  This makes Valgrind
roughly as sensitive to memory errors involving palloc chunks as it is
to memory errors involving malloc chunks.  Further client requests in
PageAddItem() and printtup() verify that all bits being added to a
buffer page or furnished to an output function are predictably-defined.
Those tests catch failures of C-language functions to fully initialize
the bits of a Datum, which in turn stymie optimizations that rely on
_equalConst().  Define the USE_VALGRIND symbol in pg_config_manual.h to
enable these additions.  An included "suppression file" silences nominal
errors we don't plan to fix.
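
A client request is just a macro call at the point where the bits must be
fully defined; for example, something along these lines guarded by
USE_VALGRIND (a sketch, not necessarily the exact committed form):

    #ifdef USE_VALGRIND
        /* Make Memcheck complain if any bit of the new item is undefined. */
        VALGRIND_CHECK_MEM_IS_DEFINED(item, size);
    #endif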

Reviewed in earlier versions by Peter Geoghegan and Korry Douglas.
2013-06-26 20:22:25 -04:00
Tom Lane f04216341d Refactor checksumming code to make it easier to use externally.
pg_filedump and other external utility programs are likely to want to be
able to check Postgres page checksums.  To avoid messy duplication of code,
move the checksumming functionality into an exported header file, much as
we did awhile back for the CRC code.

In passing, get rid of an unportable assumption that a static char[] array
will be word-aligned, and do some other minor code beautification.
2013-06-13 22:35:56 -04:00
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Simon Riggs fdea2530bd Compiler optimizations for page checksum code.
Ants Aasma and Jeff Davis
2013-04-30 06:59:26 +01:00