Commit Graph

3099 Commits

Author SHA1 Message Date
Tom Lane 81e334ce4e Set the metapage's pd_lower correctly in brin, gin, and spgist indexes.
Previously, these index types left the pd_lower field set to the default
SizeOfPageHeaderData, which is really a lie because it ought to point past
whatever space is being used for metadata.  The coding accidentally failed
to fail because we never told xlog.c that the metapage is of standard
format --- but that's not very good, because it impedes WAL consistency
checking, and in some cases prevents compression of full-page images.

To fix, ensure that we set pd_lower correctly, not only when creating a
metapage but whenever we write it out (these apparently redundant steps are
needed to cope with pg_upgrade'd indexes that don't yet contain the right
value).  This allows telling xlog.c that the page is of standard format.

The WAL consistency check mask functions are made to mask only if pd_lower
appears valid, which I think is likely unnecessary complication, since
any metapage appearing in a v11 WAL stream should contain valid pd_lower.
But it doesn't cost much to be paranoid.

Amit Langote, reviewed by Michael Paquier and Amit Kapila

Discussion: https://postgr.es/m/0d273805-0e9e-ec1a-cb84-d4da400b8f85@lab.ntt.co.jp
2017-11-02 17:22:08 -04:00
Tom Lane 62a16572d5 Fix corner-case errors in brin_doupdate().
In some cases the BRIN code releases lock on an index page, and later
re-acquires lock and tries to check that the tuple it was working on is
still there.  That check was a couple bricks shy of a load.  It didn't
consider that the page might have turned into a "revmap" page.  (The
samepage code path doesn't call brin_getinsertbuffer(), so it isn't
protected by the checks for revmap status there.)  It also didn't check
whether the tuple offset was now off the end of the linepointer array.
Since commit 24992c6db the latter case is pretty common, but at least
in principle it could have occurred before that.  The net result is
that concurrent updates of a BRIN index could fail with errors like
"invalid index offnum" or "inconsistent range map".

Per report from Tomas Vondra.  Back-patch to 9.5, since this code is
substantially the same in all versions containing BRIN.

Discussion: https://postgr.es/m/10d2b9f9-f427-03b8-8ad9-6af4ecacbee9@2ndquadrant.com
2017-11-02 12:54:55 -04:00
Alvaro Herrera c6764eb3ae Revert bogus fixes of HOT-freezing bug
It turns out we misdiagnosed what the real problem was.  Revert the
previous changes, because they may have worse consequences going
forward.  A better fix is forthcoming.

The simplistic test case is kept, though disabled.

Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de
2017-11-02 15:51:41 +01:00
Robert Haas 846fcc8516 Fix problems with the "role" GUC and parallel query.
Without this fix, dropping a role can sometimes result in parallel
query failures in sessions that have used "SET ROLE" to assume the
dropped role, even if that setting isn't active any more.

Report by Pavan Deolasee.  Patch by Amit Kapila, reviewed by me.

Discussion: http://postgr.es/m/CABOikdOomRcZsLsLK+Z+qENM1zxyaWnAvFh3MJZzZnnKiF+REg@mail.gmail.com
2017-10-29 12:58:40 +05:30
Andres Freund 31079a4a8e Replace remaining uses of pq_sendint with pq_sendint{8,16,32}.
pq_sendint() remains, so extension code doesn't unnecessarily break.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
2017-10-11 21:00:46 -07:00
Andres Freund 4c119fbcd4 Improve performance of SendRowDescriptionMessage.
There's three categories of changes leading to better performance:
- Splitting the per-attribute part of SendRowDescriptionMessage into a
  v2 and a v3 version allows avoiding branches for every attribute.
- Preallocating the size of the buffer to be big enough for all
  attributes and then using pq_write* avoids unnecessary buffer
  size checks & resizing.
- Reusing a persistently allocated StringInfo for all
  SendRowDescriptionMessage() invocations avoids repeated allocations
  & reallocations.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
2017-10-11 17:23:23 -07:00
Andres Freund f2dec34e19 Use one stringbuffer for all rows printed in printtup.c.
This avoids newly allocating, and then possibly growing, the
stringbuffer for every row. For wide rows this can substantially
reduce memory allocator overhead, at the price of not immediately
reducing memory usage after outputting an especially wide row.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
2017-10-11 16:26:35 -07:00
Tom Lane 5fa6b0d102 Remove unnecessary PG_TRY overhead for CurrentResourceOwner changes.
resowner/README contained advice to use a PG_TRY block to restore the
old CurrentResourceOwner value anywhere that that variable is transiently
changed.  That advice was only inconsistently followed, however, and
on reflection it seems like unnecessary overhead.  We don't bother
with such a convention for transient CurrentMemoryContext changes,
on the grounds that any (sub)transaction abort will start out by
resetting CurrentMemoryContext to what it wants.  But the same is
true of CurrentResourceOwner, so there seems no need to treat it
differently.

Hence, remove PG_TRY blocks that exist only to restore CurrentResourceOwner
before re-throwing the error.  There are a couple of places that restore
it along with some other actions, and I left those alone; the restore is
probably unnecessary but no noticeable gain will result from removing it.

Discussion: https://postgr.es/m/5236.1507583529@sss.pgh.pa.us
2017-10-11 17:44:09 -04:00
Tom Lane 6b87416c9a Fix access-off-end-of-array in clog.c.
Sloppy loop coding in set_status_by_pages() resulted in fetching one array
element more than it should from the subxids[] array.  The odds of this
resulting in SIGSEGV are pretty small, but we've certainly seen that happen
with similar mistakes elsewhere.  While at it, we can get rid of an extra
TransactionIdToPage() calculation per loop.

Per report from David Binderman.  Back-patch to all supported branches,
since this code is quite old.

Discussion: https://postgr.es/m/HE1PR0802MB2331CBA919CBFFF0C465EB429C710@HE1PR0802MB2331.eurprd08.prod.outlook.com
2017-10-06 12:20:23 -04:00
Alvaro Herrera a5736bf754 Fix traversal of half-frozen update chains
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken.  This
breaks HOT (and probably other things).  A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed.  This can be seen because
index creation (or REINDEX) complain like
  ERROR:  XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"

Because of relfrozenxid contraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.

This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.

Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax.  But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best.  Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.

I didn't optimize the new function for performance.  The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.

This is a followup after 20b6552242 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.

Reported-by: Peter Geoghegan
Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
	Yi Wen Wong, Álvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
2017-10-06 17:20:01 +02:00
Tom Lane fe9ba28ee8 Fix typo in README.
s/BeginInternalSubtransaction/BeginInternalSubTransaction/
2017-10-05 15:06:01 -04:00
Robert Haas e9baa5e9fa Allow DML commands that create tables to use parallel query.
Haribabu Kommi, reviewed by Dilip Kumar and Rafia Sabih.  Various
cosmetic changes by me to explain why this appears to be safe but
allowing inserts in parallel mode in general wouldn't be.  Also, I
removed the REFRESH MATERIALIZED VIEW case from Haribabu's patch,
since I'm not convinced that case is OK, and hacked on the
documentation somewhat.

Discussion: http://postgr.es/m/CAJrrPGdo5bak6qnPWe8Kpi8g_jfQEs-G4SYmG9y+OFaw2-dPvA@mail.gmail.com
2017-10-05 11:40:48 -04:00
Peter Eisentraut 5373bc2a08 Add background worker type
Add bgw_type field to background worker structure.  It is intended to be
set to the same value for all workers of the same type, so they can be
grouped in pg_stat_activity, for example.

The backend_type column in pg_stat_activity now shows bgw_type for a
background worker.  The ps listing also no longer calls out that a
process is a background worker but just show the bgw_type.  That way,
being a background worker is more of an implementation detail now that
is not shown to the user.  However, most log messages still refer to
'background worker "%s"'; otherwise constructing sensible and
translatable log messages would become tricky.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
2017-09-29 11:08:24 -04:00
Alvaro Herrera 20b6552242 Fix freezing of a dead HOT-updated tuple
Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples.  But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.

(This is especially bad if there's any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)

One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it.  But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.

Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead).  In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.

An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer.  It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it be good to have a more reliable way to test this more
fully, but it'd require more work.
In message https://postgr.es/m/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.

Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling.  In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.

Reported-by: Daniel Wood
Diagnosed-by: Daniel Wood
Reviewed-by: Yi Wen Wong, Michaël Paquier
Discussion: https://postgr.es/m/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com
2017-09-28 16:44:01 +02:00
Tom Lane 28e0727076 Revert to 9.6 treatment of ALTER TYPE enumtype ADD VALUE.
This reverts commit 15bc038f9, along with the followon commits 1635e80d3
and 984c92074 that tried to clean up the problems exposed by bug #14825.
The result was incomplete because it failed to address parallel-query
requirements.  With 10.0 release so close upon us, now does not seem like
the time to be adding more code to fix that.  I hope we can un-revert this
code and add the missing parallel query support during the v11 cycle.

Back-patch to v10.

Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
2017-09-27 16:14:43 -04:00
Tom Lane 1635e80d30 Use a blacklist to distinguish original from add-on enum values.
Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside
transaction blocks, by disallowing the use of the added value later
in the same transaction, except under limited circumstances.  However,
the test for "limited circumstances" was heuristic and could reject
references to enum values that were created during CREATE TYPE AS ENUM,
not just later.  This breaks the use-case of restoring pg_dump scripts
in a single transaction, as reported in bug #14825 from Balazs Szilfai.

We can improve this by keeping a "blacklist" table of enum value OIDs
created by ALTER TYPE ADD VALUE during the current transaction.  Any
visible-but-uncommitted value whose OID is not in the blacklist must
have been created by CREATE TYPE AS ENUM, and can be used safely
because it could not have a lifespan shorter than its parent enum type.

This change also removes the restriction that a renamed enum value
can't be used before being committed (unless it was on the blacklist).

Andrew Dunstan, with cosmetic improvements by me.
Back-patch to v10.

Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
2017-09-26 13:14:46 -04:00
Robert Haas 22c5e73562 Remove lsn from HashScanPosData.
This was intended as infrastructure for weakening VACUUM's locking
requirements, similar to what was done for btree indexes in commit
2ed5b87f96.  However, for hash indexes,
it seems that the improvements which are possible are actually
extremely marginal.  Furthermore, performing the LSN cross-check will
end up skipping cleanup far more often than is necessary; we only care
about page modifications due to a VACUUM, but the LSN check will fail
if ANY modification has occurred.  So, rather than pressing forward
with that "optimization", just rip the LSN field out.

Patch by me, reviewed by Ashutosh Sharma and Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com
2017-09-26 09:16:45 -04:00
Robert Haas 79a4a665c0 Fix trivial mistake in README.
You might think I (Robert) could manage to count to five without
messing it up, but if you did, you would be wrong.

Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com
2017-09-26 09:01:43 -04:00
Peter Eisentraut 0c5803b450 Refactor new file permission handling
The file handling functions from fd.c were called with a diverse mix of
notations for the file permissions when they were opening new files.
Almost all files created by the server should have the same permissions
set.  So change the API so that e.g. OpenTransientFile() automatically
uses the standard permissions set, and OpenTransientFilePerm() is a new
function that takes an explicit permissions set for the few cases where
it is needed.  This also saves an unnecessary argument for call sites
that are just opening an existing file.

While we're reviewing these APIs, get rid of the FileName typedef and
use the standard const char * for the file name and mode_t for the file
mode.  This makes these functions match other file handling functions
and removes an unnecessary layer of mysteriousness.  We can also get rid
of a few casts that way.

Author: David Steele <david@pgmasters.net>
2017-09-23 10:16:18 -04:00
Robert Haas 6a2fa09c0c For wal_consistency_checking, mask page checksum as well as page LSN.
If the LSN is different, the checksum will be different, too.

Ashwin Agrawal, reviewed by Michael Paquier and Kuntal Ghosh

Discussion: http://postgr.es/m/CALfoeis5iqrAU-+JAN+ZzXkpPr7+-0OAGv7QUHwFn=-wDy4o4Q@mail.gmail.com
2017-09-22 14:28:22 -04:00
Robert Haas 7c75ef5715 hash: Implement page-at-a-time scan.
Commit 09cb5c0e7d added a similar
optimization to btree back in 2006, but nobody bothered to implement
the same thing for hash indexes, probably because they weren't
WAL-logged and had lots of other performance problems as well.  As
with the corresponding btree case, this eliminates the problem of
potentially needing to refind our position within the page, and cuts
down on pin/unpin traffic as well.

Ashutosh Sharma, reviewed by Alexander Korotkov, Jesper Pedersen,
Amit Kapila, and me.  Some final edits to comments and README by
me.

Discussion: http://postgr.es/m/CAE9k0Pm3KTx93K8_5j6VMzG4h5F+SyknxUwXrN-zqSZ9X8ZS3w@mail.gmail.com
2017-09-22 13:56:27 -04:00
Andres Freund fc49e24fa6 Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.

But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.

This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured.  For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.

Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
    Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-19 22:03:48 -07:00
Tom Lane 2d484f9b05 Remove no-op GiST support functions in the core GiST opclasses.
The preceding patch allowed us to remove useless GiST support functions.
This patch actually does that for all the no-op cases in the core GiST
code.  This buys us whatever performance gain is to be had, and more
importantly exercises the preceding patch.

There remain no-op functions in the contrib GiST opclasses, but those
will take more work to remove.

Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com
2017-09-19 23:32:59 -04:00
Tom Lane d3a4f89d8a Allow no-op GiST support functions to be omitted.
There are common use-cases in which the compress and/or decompress
functions can be omitted, with the result being that we make no
data transformation when storing or retrieving index values.
Previously, you had to provide a no-op function anyway, but this
patch allows such opclass support functions to be omitted.

Furthermore, if the compress function is omitted, then the core code
knows that the stored representation is the same as the original data.
This means we can allow index-only scans without requiring a fetch
function to be provided either.  Previously you had to provide a
no-op fetch function if you wanted IOS to work.

This reportedly provides a small performance benefit in such cases,
but IMO the real reason for doing it is just to reduce the amount of
useless boilerplate code that has to be written for GiST opclasses.

Andrey Borodin, reviewed by Dmitriy Sarafannikov

Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com
2017-09-19 23:32:59 -04:00
Andres Freund ec9e05b3c3 Fix crash restart bug introduced in 8356753c21.
The bug was caused by not re-reading the control file during crash
recovery restarts, which lead to an attempt to pfree() shared memory
contents. The fix is to re-read the control file, which seems good
anyway.

It's unclear as of this moment, whether we want to keep the
refactoring introduced in the commit referenced above, or come up with
an alternative approach. But fixing the bug in the mean time seems
like a good idea regardless.

A followup commit will introduce regression test coverage for crash
restarts.

Reported-By: Tom Lane
Discussion: https://postgr.es/m/14134.1505572349@sss.pgh.pa.us
2017-09-18 17:25:49 -07:00
Tom Lane eb5c404b17 Minor code-cleanliness improvements for btree.
Make the btree page-flags test macros (P_ISLEAF and friends) return clean
boolean values, rather than values that might not fit in a bool.  Use them
in a few places that were randomly referencing the flag bits directly.

In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
BT_READ.  (Some think we should go the other way, but as long as we have
BT_READ/BT_WRITE, let's use them consistently.)

Masahiko Sawada, reviewed by Doug Doole

Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com
2017-09-18 16:36:28 -04:00
Andres Freund cc5f81366c Add support for coordinating record typmods among parallel workers.
Tuples can have type RECORDOID and a typmod number that identifies a blessed
TupleDesc in a backend-private cache.  To support the sharing of such tuples
through shared memory and temporary files, provide a typmod registry in
shared memory.

To achieve that, introduce per-session DSM segments, created on demand when a
backend first runs a parallel query.  The per-session DSM segment has a
table-of-contents just like the per-query DSM segment, and initially the
contents are a shared record typmod registry and a DSA area to provide the
space it needs to grow.

State relating to the current session is accessed via a Session object
reached through global variable CurrentSession that may require significant
redesign further down the road as we figure out what else needs to be shared
or remodelled.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-09-14 19:59:21 -07:00
Andres Freund 8356753c21 Perform only one ReadControlFile() during startup.
Previously we read the control file in multiple places. But soon the
segment size will be configurable and stored in the control file, and
that needs to be available earlier than it currently is needed.

Instead of adding yet another place where it's read, refactor things
so there's a single processing of the control file during startup (in
EXEC_BACKEND that's every individual backend's startup).

Author: Andres Freund
Discussion: http://postgr.es/m/20170913092828.aozd3gvvmw67gmyc@alap3.anarazel.de
2017-09-14 14:14:34 -07:00
Peter Eisentraut 821fb8cdbf Message style fixes 2017-09-11 11:21:27 -04:00
Tom Lane 3ca930fc39 Improve performance of get_actual_variable_range with recently-dead tuples.
In commit fccebe421, we hacked get_actual_variable_range() to scan the
index with SnapshotDirty, so that if there are many uncommitted tuples
at the end of the index range, it wouldn't laboriously scan through all
of them looking for a live value to return.  However, that didn't fix it
for the case of many recently-dead tuples at the end of the index;
SnapshotDirty recognizes those as committed dead and so we're back to
the same problem.

To improve the situation, invent a "SnapshotNonVacuumable" snapshot type
and use that instead.  The reason this helps is that, if the snapshot
rejects a given index entry, we know that the indexscan will mark that
index entry as killed.  This means the next get_actual_variable_range()
scan will proceed past that entry without visiting the heap, making the
scan a lot faster.  We may end up accepting a recently-dead tuple as
being the estimated extremal value, but that doesn't seem much worse than
the compromise we made before to accept not-yet-committed extremal values.

The cost of the scan is still proportional to the number of dead index
entries at the end of the range, so in the interval after a mass delete
but before VACUUM's cleaned up the mess, it's still possible for
get_actual_variable_range() to take a noticeable amount of time, if you've
got enough such dead entries.  But the constant factor is much much better
than before, since all we need to do with each index entry is test its
"killed" bit.

We chose to back-patch commit fccebe421 at the time, but I'm hesitant to
do so here, because this form of the problem seems to affect many fewer
people.  Also, even when it happens, it's less bad than the case fixed
by commit fccebe421 because we don't get the contention effects from
expensive TransactionIdIsInProgress tests.

Dmitriy Sarafannikov, reviewed by Andrey Borodin

Discussion: https://postgr.es/m/05C72CF7-B5F6-4DB9-8A09-5AC897653113@yandex.ru
2017-09-07 19:41:51 -04:00
Peter Eisentraut 1356f78ea9 Reduce excessive dereferencing of function pointers
It is equivalent in ANSI C to write (*funcptr) () and funcptr().  These
two styles have been applied inconsistently.  After discussion, we'll
use the more verbose style for plain function pointer variables, to make
it clear that it's a variable, and the shorter style when the function
pointer is in a struct (s.func() or s->func()), because then it's clear
that it's not a plain function name, and otherwise the excessive
punctuation makes some of those invocations hard to read.

Discussion: https://www.postgresql.org/message-id/f52c16db-14ed-757d-4b48-7ef360b1631d@2ndquadrant.com
2017-09-07 13:56:09 -04:00
Tom Lane 6eb52da394 Fix handling of savepoint commands within multi-statement Query strings.
Issuing a savepoint-related command in a Query message that contains
multiple SQL statements led to a FATAL exit with a complaint about
"unexpected state STARTED".  This is a shortcoming of commit 4f896dac1,
which attempted to prevent such misbehaviors in multi-statement strings;
its quick hack of marking the individual statements as "not top-level"
does the wrong thing in this case, and isn't a very accurate description
of the situation anyway.

To fix, let's introduce into xact.c an explicit model of what happens for
multi-statement Query strings.  This is an "implicit transaction block
in progress" state, which for many purposes works like the normal
TBLOCK_INPROGRESS state --- in particular, IsTransactionBlock returns true,
causing the desired result that PreventTransactionChain will throw error.
But in case of error abort it works like TBLOCK_STARTED, allowing the
transaction to be cancelled without need for an explicit ROLLBACK command.

Commit 4f896dac1 is reverted in toto, so that we go back to treating the
individual statements as "top level".  We could have left it as-is, but
this allows sharpening the error message for PreventTransactionChain
calls inside functions.

Except for getting a normal error instead of a FATAL exit for savepoint
commands, this patch should result in no user-visible behavioral change
(other than that one error message rewording).  There are some things
we might want to do in the line of changing the appearance or wording of
error and warning messages around this behavior, which would be much
simpler to do now that it's an explicitly modeled state.  But I haven't
done them here.

Although this fixes a long-standing bug, no backpatch.  The consequences
of the bug don't seem severe enough to justify the risk that this commit
itself creates some new issue.

Patch by me, but it owes something to previous investigation by
Takayuki Tsunakawa, who also reported the bug in the first place.
Also thanks to Michael Paquier for reviewing.

Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F6BE40D@G01JPEXMBYT05
2017-09-07 09:49:55 -04:00
Simon Riggs f06588a8e6 Exclude special values in recovery_target_time
recovery_target_time accepts timestamp input, though
does not allow use of special values, e.g. “today”.
Report a useful error message for these cases.

Reported-by: Piotr Stefaniak
Author: Simon Riggs
Discussion: https://postgr.es/m/CANP8+jJdKA+BkkYLWz9zAm16Y0s2ExBv0WfpAwXdTpPfWnA9Bg@mail.gmail.com
2017-09-07 04:56:34 -07:00
Robert Haas baaf272ac9 Use group updates when setting transaction status in clog.
Commit 0e141c0fbb introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around.  A previous attempt to introduce a similar
mechanism for CLogControlLock in ccce90b398
crashed and burned, but the design problem which resulted in those
failures is believed to have been corrected in this version.

Amit Kapila, with some cosmetic changes by me.  See the previous commit
message for additional credits.

Discussion: http://postgr.es/m/CAA4eK1KudxzgWhuywY_X=yeSAhJMT4DwCjroV5Ay60xaeB2Eew@mail.gmail.com
2017-09-01 11:45:40 -04:00
Robert Haas 81c5e46c49 Introduce 64-bit hash functions with a 64-bit seed.
This will be useful for hash partitioning, which needs a way to seed
the hash functions to avoid problems such as a hash index on a hash
partitioned table clumping all values into a small portion of the
bucket space; it's also useful for anything that wants a 64-bit hash
value rather than a 32-bit hash value.

Just in case somebody wants a 64-bit hash value that is compatible
with the existing 32-bit hash values, make the low 32-bits of the
64-bit hash value match the 32-bit hash value when the seed is 0.

Robert Haas and Amul Sul

Discussion: http://postgr.es/m/CA+Tgmoafx2yoJuhCQQOL5CocEi-w_uG4S2xT0EtgiJnPGcHW3g@mail.gmail.com
2017-08-31 22:21:21 -04:00
Tom Lane 6708e447ef Clean up shm_mq cleanup.
The logic around shm_mq_detach was a few bricks shy of a load, because
(contrary to the comments for shm_mq_attach) all it did was update the
shared shm_mq state.  That left us leaking a bit of process-local
memory, but much worse, the on_dsm_detach callback for shm_mq_detach
was still armed.  That means that whenever we ultimately detach from
the DSM segment, we'd run shm_mq_detach again for already-detached,
possibly long-dead queues.  This accidentally fails to fail today,
because we only ever re-use a shm_mq's memory for another shm_mq, and
multiple detach attempts on the last such shm_mq are fairly harmless.
But it's gonna bite us someday, so let's clean it up.

To do that, change shm_mq_detach's API so it takes a shm_mq_handle
not the underlying shm_mq.  This makes the callers simpler in most
cases anyway.  Also fix a few places in parallel.c that were just
pfree'ing the handle structs rather than doing proper cleanup.

Back-patch to v10 because of the risk that the revenant shm_mq_detach
callbacks would cause a live bug sometime.  Since this is an API
change, it's too late to do it in 9.6.  (We could make a variant
patch that preserves API, but I'm not excited enough to do that.)

Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
2017-08-31 15:10:24 -04:00
Tom Lane 41b0dd987d Separate reinitialization of shared parallel-scan state from ExecReScan.
Previously, the parallel executor logic did reinitialization of shared
state within the ExecReScan code for parallel-aware scan nodes.  This is
problematic, because it means that the ExecReScan call has to occur
synchronously (ie, during the parent Gather node's ReScan call).  That is
swimming very much against the tide so far as the ExecReScan machinery is
concerned; the fact that it works at all today depends on a lot of fragile
assumptions, such as that no plan node between Gather and a parallel-aware
scan node is parameterized.  Another objection is that because ExecReScan
might be called in workers as well as the leader, hacky extra tests are
needed in some places to prevent unwanted shared-state resets.

Hence, let's separate this code into two functions, a ReInitializeDSM
call and the ReScan call proper.  ReInitializeDSM is called only in
the leader and is guaranteed to run before we start new workers.
ReScan is returned to its traditional function of resetting only local
state, which means that ExecReScan's usual habits of delaying or
eliminating child rescan calls are safe again.

As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary
to make these changes in 9.6, which is a good thing because the FDW and
CustomScan APIs are impacted.

Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
2017-08-30 13:18:16 -04:00
Andres Freund 35ea75632a Refactor typcache.c's record typmod hash table.
Previously, tuple descriptors were stored in chains keyed by a fixed size
array of OIDs.  That meant there were effectively two levels of collision
chain -- one inside and one outside the hash table.  Instead, let dynahash.c
look after conflicts for us by supplying a proper hash and equal function
pair.

This is a nice cleanup on its own, but also simplifies followup
changes allowing blessed TupleDescs to be shared between backends
participating in parallel query.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm%3D34GVhOL%2BarUx56yx7OPk7%3DqpGsv3CpO54feqjAwQKm5g%40mail.gmail.com
2017-08-22 16:11:54 -07:00
Andres Freund c6293249dc Partially flatten struct tupleDesc so that it can be used in DSM.
TupleDesc's attributes were already stored in contiguous memory after the
struct.  Go one step further and get rid of the array of pointers to
attributes so that they can be stored in shared memory mapped at different
addresses in each backend.  This won't work for TupleDescs with contraints
and defaults, since those point to other objects, but for many purposes
only attributes are needed.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-08-20 11:19:12 -07:00
Andres Freund 2cd7084524 Change tupledesc->attrs[n] to TupleDescAttr(tupledesc, n).
This is a mechanical change in preparation for a later commit that
will change the layout of TupleDesc.  Introducing a macro to abstract
the details of where attributes are stored will allow us to change
that in separate step and revise it in future.

Author: Thomas Munro, editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-08-20 11:19:07 -07:00
Heikki Linnakangas dcd052c8d2 Fix pg_atomic_u64 initialization.
As Andres pointed out, pg_atomic_init_u64 must be used to initialize an
atomic variable, before it can be accessed with the actual atomic ops.
Trying to use pg_atomic_write_u64 on an uninitialized variable leads to a
failure with the fallback implementation that uses a spinlock.

Discussion: https://www.postgresql.org/message-id/20170816191346.d3ke5tpshhco4bnd%40alap3.anarazel.de
2017-08-17 00:48:44 +03:00
Heikki Linnakangas 3cda10f41b Use atomic ops to hand out pages to scan in parallel scan.
With a lot of CPUs, the spinlock that protects the current scan location
in a parallel scan can become a bottleneck. Use an atomic fetch-and-add
instruction instead.

David Rowley

Discussion: https://www.postgresql.org/message-id/CAKJS1f9tgsPhqBcoPjv9_KUPZvTLCZ4jy%3DB%3DbhqgaKn7cYzm-w@mail.gmail.com
2017-08-16 16:18:41 +03:00
Heikki Linnakangas 0c504a80cf Remove dedicated B-tree root-split record types.
Since commit 40dae7ec53, which changed the way b-tree page splitting
works, there has been no difference in the handling of root, and non-root
split WAL records. We don't need to distinguish them anymore

If you're worried about the loss of debugging information, note that
usually a root split record will normally be followed by a WAL record to
create the new root page. The root page will also have the BTP_ROOT flag
set on the page itself, and there is a pointer to it from the metapage.

Author: Aleksander Alekseev
Discussion: https://www.postgresql.org/message-id/20170406122116.GA11081@e733.localdomain
2017-08-16 12:24:40 +03:00
Tom Lane 21d304dfed Final pgindent + perltidy run for v10. 2017-08-14 17:29:33 -04:00
Tom Lane 5b6289c1e0 Handle elog(FATAL) during ROLLBACK more robustly.
Stress testing by Andreas Seltenreich disclosed longstanding problems that
occur if a FATAL exit (e.g. due to receipt of SIGTERM) occurs while we are
trying to execute a ROLLBACK of an already-failed transaction.  In such a
case, xact.c is in TBLOCK_ABORT state, so that AbortOutOfAnyTransaction
would skip AbortTransaction and go straight to CleanupTransaction.  This
led to an assert failure in an assert-enabled build (due to the ROLLBACK's
portal still having a cleanup hook) or without assertions, to a FATAL exit
complaining about "cannot drop active portal".  The latter's not
disastrous, perhaps, but it's messy enough to want to improve it.

We don't really want to run all of AbortTransaction in this code path.
The minimum required to clean up the open portal safely is to do
AtAbort_Memory and AtAbort_Portals.  It seems like a good idea to
do AtAbort_Memory unconditionally, to be entirely sure that we are
starting with a safe CurrentMemoryContext.  That means that if the
main loop in AbortOutOfAnyTransaction does nothing, we need an extra
step at the bottom to restore CurrentMemoryContext = TopMemoryContext,
which I chose to do by invoking AtCleanup_Memory.  This'll result in
calling AtCleanup_Memory twice in many of the paths through this function,
but that seems harmless and reasonably inexpensive.

The original motivation for the assertion in AtCleanup_Portals was that
we wanted to be sure that any user-defined code executed as a consequence
of the cleanup hook runs during AbortTransaction not CleanupTransaction.
That still seems like a valid concern, and now that we've seen one case
of the assertion firing --- which means that exactly that would have
happened in a production build --- let's replace the Assert with a runtime
check.  If we see the cleanup hook still set, we'll emit a WARNING and
just drop the hook unexecuted.

This has been like this a long time, so back-patch to all supported
branches.

Discussion: https://postgr.es/m/877ey7bmun.fsf@ansel.ydns.eu
2017-08-14 15:43:20 -04:00
Tom Lane 004a9702e0 Remove AtEOXact_CatCache().
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1.  We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds.  There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.

However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.

There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run.  That's pretty scary though: it could easily create
more problems than it solves.  Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.

There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing.  We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it.  It hasn't found any bugs for us in many years.

Per report from Jeevan Chalke.  Back-patch to all supported branches.

Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
2017-08-13 16:15:14 -04:00
Peter Eisentraut a1ef920e27 Remove uses of "slave" in replication contexts
This affects mostly code comments, some documentation, and tests.
Official APIs already used "standby".
2017-08-10 22:55:41 -04:00
Robert Haas ec99dd5aee Remove incorrect assertion in clog.c
We must advance the oldest XID that can be safely looked up in clog
*before* truncating CLOG, and the oldest XID that can't be reused
*after* truncating CLOG.  This assertion, and the accompanying
comment, are confused; remove them.

Reported by Neha Sharma.

Discussion: http://postgr.es/m/CANiYTQumC3T=UMBMd1Hor=5XWZYuCEQBioL3ug0YtNQCMMT5wQ@mail.gmail.com
2017-08-10 11:20:57 -04:00
Alvaro Herrera 77d2c00af7 Reword some unclear comments 2017-08-08 18:48:01 -04:00
Robert Haas 52f8a59dd9 Make pg_stop_backup's wait_for_archive flag work on standbys.
Previously, it had no effect.  Now, if archive_mode=always, it will
work, and if not, you'll get a warning.

Masahiko Sawada, Michael Paquier, and Robert Haas.  The patch as
submitted also changed the behavior so that we would write and remove
history files on standbys, but that seems like material for a separate
patch to me.

Discussion: http://postgr.es/m/CAD21AoC2Xw6M=ZJyejq_9d_iDkReC_=rpvQRw5QsyzKQdfYpkw@mail.gmail.com
2017-08-05 10:49:26 -04:00