Commit Graph

3546 Commits

Author SHA1 Message Date
Robert Haas 846fcc8516 Fix problems with the "role" GUC and parallel query.
Without this fix, dropping a role can sometimes result in parallel
query failures in sessions that have used "SET ROLE" to assume the
dropped role, even if that setting isn't active any more.

Report by Pavan Deolasee.  Patch by Amit Kapila, reviewed by me.

Discussion: http://postgr.es/m/CABOikdOomRcZsLsLK+Z+qENM1zxyaWnAvFh3MJZzZnnKiF+REg@mail.gmail.com
2017-10-29 12:58:40 +05:30
Andres Freund 31079a4a8e Replace remaining uses of pq_sendint with pq_sendint{8,16,32}.
pq_sendint() remains, so extension code doesn't unnecessarily break.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
2017-10-11 21:00:46 -07:00
Andres Freund 4c119fbcd4 Improve performance of SendRowDescriptionMessage.
There's three categories of changes leading to better performance:
- Splitting the per-attribute part of SendRowDescriptionMessage into a
  v2 and a v3 version allows avoiding branches for every attribute.
- Preallocating the size of the buffer to be big enough for all
  attributes and then using pq_write* avoids unnecessary buffer
  size checks & resizing.
- Reusing a persistently allocated StringInfo for all
  SendRowDescriptionMessage() invocations avoids repeated allocations
  & reallocations.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
2017-10-11 17:23:23 -07:00
Andres Freund f2dec34e19 Use one stringbuffer for all rows printed in printtup.c.
This avoids newly allocating, and then possibly growing, the
stringbuffer for every row. For wide rows this can substantially
reduce memory allocator overhead, at the price of not immediately
reducing memory usage after outputting an especially wide row.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
2017-10-11 16:26:35 -07:00
Tom Lane 5fa6b0d102 Remove unnecessary PG_TRY overhead for CurrentResourceOwner changes.
resowner/README contained advice to use a PG_TRY block to restore the
old CurrentResourceOwner value anywhere that that variable is transiently
changed.  That advice was only inconsistently followed, however, and
on reflection it seems like unnecessary overhead.  We don't bother
with such a convention for transient CurrentMemoryContext changes,
on the grounds that any (sub)transaction abort will start out by
resetting CurrentMemoryContext to what it wants.  But the same is
true of CurrentResourceOwner, so there seems no need to treat it
differently.

Hence, remove PG_TRY blocks that exist only to restore CurrentResourceOwner
before re-throwing the error.  There are a couple of places that restore
it along with some other actions, and I left those alone; the restore is
probably unnecessary but no noticeable gain will result from removing it.

Discussion: https://postgr.es/m/5236.1507583529@sss.pgh.pa.us
2017-10-11 17:44:09 -04:00
Tom Lane 6b87416c9a Fix access-off-end-of-array in clog.c.
Sloppy loop coding in set_status_by_pages() resulted in fetching one array
element more than it should from the subxids[] array.  The odds of this
resulting in SIGSEGV are pretty small, but we've certainly seen that happen
with similar mistakes elsewhere.  While at it, we can get rid of an extra
TransactionIdToPage() calculation per loop.

Per report from David Binderman.  Back-patch to all supported branches,
since this code is quite old.

Discussion: https://postgr.es/m/HE1PR0802MB2331CBA919CBFFF0C465EB429C710@HE1PR0802MB2331.eurprd08.prod.outlook.com
2017-10-06 12:20:23 -04:00
Alvaro Herrera a5736bf754 Fix traversal of half-frozen update chains
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken.  This
breaks HOT (and probably other things).  A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed.  This can be seen because
index creation (or REINDEX) complain like
  ERROR:  XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"

Because of relfrozenxid contraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.

This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.

Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax.  But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best.  Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.

I didn't optimize the new function for performance.  The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.

This is a followup after 20b6552242 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.

Reported-by: Peter Geoghegan
Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
	Yi Wen Wong, Álvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
2017-10-06 17:20:01 +02:00
Tom Lane fe9ba28ee8 Fix typo in README.
s/BeginInternalSubtransaction/BeginInternalSubTransaction/
2017-10-05 15:06:01 -04:00
Robert Haas e9baa5e9fa Allow DML commands that create tables to use parallel query.
Haribabu Kommi, reviewed by Dilip Kumar and Rafia Sabih.  Various
cosmetic changes by me to explain why this appears to be safe but
allowing inserts in parallel mode in general wouldn't be.  Also, I
removed the REFRESH MATERIALIZED VIEW case from Haribabu's patch,
since I'm not convinced that case is OK, and hacked on the
documentation somewhat.

Discussion: http://postgr.es/m/CAJrrPGdo5bak6qnPWe8Kpi8g_jfQEs-G4SYmG9y+OFaw2-dPvA@mail.gmail.com
2017-10-05 11:40:48 -04:00
Peter Eisentraut 5373bc2a08 Add background worker type
Add bgw_type field to background worker structure.  It is intended to be
set to the same value for all workers of the same type, so they can be
grouped in pg_stat_activity, for example.

The backend_type column in pg_stat_activity now shows bgw_type for a
background worker.  The ps listing also no longer calls out that a
process is a background worker but just show the bgw_type.  That way,
being a background worker is more of an implementation detail now that
is not shown to the user.  However, most log messages still refer to
'background worker "%s"'; otherwise constructing sensible and
translatable log messages would become tricky.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
2017-09-29 11:08:24 -04:00
Alvaro Herrera 20b6552242 Fix freezing of a dead HOT-updated tuple
Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples.  But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.

(This is especially bad if there's any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)

One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it.  But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.

Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead).  In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.

An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer.  It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it be good to have a more reliable way to test this more
fully, but it'd require more work.
In message https://postgr.es/m/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.

Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling.  In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.

Reported-by: Daniel Wood
Diagnosed-by: Daniel Wood
Reviewed-by: Yi Wen Wong, Michaël Paquier
Discussion: https://postgr.es/m/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com
2017-09-28 16:44:01 +02:00
Tom Lane 28e0727076 Revert to 9.6 treatment of ALTER TYPE enumtype ADD VALUE.
This reverts commit 15bc038f9, along with the followon commits 1635e80d3
and 984c92074 that tried to clean up the problems exposed by bug #14825.
The result was incomplete because it failed to address parallel-query
requirements.  With 10.0 release so close upon us, now does not seem like
the time to be adding more code to fix that.  I hope we can un-revert this
code and add the missing parallel query support during the v11 cycle.

Back-patch to v10.

Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
2017-09-27 16:14:43 -04:00
Tom Lane 1635e80d30 Use a blacklist to distinguish original from add-on enum values.
Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside
transaction blocks, by disallowing the use of the added value later
in the same transaction, except under limited circumstances.  However,
the test for "limited circumstances" was heuristic and could reject
references to enum values that were created during CREATE TYPE AS ENUM,
not just later.  This breaks the use-case of restoring pg_dump scripts
in a single transaction, as reported in bug #14825 from Balazs Szilfai.

We can improve this by keeping a "blacklist" table of enum value OIDs
created by ALTER TYPE ADD VALUE during the current transaction.  Any
visible-but-uncommitted value whose OID is not in the blacklist must
have been created by CREATE TYPE AS ENUM, and can be used safely
because it could not have a lifespan shorter than its parent enum type.

This change also removes the restriction that a renamed enum value
can't be used before being committed (unless it was on the blacklist).

Andrew Dunstan, with cosmetic improvements by me.
Back-patch to v10.

Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
2017-09-26 13:14:46 -04:00
Robert Haas 22c5e73562 Remove lsn from HashScanPosData.
This was intended as infrastructure for weakening VACUUM's locking
requirements, similar to what was done for btree indexes in commit
2ed5b87f96.  However, for hash indexes,
it seems that the improvements which are possible are actually
extremely marginal.  Furthermore, performing the LSN cross-check will
end up skipping cleanup far more often than is necessary; we only care
about page modifications due to a VACUUM, but the LSN check will fail
if ANY modification has occurred.  So, rather than pressing forward
with that "optimization", just rip the LSN field out.

Patch by me, reviewed by Ashutosh Sharma and Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com
2017-09-26 09:16:45 -04:00
Robert Haas 79a4a665c0 Fix trivial mistake in README.
You might think I (Robert) could manage to count to five without
messing it up, but if you did, you would be wrong.

Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com
2017-09-26 09:01:43 -04:00
Peter Eisentraut 0c5803b450 Refactor new file permission handling
The file handling functions from fd.c were called with a diverse mix of
notations for the file permissions when they were opening new files.
Almost all files created by the server should have the same permissions
set.  So change the API so that e.g. OpenTransientFile() automatically
uses the standard permissions set, and OpenTransientFilePerm() is a new
function that takes an explicit permissions set for the few cases where
it is needed.  This also saves an unnecessary argument for call sites
that are just opening an existing file.

While we're reviewing these APIs, get rid of the FileName typedef and
use the standard const char * for the file name and mode_t for the file
mode.  This makes these functions match other file handling functions
and removes an unnecessary layer of mysteriousness.  We can also get rid
of a few casts that way.

Author: David Steele <david@pgmasters.net>
2017-09-23 10:16:18 -04:00
Robert Haas 6a2fa09c0c For wal_consistency_checking, mask page checksum as well as page LSN.
If the LSN is different, the checksum will be different, too.

Ashwin Agrawal, reviewed by Michael Paquier and Kuntal Ghosh

Discussion: http://postgr.es/m/CALfoeis5iqrAU-+JAN+ZzXkpPr7+-0OAGv7QUHwFn=-wDy4o4Q@mail.gmail.com
2017-09-22 14:28:22 -04:00
Robert Haas 7c75ef5715 hash: Implement page-at-a-time scan.
Commit 09cb5c0e7d added a similar
optimization to btree back in 2006, but nobody bothered to implement
the same thing for hash indexes, probably because they weren't
WAL-logged and had lots of other performance problems as well.  As
with the corresponding btree case, this eliminates the problem of
potentially needing to refind our position within the page, and cuts
down on pin/unpin traffic as well.

Ashutosh Sharma, reviewed by Alexander Korotkov, Jesper Pedersen,
Amit Kapila, and me.  Some final edits to comments and README by
me.

Discussion: http://postgr.es/m/CAE9k0Pm3KTx93K8_5j6VMzG4h5F+SyknxUwXrN-zqSZ9X8ZS3w@mail.gmail.com
2017-09-22 13:56:27 -04:00
Andres Freund fc49e24fa6 Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.

But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.

This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured.  For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.

Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
    Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-19 22:03:48 -07:00
Tom Lane 2d484f9b05 Remove no-op GiST support functions in the core GiST opclasses.
The preceding patch allowed us to remove useless GiST support functions.
This patch actually does that for all the no-op cases in the core GiST
code.  This buys us whatever performance gain is to be had, and more
importantly exercises the preceding patch.

There remain no-op functions in the contrib GiST opclasses, but those
will take more work to remove.

Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com
2017-09-19 23:32:59 -04:00
Tom Lane d3a4f89d8a Allow no-op GiST support functions to be omitted.
There are common use-cases in which the compress and/or decompress
functions can be omitted, with the result being that we make no
data transformation when storing or retrieving index values.
Previously, you had to provide a no-op function anyway, but this
patch allows such opclass support functions to be omitted.

Furthermore, if the compress function is omitted, then the core code
knows that the stored representation is the same as the original data.
This means we can allow index-only scans without requiring a fetch
function to be provided either.  Previously you had to provide a
no-op fetch function if you wanted IOS to work.

This reportedly provides a small performance benefit in such cases,
but IMO the real reason for doing it is just to reduce the amount of
useless boilerplate code that has to be written for GiST opclasses.

Andrey Borodin, reviewed by Dmitriy Sarafannikov

Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com
2017-09-19 23:32:59 -04:00
Andres Freund ec9e05b3c3 Fix crash restart bug introduced in 8356753c21.
The bug was caused by not re-reading the control file during crash
recovery restarts, which lead to an attempt to pfree() shared memory
contents. The fix is to re-read the control file, which seems good
anyway.

It's unclear as of this moment, whether we want to keep the
refactoring introduced in the commit referenced above, or come up with
an alternative approach. But fixing the bug in the mean time seems
like a good idea regardless.

A followup commit will introduce regression test coverage for crash
restarts.

Reported-By: Tom Lane
Discussion: https://postgr.es/m/14134.1505572349@sss.pgh.pa.us
2017-09-18 17:25:49 -07:00
Tom Lane eb5c404b17 Minor code-cleanliness improvements for btree.
Make the btree page-flags test macros (P_ISLEAF and friends) return clean
boolean values, rather than values that might not fit in a bool.  Use them
in a few places that were randomly referencing the flag bits directly.

In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
BT_READ.  (Some think we should go the other way, but as long as we have
BT_READ/BT_WRITE, let's use them consistently.)

Masahiko Sawada, reviewed by Doug Doole

Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com
2017-09-18 16:36:28 -04:00
Andres Freund cc5f81366c Add support for coordinating record typmods among parallel workers.
Tuples can have type RECORDOID and a typmod number that identifies a blessed
TupleDesc in a backend-private cache.  To support the sharing of such tuples
through shared memory and temporary files, provide a typmod registry in
shared memory.

To achieve that, introduce per-session DSM segments, created on demand when a
backend first runs a parallel query.  The per-session DSM segment has a
table-of-contents just like the per-query DSM segment, and initially the
contents are a shared record typmod registry and a DSA area to provide the
space it needs to grow.

State relating to the current session is accessed via a Session object
reached through global variable CurrentSession that may require significant
redesign further down the road as we figure out what else needs to be shared
or remodelled.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-09-14 19:59:21 -07:00
Andres Freund 8356753c21 Perform only one ReadControlFile() during startup.
Previously we read the control file in multiple places. But soon the
segment size will be configurable and stored in the control file, and
that needs to be available earlier than it currently is needed.

Instead of adding yet another place where it's read, refactor things
so there's a single processing of the control file during startup (in
EXEC_BACKEND that's every individual backend's startup).

Author: Andres Freund
Discussion: http://postgr.es/m/20170913092828.aozd3gvvmw67gmyc@alap3.anarazel.de
2017-09-14 14:14:34 -07:00
Peter Eisentraut 821fb8cdbf Message style fixes 2017-09-11 11:21:27 -04:00
Tom Lane 3ca930fc39 Improve performance of get_actual_variable_range with recently-dead tuples.
In commit fccebe421, we hacked get_actual_variable_range() to scan the
index with SnapshotDirty, so that if there are many uncommitted tuples
at the end of the index range, it wouldn't laboriously scan through all
of them looking for a live value to return.  However, that didn't fix it
for the case of many recently-dead tuples at the end of the index;
SnapshotDirty recognizes those as committed dead and so we're back to
the same problem.

To improve the situation, invent a "SnapshotNonVacuumable" snapshot type
and use that instead.  The reason this helps is that, if the snapshot
rejects a given index entry, we know that the indexscan will mark that
index entry as killed.  This means the next get_actual_variable_range()
scan will proceed past that entry without visiting the heap, making the
scan a lot faster.  We may end up accepting a recently-dead tuple as
being the estimated extremal value, but that doesn't seem much worse than
the compromise we made before to accept not-yet-committed extremal values.

The cost of the scan is still proportional to the number of dead index
entries at the end of the range, so in the interval after a mass delete
but before VACUUM's cleaned up the mess, it's still possible for
get_actual_variable_range() to take a noticeable amount of time, if you've
got enough such dead entries.  But the constant factor is much much better
than before, since all we need to do with each index entry is test its
"killed" bit.

We chose to back-patch commit fccebe421 at the time, but I'm hesitant to
do so here, because this form of the problem seems to affect many fewer
people.  Also, even when it happens, it's less bad than the case fixed
by commit fccebe421 because we don't get the contention effects from
expensive TransactionIdIsInProgress tests.

Dmitriy Sarafannikov, reviewed by Andrey Borodin

Discussion: https://postgr.es/m/05C72CF7-B5F6-4DB9-8A09-5AC897653113@yandex.ru
2017-09-07 19:41:51 -04:00
Peter Eisentraut 1356f78ea9 Reduce excessive dereferencing of function pointers
It is equivalent in ANSI C to write (*funcptr) () and funcptr().  These
two styles have been applied inconsistently.  After discussion, we'll
use the more verbose style for plain function pointer variables, to make
it clear that it's a variable, and the shorter style when the function
pointer is in a struct (s.func() or s->func()), because then it's clear
that it's not a plain function name, and otherwise the excessive
punctuation makes some of those invocations hard to read.

Discussion: https://www.postgresql.org/message-id/f52c16db-14ed-757d-4b48-7ef360b1631d@2ndquadrant.com
2017-09-07 13:56:09 -04:00
Tom Lane 6eb52da394 Fix handling of savepoint commands within multi-statement Query strings.
Issuing a savepoint-related command in a Query message that contains
multiple SQL statements led to a FATAL exit with a complaint about
"unexpected state STARTED".  This is a shortcoming of commit 4f896dac1,
which attempted to prevent such misbehaviors in multi-statement strings;
its quick hack of marking the individual statements as "not top-level"
does the wrong thing in this case, and isn't a very accurate description
of the situation anyway.

To fix, let's introduce into xact.c an explicit model of what happens for
multi-statement Query strings.  This is an "implicit transaction block
in progress" state, which for many purposes works like the normal
TBLOCK_INPROGRESS state --- in particular, IsTransactionBlock returns true,
causing the desired result that PreventTransactionChain will throw error.
But in case of error abort it works like TBLOCK_STARTED, allowing the
transaction to be cancelled without need for an explicit ROLLBACK command.

Commit 4f896dac1 is reverted in toto, so that we go back to treating the
individual statements as "top level".  We could have left it as-is, but
this allows sharpening the error message for PreventTransactionChain
calls inside functions.

Except for getting a normal error instead of a FATAL exit for savepoint
commands, this patch should result in no user-visible behavioral change
(other than that one error message rewording).  There are some things
we might want to do in the line of changing the appearance or wording of
error and warning messages around this behavior, which would be much
simpler to do now that it's an explicitly modeled state.  But I haven't
done them here.

Although this fixes a long-standing bug, no backpatch.  The consequences
of the bug don't seem severe enough to justify the risk that this commit
itself creates some new issue.

Patch by me, but it owes something to previous investigation by
Takayuki Tsunakawa, who also reported the bug in the first place.
Also thanks to Michael Paquier for reviewing.

Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F6BE40D@G01JPEXMBYT05
2017-09-07 09:49:55 -04:00
Simon Riggs f06588a8e6 Exclude special values in recovery_target_time
recovery_target_time accepts timestamp input, though
does not allow use of special values, e.g. “today”.
Report a useful error message for these cases.

Reported-by: Piotr Stefaniak
Author: Simon Riggs
Discussion: https://postgr.es/m/CANP8+jJdKA+BkkYLWz9zAm16Y0s2ExBv0WfpAwXdTpPfWnA9Bg@mail.gmail.com
2017-09-07 04:56:34 -07:00
Robert Haas baaf272ac9 Use group updates when setting transaction status in clog.
Commit 0e141c0fbb introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around.  A previous attempt to introduce a similar
mechanism for CLogControlLock in ccce90b398
crashed and burned, but the design problem which resulted in those
failures is believed to have been corrected in this version.

Amit Kapila, with some cosmetic changes by me.  See the previous commit
message for additional credits.

Discussion: http://postgr.es/m/CAA4eK1KudxzgWhuywY_X=yeSAhJMT4DwCjroV5Ay60xaeB2Eew@mail.gmail.com
2017-09-01 11:45:40 -04:00
Robert Haas 81c5e46c49 Introduce 64-bit hash functions with a 64-bit seed.
This will be useful for hash partitioning, which needs a way to seed
the hash functions to avoid problems such as a hash index on a hash
partitioned table clumping all values into a small portion of the
bucket space; it's also useful for anything that wants a 64-bit hash
value rather than a 32-bit hash value.

Just in case somebody wants a 64-bit hash value that is compatible
with the existing 32-bit hash values, make the low 32-bits of the
64-bit hash value match the 32-bit hash value when the seed is 0.

Robert Haas and Amul Sul

Discussion: http://postgr.es/m/CA+Tgmoafx2yoJuhCQQOL5CocEi-w_uG4S2xT0EtgiJnPGcHW3g@mail.gmail.com
2017-08-31 22:21:21 -04:00
Tom Lane 6708e447ef Clean up shm_mq cleanup.
The logic around shm_mq_detach was a few bricks shy of a load, because
(contrary to the comments for shm_mq_attach) all it did was update the
shared shm_mq state.  That left us leaking a bit of process-local
memory, but much worse, the on_dsm_detach callback for shm_mq_detach
was still armed.  That means that whenever we ultimately detach from
the DSM segment, we'd run shm_mq_detach again for already-detached,
possibly long-dead queues.  This accidentally fails to fail today,
because we only ever re-use a shm_mq's memory for another shm_mq, and
multiple detach attempts on the last such shm_mq are fairly harmless.
But it's gonna bite us someday, so let's clean it up.

To do that, change shm_mq_detach's API so it takes a shm_mq_handle
not the underlying shm_mq.  This makes the callers simpler in most
cases anyway.  Also fix a few places in parallel.c that were just
pfree'ing the handle structs rather than doing proper cleanup.

Back-patch to v10 because of the risk that the revenant shm_mq_detach
callbacks would cause a live bug sometime.  Since this is an API
change, it's too late to do it in 9.6.  (We could make a variant
patch that preserves API, but I'm not excited enough to do that.)

Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
2017-08-31 15:10:24 -04:00
Tom Lane 41b0dd987d Separate reinitialization of shared parallel-scan state from ExecReScan.
Previously, the parallel executor logic did reinitialization of shared
state within the ExecReScan code for parallel-aware scan nodes.  This is
problematic, because it means that the ExecReScan call has to occur
synchronously (ie, during the parent Gather node's ReScan call).  That is
swimming very much against the tide so far as the ExecReScan machinery is
concerned; the fact that it works at all today depends on a lot of fragile
assumptions, such as that no plan node between Gather and a parallel-aware
scan node is parameterized.  Another objection is that because ExecReScan
might be called in workers as well as the leader, hacky extra tests are
needed in some places to prevent unwanted shared-state resets.

Hence, let's separate this code into two functions, a ReInitializeDSM
call and the ReScan call proper.  ReInitializeDSM is called only in
the leader and is guaranteed to run before we start new workers.
ReScan is returned to its traditional function of resetting only local
state, which means that ExecReScan's usual habits of delaying or
eliminating child rescan calls are safe again.

As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary
to make these changes in 9.6, which is a good thing because the FDW and
CustomScan APIs are impacted.

Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
2017-08-30 13:18:16 -04:00
Andres Freund 35ea75632a Refactor typcache.c's record typmod hash table.
Previously, tuple descriptors were stored in chains keyed by a fixed size
array of OIDs.  That meant there were effectively two levels of collision
chain -- one inside and one outside the hash table.  Instead, let dynahash.c
look after conflicts for us by supplying a proper hash and equal function
pair.

This is a nice cleanup on its own, but also simplifies followup
changes allowing blessed TupleDescs to be shared between backends
participating in parallel query.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm%3D34GVhOL%2BarUx56yx7OPk7%3DqpGsv3CpO54feqjAwQKm5g%40mail.gmail.com
2017-08-22 16:11:54 -07:00
Andres Freund c6293249dc Partially flatten struct tupleDesc so that it can be used in DSM.
TupleDesc's attributes were already stored in contiguous memory after the
struct.  Go one step further and get rid of the array of pointers to
attributes so that they can be stored in shared memory mapped at different
addresses in each backend.  This won't work for TupleDescs with contraints
and defaults, since those point to other objects, but for many purposes
only attributes are needed.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-08-20 11:19:12 -07:00
Andres Freund 2cd7084524 Change tupledesc->attrs[n] to TupleDescAttr(tupledesc, n).
This is a mechanical change in preparation for a later commit that
will change the layout of TupleDesc.  Introducing a macro to abstract
the details of where attributes are stored will allow us to change
that in separate step and revise it in future.

Author: Thomas Munro, editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
2017-08-20 11:19:07 -07:00
Heikki Linnakangas dcd052c8d2 Fix pg_atomic_u64 initialization.
As Andres pointed out, pg_atomic_init_u64 must be used to initialize an
atomic variable, before it can be accessed with the actual atomic ops.
Trying to use pg_atomic_write_u64 on an uninitialized variable leads to a
failure with the fallback implementation that uses a spinlock.

Discussion: https://www.postgresql.org/message-id/20170816191346.d3ke5tpshhco4bnd%40alap3.anarazel.de
2017-08-17 00:48:44 +03:00
Heikki Linnakangas 3cda10f41b Use atomic ops to hand out pages to scan in parallel scan.
With a lot of CPUs, the spinlock that protects the current scan location
in a parallel scan can become a bottleneck. Use an atomic fetch-and-add
instruction instead.

David Rowley

Discussion: https://www.postgresql.org/message-id/CAKJS1f9tgsPhqBcoPjv9_KUPZvTLCZ4jy%3DB%3DbhqgaKn7cYzm-w@mail.gmail.com
2017-08-16 16:18:41 +03:00
Heikki Linnakangas 0c504a80cf Remove dedicated B-tree root-split record types.
Since commit 40dae7ec53, which changed the way b-tree page splitting
works, there has been no difference in the handling of root, and non-root
split WAL records. We don't need to distinguish them anymore

If you're worried about the loss of debugging information, note that
usually a root split record will normally be followed by a WAL record to
create the new root page. The root page will also have the BTP_ROOT flag
set on the page itself, and there is a pointer to it from the metapage.

Author: Aleksander Alekseev
Discussion: https://www.postgresql.org/message-id/20170406122116.GA11081@e733.localdomain
2017-08-16 12:24:40 +03:00
Tom Lane 21d304dfed Final pgindent + perltidy run for v10. 2017-08-14 17:29:33 -04:00
Tom Lane 5b6289c1e0 Handle elog(FATAL) during ROLLBACK more robustly.
Stress testing by Andreas Seltenreich disclosed longstanding problems that
occur if a FATAL exit (e.g. due to receipt of SIGTERM) occurs while we are
trying to execute a ROLLBACK of an already-failed transaction.  In such a
case, xact.c is in TBLOCK_ABORT state, so that AbortOutOfAnyTransaction
would skip AbortTransaction and go straight to CleanupTransaction.  This
led to an assert failure in an assert-enabled build (due to the ROLLBACK's
portal still having a cleanup hook) or without assertions, to a FATAL exit
complaining about "cannot drop active portal".  The latter's not
disastrous, perhaps, but it's messy enough to want to improve it.

We don't really want to run all of AbortTransaction in this code path.
The minimum required to clean up the open portal safely is to do
AtAbort_Memory and AtAbort_Portals.  It seems like a good idea to
do AtAbort_Memory unconditionally, to be entirely sure that we are
starting with a safe CurrentMemoryContext.  That means that if the
main loop in AbortOutOfAnyTransaction does nothing, we need an extra
step at the bottom to restore CurrentMemoryContext = TopMemoryContext,
which I chose to do by invoking AtCleanup_Memory.  This'll result in
calling AtCleanup_Memory twice in many of the paths through this function,
but that seems harmless and reasonably inexpensive.

The original motivation for the assertion in AtCleanup_Portals was that
we wanted to be sure that any user-defined code executed as a consequence
of the cleanup hook runs during AbortTransaction not CleanupTransaction.
That still seems like a valid concern, and now that we've seen one case
of the assertion firing --- which means that exactly that would have
happened in a production build --- let's replace the Assert with a runtime
check.  If we see the cleanup hook still set, we'll emit a WARNING and
just drop the hook unexecuted.

This has been like this a long time, so back-patch to all supported
branches.

Discussion: https://postgr.es/m/877ey7bmun.fsf@ansel.ydns.eu
2017-08-14 15:43:20 -04:00
Tom Lane 004a9702e0 Remove AtEOXact_CatCache().
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1.  We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds.  There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.

However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.

There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run.  That's pretty scary though: it could easily create
more problems than it solves.  Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.

There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing.  We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it.  It hasn't found any bugs for us in many years.

Per report from Jeevan Chalke.  Back-patch to all supported branches.

Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
2017-08-13 16:15:14 -04:00
Peter Eisentraut a1ef920e27 Remove uses of "slave" in replication contexts
This affects mostly code comments, some documentation, and tests.
Official APIs already used "standby".
2017-08-10 22:55:41 -04:00
Robert Haas ec99dd5aee Remove incorrect assertion in clog.c
We must advance the oldest XID that can be safely looked up in clog
*before* truncating CLOG, and the oldest XID that can't be reused
*after* truncating CLOG.  This assertion, and the accompanying
comment, are confused; remove them.

Reported by Neha Sharma.

Discussion: http://postgr.es/m/CANiYTQumC3T=UMBMd1Hor=5XWZYuCEQBioL3ug0YtNQCMMT5wQ@mail.gmail.com
2017-08-10 11:20:57 -04:00
Alvaro Herrera 77d2c00af7 Reword some unclear comments 2017-08-08 18:48:01 -04:00
Robert Haas 52f8a59dd9 Make pg_stop_backup's wait_for_archive flag work on standbys.
Previously, it had no effect.  Now, if archive_mode=always, it will
work, and if not, you'll get a warning.

Masahiko Sawada, Michael Paquier, and Robert Haas.  The patch as
submitted also changed the behavior so that we would write and remove
history files on standbys, but that seems like material for a separate
patch to me.

Discussion: http://postgr.es/m/CAD21AoC2Xw6M=ZJyejq_9d_iDkReC_=rpvQRw5QsyzKQdfYpkw@mail.gmail.com
2017-08-05 10:49:26 -04:00
Peter Eisentraut 7e174fa793 Only kill sync workers at commit time in subscription DDL
This allows a transaction abort to avoid killing those workers.

Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
2017-08-04 21:17:47 -04:00
Robert Haas ff98a5e1e4 hash: Immediately after a bucket split, try to clean the old bucket.
If it works, then we won't be storing two copies of all the tuples
that were just moved.  If not, VACUUM will still take care of it
eventually.  Per a report from AP and analysis from Amit Kapila, it
seems that a bulk load can cause splits fast enough that VACUUM won't
deal with the problem in time to prevent bloat.

Amit Kapila; I rewrote the comment.

Discussion: http://postgr.es/m/20170704105728.mwb72jebfmok2nm2@zip.com.au
2017-08-04 19:33:01 -04:00
Alvaro Herrera 459c64d322 Fix concurrent locking of tuple update chain
If several sessions are concurrently locking a tuple update chain with
nonconflicting lock modes using an old snapshot, and they all succeed,
it may happen that some of them fail because of restarting the loop (due
to a concurrent Xmax change) and getting an error in the subsequent pass
while trying to obtain a tuple lock that they already have in some tuple
version.

This can only happen with very high concurrency (where a row is being
both updated and FK-checked by multiple transactions concurrently), but
it's been observed in the field and can have unpleasant consequences
such as an FK check failing to see a tuple that definitely exists:
    ERROR:  insert or update on table "child_table" violates foreign key constraint "fk_constraint_name"
    DETAIL:  Key (keyid)=(123456) is not present in table "parent_table".
(where the key is observably present in the table).

Discussion: https://postgr.es/m/20170714210011.r25mrff4nxjhmf3g@alvherre.pgsql
2017-07-26 17:24:16 -04:00
Teodor Sigaev 7e1fb4c59e Fix double shared memory allocation.
SLRU buffer lwlocks are allocated twice by oversight in commit
fe702a7b3f where that locks were moved to
separate tranche. The bug doesn't have user-visible effects except small
overspending of shared memory.

Backpatch to 9.6 where it was introduced.

Alexander Korotkov with small editorization by me.
2017-07-21 13:31:20 +03:00
Tom Lane 3cb29c42f9 Add static assertions about pg_control fitting into one disk sector.
When pg_control was first designed, sizeof(ControlFileData) was small
enough that a comment seemed like plenty to document the assumption that
it'd fit into one disk sector.  Now it's nearly 300 bytes, raising the
possibility that somebody would carelessly add enough stuff to create
a problem.  Let's add a StaticAssertStmt() to ensure that the situation
doesn't pass unnoticed if it ever occurs.

While at it, rename PG_CONTROL_SIZE to PG_CONTROL_FILE_SIZE to make it
clearer what that symbol means, and convert the existing runtime
comparisons of sizeof(ControlFileData) vs. PG_CONTROL_FILE_SIZE to be
static asserts --- we didn't have that technology when this code was
first written.

Discussion: https://postgr.es/m/9192.1500490591@sss.pgh.pa.us
2017-07-19 16:16:57 -04:00
Robert Haas 09c2e7cd2f hash: Fix write-ahead logging bugs related to init forks.
One, logging for CREATE INDEX was oblivious to the fact that when
an unlogged table is created, *only* operations on the init fork
should be logged.

Two, init fork buffers need to be flushed after they are written;
otherwise, a filesystem-level copy following recovery may do the
wrong thing.  (There may be a better fix for this issue than the
one used here, but this is transposed from the similar logic already
present in XLogReadBufferForRedoExtended, and a broader refactoring
after beta2 seems inadvisable.)

Amit Kapila, reviewed by Ashutosh Sharma, Kyotaro Horiguchi,
and Michael Paquier

Discussion: http://postgr.es/m/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w@mail.gmail.com
2017-07-17 12:03:35 -04:00
Teodor Sigaev 31b8db8e6c Fix potential data corruption during freeze
Fix oversight in 3b97e6823b bug fix. Bitwise AND is used instead of OR and
it cleans all bits in t_infomask heap tuple field.

Backpatch to 9.3
2017-07-06 17:18:55 +03:00
Peter Eisentraut da8f26ec4e Fix typo in comment
Author: Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
2017-06-30 14:48:43 -04:00
Alvaro Herrera 82c1507e30 Fix typo in comment
Once upon a time, WAL pointers could be NULL, but no longer.  We talk about
"valid" now.

Reported-by: Amit Langote
Discussion: https://postgr.es/m/33e9617d-27f1-eee8-3311-e27af98eaf2b@lab.ntt.co.jp
2017-06-22 16:42:38 -04:00
Andres Freund fb886c153b Fix possibility of creating a "phantom" segment after promotion.
When promoting a standby just after a XLOG_SWITCH record was replayed,
and next segment(s) are already are locally available (via walsender,
restore_command + trigger/recovery target), that segment could
accidentally be recycled onto the past of the new timeline.  Later
checkpointer would create a .ready file for it, assuming there was an
error during creation, and it would get archived.  That causes trouble
if another standby is later brought up from a basebackup from before
the timeline creation, because it would try to read the
segment, because XLogFileReadAnyTLI just tries all possible timelines,
which doesn't have valid contents.  Thus replay would fail.

The problem, if already occurred, can be fixed by removing the segment
and/or having restore_command filter it out.

The reason for the creation of such "phantom" segments was, that after
an XLOG_SWITCH record the EndOfLog variable points to the beginning of
the next segment, and RemoveXlogFile() used XLByteToPrevSeg().
Normally RemoveXlogFile() doing so is harmless, because the last
segment will still exist preventing InstallXLogFileSegment() from
causing harm, but just after promotion there's no previous segment on
the new timeline.

Fix that by using XLByteToSeg() instead of XLByteToPrevSeg().

Author: Andres Freund
Reported-By: Greg Burek
Discussion: https://postgr.es/m/20170619073026.zcwpe6mydsaz5ygd@alap3.anarazel.de
Backpatch: 9.2-, bug older than all supported versions
2017-06-21 14:14:45 -07:00
Tom Lane 382ceffdf7 Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.

By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis.  However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent.  That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.

This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:35:54 -04:00
Tom Lane c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
Tom Lane e3860ffa4d Initial pgindent run with pg_bsd_indent version 2.0.
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:

* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
  sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
  well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
  with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
  than the expected column 33.

On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list.  This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.

There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses.  I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 14:39:04 -04:00
Magnus Hagander bb1f8f9e5b Fix typos in comments
Author: Daniel Gustafsson <daniel@yesql.se>
2017-06-17 10:17:28 +02:00
Alvaro Herrera e90ceeaa49 Avoid bogus TwoPhaseState locking sequences
The optimized code in 728bd991c3 contains a few invalid locking
sequences.  To wit, the original code would try to acquire an lwlock
that it already holds.  Avoid this by moving lock acquisitions to
higher-level code, and install appropriate assertions in low-level that
the correct mode is held.

Authors: Michael Paquier, Álvaro Herrera
Reported-By: chuanting wang
Bug: #14680
Discussion: https://postgr.es/m/20170531033228.1487.10124@wrigleys.postgresql.org
2017-06-14 11:29:05 -04:00
Tom Lane 651902deb1 Re-run pgindent.
This is just to have a clean base state for testing of Piotr Stefaniak's
latest version of FreeBSD indent.  I fixed up a couple of places where
pgindent would have changed format not-nicely.  perltidy not included.

Discussion: https://postgr.es/m/VI1PR03MB119959F4B65F000CA7CD9F6BF2CC0@VI1PR03MB1199.eurprd03.prod.outlook.com
2017-06-13 13:05:59 -04:00
Peter Eisentraut e11e24b1ed Formatting improvements in config file samples 2017-06-09 14:38:33 -04:00
Peter Eisentraut 8c9387c55e Update code comments
Author: Neha Khatri <nehakhatri5@gmail.com>
2017-06-09 14:04:22 -04:00
Andres Freund 9206ced1dc Clean up latch related code.
The larger part of this patch replaces usages of MyProc->procLatch
with MyLatch.  The latter works even early during backend startup,
where MyProc->procLatch doesn't yet.  While the affected code
shouldn't run in cases where it's not initialized, it might get copied
into places where it might.  Using MyLatch is simpler and a bit faster
to boot, so there's little point to stick with the previous coding.

While doing so I noticed some weaknesses around newly introduced uses
of latches that could lead to missed events, and an omitted
CHECK_FOR_INTERRUPTS() call in worker_spi.

As all the actual bugs are in v10 code, there doesn't seem to be
sufficient reason to backpatch this.

Author: Andres Freund
Discussion:
    https://postgr.es/m/20170606195321.sjmenrfgl2nu6j63@alap3.anarazel.de
    https://postgr.es/m/20170606210405.sim3yl6vpudhmufo@alap3.anarazel.de
Backpatch: -
2017-06-06 16:13:00 -07:00
Andres Freund c6c3334364 Prevent possibility of panics during shutdown checkpoint.
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so.  At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits).  So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.

To fix this, divide the walsender shutdown into two phases.  First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders.  If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown.  Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.

Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
2017-06-05 19:18:15 -07:00
Andres Freund 703f148e98 Revert "Prevent panic during shutdown checkpoint"
This reverts commit 086221cf6b, which
was made to master only.

The approach implemented in the above commit has some issues.  While
those could easily be fixed incrementally, doing so would make
backpatching considerably harder, so instead first revert this patch.

Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
2017-06-05 19:18:15 -07:00
Tom Lane d466335064 Don't be so trusting that shm_toc_lookup() will always succeed.
Given the possibility of race conditions and so on, it seems entirely
unsafe to just assume that shm_toc_lookup() always finds the key it's
looking for --- but that was exactly what all but one call site were
doing.  To fix, add a "bool noError" argument, similarly to what we
have in many other functions, and throw an error on an unexpected
lookup failure.  Remove now-redundant Asserts that a rather random
subset of call sites had.

I doubt this will throw any light on buildfarm member lorikeet's
recent failures, because if an unnoticed lookup failure were involved,
you'd kind of expect a null-pointer-dereference crash rather than the
observed symptom.  But you never know ... and this is better coding
practice even if it never catches anything.

Discussion: https://postgr.es/m/9697.1496675981@sss.pgh.pa.us
2017-06-05 12:05:42 -04:00
Alvaro Herrera 55a70a023c Assorted translatable string fixes
Mark our rusage reportage string translatable; remove quotes from type
names; unify formatting of very similar messages.
2017-06-04 11:41:16 -04:00
Robert Haas 814573e6c4 Restore accidentally-removed line.
Commit 88e66d193f is to blame.

Masahiko Sawada

Discussion: http://postgr.es/m/CAD21AoAXeb7O4hgg+efs8JT_SxpR4doAH5c5s-Z5WoRLstBZJA@mail.gmail.com
2017-05-31 14:24:22 -04:00
Alvaro Herrera b4da9d0e1e brin: Don't crash on auto-summarization
We were trying to free a pointer into a shared buffer, which never
works; and we were failing to release the buffer lock appropriately.
Fix those omissions.

While at it, improve documentation for brinGetTupleForHeapBlock, the
inadequacy of which evidently caused these bugs in the first place.

Reported independently by Zhou Digoal (bug #14668) and Alexander Sosna.

Discussion: https://postgr.es/m/8c31c11b-6adb-228d-22c2-4ace89fc9209@credativ.de
Discussion: https://postgr.es/m/20170524063323.29941.46339@wrigleys.postgresql.org
2017-05-30 18:17:09 -04:00
Alvaro Herrera e6785a5ca1 Fix wording in amvalidate error messages
Remove some gratuituous message differences by making the AM name
previously embedded in each message be a %s instead.  While at it, get
rid of terminology that's unclear and unnecessary in one message.

Discussion: https://postgr.es/m/20170523001557.bq2hbq7hxyvyw62q@alvherre.pgsql
2017-05-30 15:45:42 -04:00
Magnus Hagander 312bac54cc Fix typo in comment
Author: Masahiko Sawada
2017-05-22 09:10:02 +02:00
Bruce Momjian a6fd7b7a5f Post-PG 10 beta1 pgindent run
perltidy run not included.
2017-05-17 16:31:56 -04:00
Peter Eisentraut d496a65790 Standardize "WAL location" terminology
Other previously used terms were "WAL position" or "log position".
2017-05-12 13:51:27 -04:00
Peter Eisentraut c1a7f64b4a Replace "transaction log" with "write-ahead log"
This makes documentation and error messages match the renaming of "xlog"
to "wal" in APIs and file naming.
2017-05-12 11:52:43 -04:00
Tom Lane d10c626de4 Rename WAL-related functions and views to use "lsn" not "location".
Per discussion, "location" is a rather vague term that could refer to
multiple concepts.  "LSN" is an unambiguous term for WAL locations and
should be preferred.  Some function names, view column names, and function
output argument names used "lsn" already, but others used "location",
as well as yet other terms such as "wal_position".  Since we've already
renamed a lot of things in this area from "xlog" to "wal" for v10,
we may as well incur a bit more compatibility pain and make these names
all consistent.

David Rowley, minor additional docs hacking by me

Discussion: https://postgr.es/m/CAKJS1f8O0njDKe8ePFQ-LK5-EjwThsDws6ohJ-+c6nWK+oUxtg@mail.gmail.com
2017-05-11 11:49:59 -04:00
Alvaro Herrera b66adb7b0c Revert "Permit dump/reload of not-too-large >1GB tuples"
This reverts commits fa2fa99552 and 42f50cb8fa.

While the functionality that was intended to be provided by these
commits is desired, the patch didn't actually solve as many of the
problematic situations as we hoped, and it created a bunch of its own
problems.  Since we're going to require more extensive changes soon for
other reasons and users have been working around these problems for a
long time already, there is no point in spending effort in fixing this
halfway measure.

Per complaint from Tom Lane.
Discussion: https://postgr.es/m/21407.1484606922@sss.pgh.pa.us

(Commit fa2fa99552 had already been reverted in branches 9.5 as
f858524ee4 and 9.6 as e9e44a0953, so this touches master only.
Commit 42f50cb8fa was not present in the older branches.)
2017-05-10 18:41:27 -03:00
Robert Haas a5775991bb Remove no-longer-needed compatibility code for hash indexes.
Because commit ea69a0dead bumped the
HASH_VERSION, we don't need to worry about PostgreSQL 10 seeing
bucket pages from earlier versions.

Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1LAo4DGwh+mi-G3U8Pj1WkBBeFL38xdCnUHJv1z4bZFkQ@mail.gmail.com
2017-05-09 23:44:21 -04:00
Peter Eisentraut 086221cf6b Prevent panic during shutdown checkpoint
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and throws
a PANIC if so.  At that point, only walsenders are still active, so one
might think this could not happen, but walsenders can also generate WAL,
for instance in BASE_BACKUP and certain variants of
CREATE_REPLICATION_SLOT.  So they can trigger this panic if such a
command is run while the shutdown checkpoint is being written.

To fix this, divide the walsender shutdown into two phases.  First, the
postmaster sends a SIGUSR2 signal to all walsenders.  The walsenders
then put themselves into the "stopping" state.  In this state, they
reject any new commands.  (For simplicity, we reject all new commands,
so that in the future we do not have to track meticulously which
commands might generate WAL.)  The checkpointer waits for all walsenders
to reach this state before proceeding with the shutdown checkpoint.
After the shutdown checkpoint is done, the postmaster sends
SIGINT (previously unused) to the walsenders.  This triggers the
existing shutdown behavior of sending out the shutdown checkpoint record
and then terminating.

Author: Michael Paquier <michael.paquier@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
2017-05-05 10:31:42 -04:00
Tom Lane 3f074845a8 Fix pfree-of-already-freed-tuple when rescanning a GiST index-only scan.
GiST's getNextNearest() function attempts to pfree the previously-returned
tuple if any (that is, scan->xs_hitup in HEAD, or scan->xs_itup in older
branches).  However, if we are rescanning a plan node after ending a
previous scan early, those tuple pointers could be pointing to garbage,
because they would be pointing into the scan's pageDataCxt or queueCxt
which has been reset.  In a debug build this reliably results in a crash,
although I think it might sometimes accidentally fail to fail in
production builds.

To fix, clear the pointer field anyplace we reset a context it might
be pointing into.  This may be overkill --- I think probably only the
queueCxt case is involved in this bug, so that resetting in gistrescan()
would be sufficient --- but dangling pointers are generally bad news,
so let's avoid them.

Another plausible answer might be to just not bother with the pfree in
getNextNearest().  The reconstructed tuples would go away anyway in the
context resets, and I'm far from convinced that freeing them a bit earlier
really saves anything meaningful.  I'll stick with the original logic in
this patch, but if we find more problems in the same area we should
consider that approach.

Per bug #14641 from Denis Smirnov.  Back-patch to 9.5 where this
logic was introduced.

Discussion: https://postgr.es/m/20170504072034.24366.57688@wrigleys.postgresql.org
2017-05-04 13:59:39 -04:00
Peter Eisentraut 9414e41ea7 Fix logical replication launcher wake up and reset
After the logical replication launcher was told to wake up at
commit (for example, by a CREATE SUBSCRIPTION command), the flag to wake
up was not reset, so it would be woken up at every following commit as
well.  So fix that by resetting the flag.

Also, we don't need to wake up anything if the transaction was rolled
back.  Just reset the flag in that case.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
2017-05-01 10:18:09 -04:00
Simon Riggs 49e9281549 Rework handling of subtransactions in 2PC recovery
The bug fixed by 0874d4f3e1
caused us to question and rework the handling of
subtransactions in 2PC during and at end of recovery.
Patch adds checks and tests to ensure no further bugs.

This effectively removes the temporary measure put in place
by 546c13e11b.

Author: Simon Riggs
Reviewed-by: Tom Lane, Michael Paquier
Discussion: http://postgr.es/m/CANP8+j+vvXmruL_i2buvdhMeVv5TQu0Hm2+C5N+kdVwHJuor8w@mail.gmail.com
2017-04-27 14:41:22 +02:00
Simon Riggs 546c13e11b Workaround for RecoverPreparedTransactions()
Force overwriteOK = true while we investigate deeper fix

Proposed by Tom Lane as temporary measure, accepted by me
2017-04-23 22:12:01 +01:00
Tom Lane 0874d4f3e1 Fix order of arguments to SubTransSetParent().
ProcessTwoPhaseBuffer (formerly StandbyRecoverPreparedTransactions)
mixed up the parent and child XIDs when calling SubTransSetParent to
record the transactions' relationship in pg_subtrans.

Remarkably, analysis by Simon Riggs suggests that this doesn't lead to
visible problems (at least, not in non-Assert builds).  That might
explain why we'd not noticed it before.  Nonetheless, it's surely wrong.

This code was born broken, so back-patch to all supported branches.

Discussion: https://postgr.es/m/20110.1492905318@sss.pgh.pa.us
2017-04-23 13:11:06 -04:00
Simon Riggs ee01f7092f Exit correctly from PrepareRedoRemove() when not found
Complex crash bug all started with this failure.
Diagnosed and fixed by Nikhil Sontakke, reviewed by me.

Reported-by: Jeff Janes
Author: Nikhil Sontakke
Discussion: https://postgr.es/m/CAMkU=1xBP8cqdS5eK8APHL=X6RHMMM2vG5g+QamduuTsyCwv9g@mail.gmail.com
2017-04-18 11:35:38 +01:00
Simon Riggs aa203e7600 Don’t push nextid too far forwards in recovery
Doing so allows various crash possibilities. Fix by avoiding
having PrescanPreparedTransactions() increment
ShmemVariableCache->nextXid when it has no 2PC files

Bug found by Jeff Janes, diagnosis and patch by Pavan Deolasee,
then patch re-designed for clarity and full accuracy by
Michael Paquier.

Reported-by: Jeff Janes
Author: Pavan Deolasee, Michael Paquier
Discussion: https://postgr.es/m/CAMkU=1zMLnH_i1-PVQ-biZzvNx7VcuatriquEnh7HNk6K8Ss3Q@mail.gmail.com
2017-04-18 11:14:05 +01:00
Peter Eisentraut 6275f5d28a Fix new warnings from GCC 7
This addresses the new warning types -Wformat-truncation
-Wformat-overflow that are part of -Wall, via -Wformat, in GCC 7.
2017-04-17 13:59:46 -04:00
Tom Lane b6dd127128 Ensure BackgroundWorker struct contents are well-defined.
Coverity complained because bgw.bgw_extra wasn't being filled in by
ApplyLauncherRegister().  The most future-proof fix is to memset the
whole BackgroundWorker struct to zeroes.  While at it, let's apply the
same coding rule to other places that set up BackgroundWorker structs;
four out of five had the same or related issues.
2017-04-16 23:23:44 -04:00
Peter Eisentraut c7d225e227 Fix typo in comment
Author: Masahiko Sawada <sawada.mshk@gmail.com>
2017-04-16 19:47:37 -04:00
Tom Lane 083dc95a14 More cleanup of manipulations of hash indexes' hasho_flag field.
Not much point in defining test macros for the flag bits if we
don't use 'em.

Amit Kapila
2017-04-15 14:11:15 -04:00
Tom Lane 32470825d3 Avoid passing function pointers across process boundaries.
We'd already recognized that we can't pass function pointers across process
boundaries for functions in loadable modules, since a shared library could
get loaded at different addresses in different processes.  But actually the
practice doesn't work for functions in the core backend either, if we're
using EXEC_BACKEND.  This is the cause of recent failures on buildfarm
member culicidae.  Switch to passing a string function name in all cases.

Something like this needs to be back-patched into 9.6, but let's see
if the buildfarm likes it first.

Petr Jelinek, with a bunch of basically-cosmetic adjustments by me

Discussion: https://postgr.es/m/548f9c1d-eafa-e3fa-9da8-f0cc2f654e60@2ndquadrant.com
2017-04-14 23:50:16 -04:00
Tom Lane 2040bb4a0b Clean up manipulations of hash indexes' hasho_flag field.
Standardize on testing a hash index page's type by doing
	(opaque->hasho_flag & LH_PAGE_TYPE) == LH_xxx_PAGE
Various places were taking shortcuts like
	opaque->hasho_flag & LH_BUCKET_PAGE
which while not actually wrong, is still bad practice because
it encourages use of
	opaque->hasho_flag & LH_UNUSED_PAGE
which *is* wrong (LH_UNUSED_PAGE == 0, so the above is constant false).
hash_xlog.c's hash_mask() contained such an incorrect test.

This also ensures that we mask out the additional flag bits that
hasho_flag has accreted since 9.6.  pgstattuple's pgstat_hash_page(),
for one, was failing to do that and was thus actively broken.

Also fix assorted comments that hadn't been updated to reflect the
extended usage of hasho_flag, and fix some macros that were testing
just "(hasho_flag & bit)" to use the less dangerous, project-approved
form "((hasho_flag & bit) != 0)".

Coverity found the bug in hash_mask(); I noted the one in
pgstat_hash_page() through code reading.
2017-04-14 17:04:25 -04:00
Alvaro Herrera 8bf74967da Reduce the number of pallocs() in BRIN
Instead of allocating memory in brin_deform_tuple and brin_copy_tuple
over and over during a scan, allow reuse of previously allocated memory.
This is said to make for a measurable performance improvement.

Author: Jinyu Zhang, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/495deb78.4186.1500dacaa63.Coremail.beijing_pg@163.com
2017-04-07 19:08:43 -03:00
Alvaro Herrera 817cb10013 Fix new BRIN desummarize WAL record
The WAL-writing piece was forgetting to set the pages-per-range value.
Also, fix the declared type of struct member heapBlk, which I mistakenly
set as OffsetNumber rather than BlockNumber.

Problem was introduced by commit c655899ba9 (April 1st).  Any system
that tries to replay the new WAL record written before this fix is
likely to die on replay and require pg_resetwal.

Reported by Tom Lane.
Discussion: https://postgr.es/m/20191.1491524824@sss.pgh.pa.us
2017-04-07 17:11:56 -03:00
Tom Lane 3f902354b0 Clean up after insufficiently-researched optimization of tuple conversions.
tupconvert.c's functions formerly considered that an explicit tuple
conversion was necessary if the input and output tupdescs contained
different type OIDs.  The point of that was to make sure that a composite
datum resulting from the conversion would contain the destination rowtype
OID in its composite-datum header.  However, commit 3838074f8 entirely
misunderstood what that check was for, thinking that it had something to do
with presence or absence of an OID column within the tuple.  Removal of the
check broke the no-op conversion path in ExecEvalConvertRowtype, as
reported by Ashutosh Bapat.

It turns out that of the dozen or so call sites for tupconvert.c functions,
ExecEvalConvertRowtype is the only one that cares about the composite-datum
header fields in the output tuple.  In all the rest, we'd much rather avoid
an unnecessary conversion whenever the tuples are physically compatible.
Moreover, the comments in tupconvert.c only promise physical compatibility
not a metadata match.  So, let's accept the removal of the guarantee about
the output tuple's rowtype marking, recognizing that this is a API change
that could conceivably break third-party callers of tupconvert.c.  (So,
let's remember to mention it in the v10 release notes.)

However, commit 3838074f8 did have a bit of a point here, in that two
tuples mustn't be considered physically compatible if one has HEAP_HASOID
set and the other doesn't.  (Some of the callers of tupconvert.c might not
really care about that, but we can't assume it in general.)  The previous
check accidentally covered that issue, because no RECORD types ever have
OIDs, while if two tupdescs have the same named composite type OID then,
a fortiori, they have the same tdhasoid setting.  If we're removing the
type OID match check then we'd better include tdhasoid match as part of
the physical compatibility check.

Without that hack in tupconvert.c, we need ExecEvalConvertRowtype to take
responsibility for inserting the correct rowtype OID label whenever
tupconvert.c decides it need not do anything.  This is easily done with
heap_copy_tuple_as_datum, which will be considerably faster than a tuple
disassembly and reassembly anyway; so from a performance standpoint this
change is a win all around compared to what happened in earlier branches.
It just means a couple more lines of code in ExecEvalConvertRowtype.

Ashutosh Bapat and Tom Lane

Discussion: https://postgr.es/m/CAFjFpRfvHABV6+oVvGcshF8rHn+1LfRUhj7Jz1CDZ4gPUwehBg@mail.gmail.com
2017-04-06 21:10:20 -04:00
Alvaro Herrera 7e534adcdc Fix BRIN cost estimation
The original code was overly optimistic about the cost of scanning a
BRIN index, leading to BRIN indexes being selected when they'd be a
worse choice than some other index.  This complete rewrite should be
more accurate.

Author: David Rowley, based on an earlier patch by Emre Hasegeli
Reviewed-by: Emre Hasegeli
Discussion: https://postgr.es/m/CAKJS1f9n-Wapop5Xz1dtGdpdqmzeGqQK4sV2MK-zZugfC14Xtw@mail.gmail.com
2017-04-06 17:51:53 -03:00
Peter Eisentraut e6c9a5a9bc Fix mixup of bool and ternary value
Not currently a problem, but could be with stricter bool behavior under
stdbool or C++.

Reviewed-by: Andres Freund <andres@anarazel.de>
2017-04-06 13:09:42 -04:00
Simon Riggs cd0cebaf7d Always SnapshotResetXmin() during ClearTransaction()
Avoid corner cases during 2PC with 6bad580d9e
2017-04-06 10:30:22 -04:00
Peter Eisentraut 3217327053 Identity columns
This is the SQL standard-conforming variant of PostgreSQL's serial
columns.  It fixes a few usability issues that serial columns have:

- CREATE TABLE / LIKE copies default but refers to same sequence
- cannot add/drop serialness with ALTER TABLE
- dropping default does not drop sequence
- need to grant separate privileges to sequence
- other slight weirdnesses because serial is some kind of special macro

Reviewed-by: Vitaly Burovoy <vitaly.burovoy@gmail.com>
2017-04-06 08:41:37 -04:00
Simon Riggs 6bad580d9e Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin,
so we can simply avoid running SnapshotResetXmin() twice.

During performance tests by Alexander Korotkov, diagnosis
by Andres Freund showed PgXact array as a bottleneck. After
manual analysis by me of the code paths that touch those
memory locations, I was able to identify extraneous code
in the main transaction commit path.

Avoiding touching highly contented shmem improves concurrent
performance slightly on all workloads, confirmed by tests
run by Ashutosh Sharma and Alexander Korotkov.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-04-06 08:31:52 -04:00
Robert Haas 633e15ea0f Fix pageinspect failures on hash indexes.
Make every page in a hash index which isn't all-zeroes have a valid
special space, so that tools like pageinspect don't error out.

Also, make pageinspect cope with all-zeroes pages, because
_hash_alloc_buckets can leave behind large numbers of those until
they're consumed by splits.

Ashutosh Sharma and Robert Haas, reviewed by Amit Kapila.
Original trouble report from Jeff Janes.

Discussion: http://postgr.es/m/CAMkU=1y6NjKmqbJ8wLMhr=F74WzcMALYWcVFhEpm7i=mV=XsOg@mail.gmail.com
2017-04-05 14:18:15 -04:00
Robert Haas 75a1cbdc3c hash: Fix write-ahead logging bug.
The size of the data is not the same thing as the size of the size of
the data.

Reported off-list by Tushar Ahuja.  Fix by Ashutosh Sharma, reviewed
by Amit Kapila.

Discussion: http://postgr.es/m/CAE9k0PnmPDXfvf8HDObme7q_Ewc4E26ukHXUBPySoOs0ObqqaQ@mail.gmail.com
2017-04-05 11:45:35 -04:00
Peter Eisentraut afd79873a0 Capitalize names of PLs consistently
Author: Daniel Gustafsson <daniel@yesql.se>
2017-04-05 00:38:25 -04:00
Simon Riggs 9a3215026b Make min_wal_size/max_wal_size use MB internally
Previously they were defined using multiples of XLogSegSize.
Remove GUC_UNIT_XSEGS. Introduce GUC_UNIT_MB

Extracted from patch series on XLogSegSize infrastructure.

Beena Emerson
2017-04-04 18:00:01 -04:00
Simon Riggs cd740c0dbf Fix uninitialized variables in twophase.c 2017-04-04 17:50:02 -04:00
Simon Riggs 728bd991c3 Speedup 2PC recovery by skipping two phase state files in normal path
2PC state info held in shmem at PREPARE, then cleaned at COMMIT PREPARED/ABORT PREPARED,
avoiding writing/fsyncing any state information to disk in the normal path, greatly enhancing replay speed.
Prepared transactions that live past one checkpoint redo horizon will be written to disk as now.
Similar conceptually to 978b2f65aa and building upon
the infrastructure created by that commit.

Authors, in equal measure: Stas Kelvich, Nikhil Sontakke and Michael Paquier
Discussion: https://postgr.es/m/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=VjoMnRnW5s+fKST3w@mail.gmail.com
2017-04-04 15:56:56 -04:00
Robert Haas b38006ef6d Fix formula in _hash_spareindex.
This was correct in earlier versions of the patch that lead to
commit ea69a0dead, but somehow got
broken in the last version which I actually committed.

Mithun Cy, per an off-list report from Ashutosh Sharma

Discussion: http://postgr.es/m/CAD__OujbAwNU71v1y-RoQxZ8LZ6-V2UFTkex3v34MK6uZ3Xb5w@mail.gmail.com
2017-04-04 07:45:04 -04:00
Robert Haas ea69a0dead Expand hash indexes more gradually.
Since hash indexes typically have very few overflow pages, adding a
new splitpoint essentially doubles the on-disk size of the index,
which can lead to large and abrupt increases in disk usage (and
perhaps long delays on occasion).  To mitigate this problem to some
degree, divide larger splitpoints into four equal phases.  This means
that, for example, instead of growing from 4GB to 8GB all at once, a
hash index will now grow from 4GB to 5GB to 6GB to 7GB to 8GB, which
is perhaps still not as smooth as we'd like but certainly an
improvement.

This changes the on-disk format of the metapage, so bump HASH_VERSION
from 2 to 3.  This will force a REINDEX of all existing hash indexes,
but that's probably a good idea anyway.  First, hash indexes from
pre-10 versions of PostgreSQL could easily be corrupted, and we don't
want to confuse corruption carried over from an older release with any
corruption caused despite the new write-ahead logging in v10.  Second,
it will let us remove some backward-compatibility code added by commit
293e24e507.

Mithun Cy, reviewed by Amit Kapila, Jesper Pedersen and me.  Regression
test outputs updated by me.

Discussion: http://postgr.es/m/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYty0jCf-pa+m+vYUJ716+AxM7nv_syvyanyf5O-L_i2A@mail.gmail.com
2017-04-03 23:46:33 -04:00
Robert Haas 93cd7684ee Properly acquire buffer lock for page-at-a-time hash vacuum.
In a couple of places, _hash_kill_items was mistakenly called with
the buffer lock not held.  Repair.

Ashutosh Sharma, per a report from Andreas Seltenreich

Discussion: http://postgr.es/m/87o9wo8o0j.fsf@credativ.de
2017-04-03 22:26:06 -04:00
Alvaro Herrera c655899ba9 BRIN de-summarization
When the BRIN summary tuple for a page range becomes too "wide" for the
values actually stored in the table (because the tuples that were
present originally are no longer present due to updates or deletes), it
can be useful to remove the outdated summary tuple, so that a future
summarization can install a tighter summary.

This commit introduces a SQL-callable interface to do so.

Author: Álvaro Herrera
Reviewed-by: Eiji Seki
Discussion: https://postgr.es/m/20170228045643.n2ri74ara4fhhfxf@alvherre.pgsql
2017-04-01 16:10:04 -03:00
Alvaro Herrera 7526e10224 BRIN auto-summarization
Previously, only VACUUM would cause a page range to get initially
summarized by BRIN indexes, which for some use cases takes too much time
since the inserts occur.  To avoid the delay, have brininsert request a
summarization run for the previous range as soon as the first tuple is
inserted into the first page of the next range.  Autovacuum is in charge
of processing these requests, after doing all the regular vacuuming/
analyzing work on tables.

This doesn't impose any new tasks on autovacuum, because autovacuum was
already in charge of doing summarizations.  The only actual effect is to
change the timing, i.e. that it occurs earlier.  For this reason, we
don't go any great lengths to record these requests very robustly; if
they are lost because of a server crash or restart, they will happen at
a later time anyway.

Most of the new code here is in autovacuum, which can now be told about
"work items" to process.  This can be used for other things such as GIN
pending list cleaning, perhaps visibility map bit setting, both of which
are currently invoked during vacuum, but do not really depend on vacuum
taking place.

The requests are at the page range level, a granularity for which we did
not have SQL-level access; we only had index-level summarization
requests via brin_summarize_new_values().  It seems reasonable to add
SQL-level access to range-level summarization too, so add a function
brin_summarize_range() to do that.

Authors: Álvaro Herrera, based on sketch from Simon Riggs.
Reviewed-by: Thomas Munro.
Discussion: https://postgr.es/m/20170301045823.vneqdqkmsd4as4ds@alvherre.pgsql
2017-04-01 14:00:53 -03:00
Robert Haas 2113ac4cbb Don't use bgw_main even to specify in-core bgworker entrypoints.
On EXEC_BACKEND builds, this can fail if ASLR is in use.

Backpatch to 9.5.  On master, completely remove the bgw_main field
completely, since there is no situation in which it is safe for an
EXEC_BACKEND build.  On 9.6 and 9.5, leave the field intact to avoid
breaking things for third-party code that doesn't care about working
under EXEC_BACKEND.  Prior to 9.5, there are no in-core bgworker
entrypoints.

Petr Jelinek, reviewed by me.

Discussion: http://postgr.es/m/09d8ad33-4287-a09b-a77f-77f8761adb5e@2ndquadrant.com
2017-03-31 20:43:32 -04:00
Robert Haas c94e6942ce Don't allocate storage for partitioned tables.
Also, don't allow setting reloptions on them, since that would have no
effect given the lack of storage.  The patch does this by introducing
a new reloption kind for which there are currently no reloptions -- we
might have some in the future -- so it adjusts parseRelOptions to
handle that case correctly.

Bumped catversion.  System catalogs that contained reloptions for
partitioned tables are no longer valid; plus, there are now fewer
physical files on disk, which is not technically a catalog change but
still a good reason to re-initdb.

Amit Langote, reviewed by Maksim Milyutin and Kyotaro Horiguchi and
revised a bit by me.

Discussion: http://postgr.es/m/20170331.173326.212311140.horiguchi.kyotaro@lab.ntt.co.jp
2017-03-31 16:28:51 -04:00
Alvaro Herrera 2fd8685e7f Simplify check of modified attributes in heap_update
The old coding was getting more complicated as new things were added,
and it would be barely tolerable with upcoming WARM updates and other
future features such as indirect indexes.  The new coding incurs a small
performance cost in synthetic benchmark cases, and is barely measurable
in normal cases.  A much larger benefit is expected from WARM, which
could actually bolt its needs on top of the existing coding, but it is
much uglier and bug-prone than doing it on this new code.  Additional
optimization can be applied on top of this, if need be.

Reviewed-by: Pavan Deolasee, Amit Kapila, Mithun CY
Discussion: https://postgr.es/m/20161228232018.4hc66ndrzpz4g4wn@alvherre.pgsql
	https://postgr.es/m/CABOikdMJfz69dBNRTOZcB6s5A0tf8OMCyQVYQyR-WFFdoEwKMQ@mail.gmail.com
2017-03-29 14:01:14 -03:00
Alvaro Herrera ce96ce60ca Remove direct uses of ItemPointer.{ip_blkid,ip_posid}
There are no functional changes here; this simply encapsulates knowledge
of the ItemPointerData struct so that a future patch can change things
without more breakage.

All direct users of ip_blkid and ip_posid are changed to use existing
macros ItemPointerGetBlockNumber and ItemPointerGetOffsetNumber
respectively.  For callers where that's inappropriate (because they
Assert that the itempointer is is valid-looking), add
ItemPointerGetBlockNumberNoCheck and ItemPointerGetOffsetNumberNoCheck,
which lack the assertion but are otherwise identical.

Author: Pavan Deolasee
Discussion: https://postgr.es/m/CABOikdNnFon4cJiL=h1mZH3bgUeU+sWHuU4Yr8AB=j3A2p1GiA@mail.gmail.com
2017-03-28 19:02:23 -03:00
Simon Riggs a99f77021f Correct grammar in error message
"could not generate" rather than "could not generation"
from commit 818fd4a67d
2017-03-28 13:24:39 -04:00
Tom Lane 8cfeaecfc7 Suppress implicit-conversion warnings seen with newer clang versions.
We were assigning values near 255 through "char *" pointers.  On machines
where char is signed, that's not entirely kosher, and it's reasonable
for compilers to warn about it.

A better solution would be to change the pointer type to "unsigned char *",
but that would be vastly more invasive.  For the moment, let's just apply
this simple backpatchable solution.

Aleksander Alekseev

Discussion: https://postgr.es/m/20170220141239.GD12278@e733.localdomain
Discussion: https://postgr.es/m/2839.1490714708@sss.pgh.pa.us
2017-03-28 13:16:19 -04:00
Robert Haas c4c51541e2 Still more code review for single-page hash vacuuming.
Most seriously, fix use of incorrect block ID, per a report from
Jeff Janes that it causes a crash and a diagnosis from Amit Kapila.

Improve consistency between the hash and btree versions of this
code by adding back a PANIC that btree has, and by registering
data in the xlog record in the same way, per complaints from
Jeff Janes and Amit Kapila.

Tidy up some minor cosmetic points, per complaints from Amit
Kapila.

Patch by Ashutosh Sharma, reviewed by Amit Kapila, and tested by
Jeff Janes.

Discussion: http://postgr.es/m/CAMkU=1w-9Qe=Ff1o6bSaXpNO9wqpo7_9GL8_CVhw4BoVVHasqg@mail.gmail.com
2017-03-27 12:51:10 -04:00
Teodor Sigaev 1b02be21f2 Fsync directory after creating or unlinking file.
If file was created/deleted just before powerloss it's possible that
file system will miss that. To prevent it, call fsync() where creating/
unlinkg file is critical.

Author: Michael Paquier
Reviewed-by: Ashutosh Bapat, Takayuki Tsunakawa, me
2017-03-27 19:33:01 +03:00
Robert Haas f0a6046bcb Fix comment.
Cut-and-paste led to something silly.

Ashutosh Sharma, reviewed by Amit Kapila and by me

Discussion: http://postgr.es/m/CAE9k0PmUbvQSBY7kwN_OkuqBYyHRXBX-c1ZkuAgR5vgF0GeWzQ@mail.gmail.com
2017-03-26 22:15:50 -04:00
Tom Lane 2f0903ea19 Improve performance of ExecEvalWholeRowVar.
In commit b8d7f053c, we needed to fix ExecEvalWholeRowVar to not change
the state of the slot it's copying.  The initial quick hack at that
required two rounds of tuple construction, which is not very nice.
To fix, add another primitive to tuptoaster.c that does precisely what
we need.  (I initially tried to do this by refactoring one of the
existing functions into two pieces; but it looked like that might hurt
performance for the existing case, and the amount of code that could
be shared is not very large, so I gave up on that.)

Discussion: https://postgr.es/m/26088.1490315792@sss.pgh.pa.us
2017-03-26 19:14:57 -04:00
Simon Riggs 3428ef7911 Reverting 42b4b0b241
Buildfarm issues and other reported issues
2017-03-24 17:56:17 +00:00
Simon Riggs 42b4b0b241 Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin
Avoiding touching highly contented shmem improves concurrent
performance.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-03-24 14:20:59 +00:00
Teodor Sigaev 78874531ba Fix backup canceling
Assert-enabled build crashes but without asserts it works by wrong way:
it may not reset forcing full page write and preventing from starting
exclusive backup with the same name as cancelled.
Patch replaces pair of booleans
nonexclusive_backup_running/exclusive_backup_running to single enum to
correctly describe backup state.

Backpatch to 9.6 where bug was introduced

Reported-by: David Steele
Authors: Michael Paquier, David Steele
Reviewed-by: Anastasia Lubennikova

https://commitfest.postgresql.org/13/1068/
2017-03-24 13:53:40 +03:00
Robert Haas ea42cc18c3 Track the oldest XID that can be safely looked up in CLOG.
This provides infrastructure for looking up arbitrary, user-supplied
XIDs without a risk of scary-looking failures from within the clog
module.  Normally, the oldest XID that can be safely looked up in CLOG
is the same as the oldest XID that can reused without causing
wraparound, and the latter is already tracked.  However, while
truncation is in progress, the values are different, so we must
keep track of them separately.

Craig Ringer, reviewed by Simon Riggs and by me.

Discussion: http://postgr.es/m/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
2017-03-23 14:26:31 -04:00
Teodor Sigaev 218f51584d Reduce page locking in GIN vacuum
GIN vacuum during cleaning posting tree can lock this whole tree for a long
time with by holding LockBufferForCleanup() on root. Patch changes it with
two ways: first, cleanup lock will be taken only if there is an empty page
(which should be deleted) and, second, it tries to lock only subtree, not the
whole posting tree.

Author: Andrey Borodin with minor editorization by me
Reviewed-by: Jeff Davis, me

https://commitfest.postgresql.org/13/896/
2017-03-23 19:38:47 +03:00
Simon Riggs 6912acc04f Replication lag tracking for walsenders
Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

Implements a lag tracker module that reports the lag times based upon
measurements of the time taken for recent WAL to be written, flushed and
replayed and for the sender to hear about it. These times
represent the commit lag that was (or would have been) introduced by each
synchronous commit level, if the remote server was configured as a
synchronous standby.  For an asynchronous standby, the replay_lag column
approximates the delay before recent transactions became visible to queries.
If the standby server has entirely caught up with the sending server and
there is no more WAL activity, the most recently measured lag times will
continue to be displayed for a short time and then show NULL.

Physical replication lag tracking is automatic. Logical replication tracking
is possible but is the responsibility of the logical decoding plugin.
Tracking is a private module operating within each walsender individually,
with values reported to shared memory. Module not used outside of walsender.

Design and code is good enough now to commit - kudos to the author.
In many ways a difficult topic, with important and subtle behaviour so this
shoudl be expected to generate discussion and multiple open items: Test now!

Author: Thomas Munro, following designs by Fujii Masao and Simon Riggs
Review: Simon Riggs, Ian Barwick and Craig Ringer
2017-03-23 14:05:28 +00:00
Stephen Frost 017e4f2588 Expose waitforarchive option through pg_stop_backup()
Internally, we have supported the option to either wait for all of the
WAL associated with a backup to be archived, or to return immediately.
This option is useful to users of pg_stop_backup() as well, when they
are reading the stop backup record position and checking that the WAL
they need has been archived independently.

This patch adds an additional, optional, argument to pg_stop_backup()
which allows the user to indicate if they wish to wait for the WAL to be
archived or not.  The default matches current behavior, which is to
wait.

Author: David Steele, with some minor changes, doc updates by me.
Reviewed by: Takayuki Tsunakawa, Fujii Masao
Discussion: https://postgr.es/m/758e3fd1-45b4-5e28-75cd-e9e7f93a4c02@pgmasters.net
2017-03-22 23:44:58 -04:00
Simon Riggs af4b1a0869 Refactor GetOldestXmin() to use flags
Replace ignoreVacuum parameter with more flexible flags.

Author: Eiji Seki
Review: Haribabu Kommi
2017-03-22 16:51:01 +00:00
Simon Riggs 9b013dc238 Improve performance of replay of AccessExclusiveLocks
A hot standby replica keeps a list of Access Exclusive locks for a top
level transaction. These locks are released when the top level transaction
ends. Searching of this list is O(N^2), and each transaction had to pay the
price of searching this list for locks, even if it didn't take any AE
locks itself.

This patch optimizes this case by having the master server track which
transactions took AE locks, and passes that along to the standby server in
the commit/abort record. This allows the standby to only try to release
locks for transactions which actually took any, avoiding the majority of
the performance issue.

Refactor MyXactAccessedTempRel into MyXactFlags to allow minimal additional
cruft with this.

Analysis and initial patch by David Rowley
Author: David Rowley and Simon Riggs
2017-03-22 13:09:36 +00:00
Simon Riggs 1148e22a82 Teach xlogreader to follow timeline switches
Uses page-based mechanism to ensure we’re using the correct timeline.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with later features.

Craig Ringer, reviewed by Simon Riggs and Andres Freund
2017-03-22 07:05:12 +00:00
Robert Haas 9abbf4727d Another fix for single-page hash index vacuum.
The WAL consistency checking code needed to be updated for the new
page status bit, but that didn't get done previously.

Ashutosh Sharma, reviewed by Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1LP_oz4EfMen14OjJuzN5CqPdfRkFFuA-MfkcfeE8zGyg@mail.gmail.com
2017-03-20 15:55:27 -04:00
Robert Haas 953477ca35 Fixes for single-page hash index vacuum.
Clear LH_PAGE_HAS_DEAD_TUPLES during replay, similar to what gets done
for btree.  Update hashdesc.c for xl_hash_vacuum_one_page.

Oversights in commit 6977b8b7f4 spotted
by Amit Kapila.  Patch by Ashutosh Sharma.

Bump WAL version.  The original patch to make hash indexes write-ahead
logged probably should have done this, and the single page vacuuming
patch probably should have done it again, but better late than never.

Discussion: http://postgr.es/m/CAA4eK1Kd=mJ9xreovcsh0qMiAj-QqCphHVQ_Lfau1DR9oVjASQ@mail.gmail.com
2017-03-20 15:49:09 -04:00
Robert Haas 249cf070e3 Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add and
6f3bd98ebf, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O.  Add wait events for those operations, too.

Rushabh Lathia, with further hacking by me.  Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.

Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 07:43:01 -04:00
Robert Haas 88e66d193f Rename "pg_clog" directory to "pg_xact".
Names containing the letters "log" sometimes confuse users into
believing that only non-critical data is present.  It is hoped
this renaming will discourage ill-considered removals of transaction
status data.

Michael Paquier

Discussion: http://postgr.es/m/CA+Tgmoa9xFQyjRZupbdEFuwUerFTvC6HjZq1ud6GYragGDFFgA@mail.gmail.com
2017-03-17 09:48:38 -04:00
Robert Haas 6977b8b7f4 Port single-page btree vacuum logic to hash indexes.
This is advantageous for hash indexes for the same reasons it's good
for btrees: it accelerates space recycling, reducing bloat.

Ashutosh Sharma, reviewed by Amit Kapila and by me.  A bit of
additional hacking by me.

Discussion: http://postgr.es/m/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw@mail.gmail.com
2017-03-15 22:18:56 -04:00
Robert Haas f7b711c8bc Cosmetic fixes for hash index write-ahead logging.
Amit Kapila.  One of these was reported by Tom Lane.

Discussion: http://postgr.es/m/5515.1489514099@sss.pgh.pa.us
2017-03-15 07:22:49 -04:00
Robert Haas bb4a39637a hash: Support WAL consistency checking.
Kuntal Ghosh, reviewed by Amit Kapila and Ashutosh Sharma, with
a few tweaks by me.

Discussion: http://postgr.es/m/CAGz5QCJLERUn_zoO0eDv6_Y_d0o4tNTMPeR7ivTLBg4rUrJdwg@mail.gmail.com
2017-03-14 14:58:56 -04:00
Robert Haas c11453ce0a hash: Add write-ahead logging support.
The warning about hash indexes not being write-ahead logged and their
use being discouraged has been removed.  "snapshot too old" is now
supported for tables with hash indexes.  Most importantly, barring
bugs, hash indexes will now be crash-safe and usable on standbys.

This commit doesn't yet add WAL consistency checking for hash
indexes, as we now have for other index types; a separate patch has
been submitted to cure that lack.

Amit Kapila, reviewed and slightly modified by me.  The larger patch
series of which this is a part has been reviewed and tested by Álvaro
Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper
Pedersen.

Discussion: http://postgr.es/m/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
2017-03-14 13:27:02 -04:00
Peter Eisentraut a47b38c9ee Spelling fixes
From: Josh Soref <jsoref@gmail.com>
2017-03-14 12:58:39 -04:00
Peter Eisentraut f97a028d8e Spelling fixes in code comments
From: Josh Soref <jsoref@gmail.com>
2017-03-14 12:58:39 -04:00
Tom Lane 5ed6fff6b7 Make logging about multixact wraparound protection less chatty.
The original messaging design, introduced in commit 068cfadf9, seems too
chatty now that some time has elapsed since the bug fix; most installations
will be in good shape and don't really need a reminder about this on every
postmaster start.

Hence, arrange to suppress the "wraparound protections are now enabled"
message during startup (specifically, during the TrimMultiXact() call).
The message will still appear if protection becomes effective at some
later point.

Discussion: https://postgr.es/m/17211.1489189214@sss.pgh.pa.us
2017-03-14 12:47:53 -04:00
Peter Eisentraut 1e6de941e3 Change xlog to WAL in some error messages 2017-03-13 15:42:10 -04:00
Noah Misch 3a0d473192 Use wrappers of PG_DETOAST_DATUM_PACKED() more.
This makes almost all core code follow the policy introduced in the
previous commit.  Specific decisions:

- Text search support functions with char* and length arguments, such as
  prsstart and lexize, may receive unaligned strings.  I doubt
  maintainers of non-core text search code will notice.

- Use plain VARDATA() on values detoasted or synthesized earlier in the
  same function.  Use VARDATA_ANY() on varlenas sourced outside the
  function, even if they happen to always have four-byte headers.  As an
  exception, retain the universal practice of using VARDATA() on return
  values of SendFunctionCall().

- Retain PG_GETARG_BYTEA_P() in pageinspect.  (Page images are too large
  for a one-byte header, so this misses no optimization.)  Sites that do
  not call get_page_from_raw() typically need the four-byte alignment.

- For now, do not change btree_gist.  Its use of four-byte headers in
  memory is partly entangled with storage of 4-byte headers inside
  GBT_VARKEY, on disk.

- For now, do not change gtrgm_consistent() or gtrgm_distance().  They
  incorporate the varlena header into a cache, and there are multiple
  credible implementation strategies to consider.
2017-03-12 19:35:34 -04:00
Noah Misch 2fd26b23b6 Assume deconstruct_array() outputs are untoasted.
In functions that issue a deconstruct_array() call, consistently use
plain VARSIZE()/VARDATA() on the array elements.  Prior practice was
divided between those and VARSIZE_ANY_EXHDR()/VARDATA_ANY().
2017-03-12 19:35:31 -04:00
Robert Haas 390811750d Revert "Use group updates when setting transaction status in clog."
This reverts commit ccce90b398.  This
optimization is unsafe, at least, of rollbacks and rollbacks to
savepoints, but I'm concerned there may be other problematic cases as
well.  Therefore, I've decided to revert this pending further
investigation.
2017-03-10 14:49:56 -05:00
Robert Haas ccce90b398 Use group updates when setting transaction status in clog.
Commit 0e141c0fbb introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around.  Use a similar mechanism to reduce
contention on CLogControlLock.  Testing shows that this very
significantly reduces the amount of time waiting for CLogControlLock
on high-concurrency pgbench tests run on a large multi-socket
machines; whether that translates into a TPS improvement depends on
how much of that contention is simply shifted to some other lock,
particularly WALWriteLock.

Amit Kapila, with some cosmetic changes by me.  Extensively reviewed,
tested, and benchmarked over a period of about 15 months by Simon
Riggs, Robert Haas, Andres Freund, Jesper Pedersen, and especially by
Tomas Vondra and Dilip Kumar.

Discussion: http://postgr.es/m/CAA4eK1L_snxM_JcrzEstNq9P66++F4kKFce=1r5+D1vzPofdtg@mail.gmail.com
Discussion: http://postgr.es/m/CAA4eK1LyR2A+m=RBSZ6rcPEwJ=rVi1ADPSndXHZdjn56yqO6Vg@mail.gmail.com
Discussion: http://postgr.es/m/91d57161-d3ea-0cc2-6066-80713e4f90d7@2ndquadrant.com
2017-03-09 17:49:01 -05:00
Tom Lane 86dbbf20d8 Put back <float.h> in a few files that need it for _isnan().
Further fallout from commit c29aff959: there are some files that need
<float.h>, and were getting it from datatype/timestamp.h, but it was not
apparent in my (tgl's) testing because the requirement for <float.h>
exists only on certain Windows toolchains.

Report and patch by David Rowley.

Discussion: https://postgr.es/m/CAKJS1f-BHceaFzZScFapDV48gUVM2CAOBfhkgffdqXzFb+kwew@mail.gmail.com
2017-03-08 15:38:34 -05:00
Robert Haas f35742ccb7 Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.

Dilip Kumar, with some corrections and cosmetic changes by me.  The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.

Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 12:05:43 -05:00
Robert Haas 98e6e89040 tidbitmap: Support shared iteration.
When a shared iterator is used, each call to tbm_shared_iterate()
returns a result that has not yet been returned to any process
attached to the shared iterator.  In other words, each cooperating
processes gets a disjoint subset of the full result set, but all
results are returned exactly once.

This is infrastructure for parallel bitmap heap scan.

Dilip Kumar.  The larger patch set of which this is a part has been
reviewed and tested by (at least) Andres Freund, Amit Khandekar,
Tushar Ahuja, Rafia Sabih, Haribabu Kommi, and Thomas Munro.

Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 08:09:38 -05:00
Robert Haas 38305398cd hash: Refactor hash index creation.
The primary goal here is to move all of the related page modifications
to a single section of code, in preparation for adding write-ahead
logging.  In passing, rename _hash_metapinit to _hash_init, since it
initializes more than just the metapage.

Amit Kapila.  The larger patch series of which this is a part has been
reviewed and tested by Álvaro Herrera, Ashutosh Sharma, Mark Kirkwood,
Jeff Janes, and Jesper Pedersen.
2017-03-07 17:03:51 -05:00
Heikki Linnakangas 818fd4a67d Support SCRAM-SHA-256 authentication (RFC 5802 and 7677).
This introduces a new generic SASL authentication method, similar to the
GSS and SSPI methods. The server first tells the client which SASL
authentication mechanism to use, and then the mechanism-specific SASL
messages are exchanged in AuthenticationSASLcontinue and PasswordMessage
messages. Only SCRAM-SHA-256 is supported at the moment, but this allows
adding more SASL mechanisms in the future, without changing the overall
protocol.

Support for channel binding, aka SCRAM-SHA-256-PLUS is left for later.

The SASLPrep algorithm, for pre-processing the password, is not yet
implemented. That could cause trouble, if you use a password with
non-ASCII characters, and a client library that does implement SASLprep.
That will hopefully be added later.

Authorization identities, as specified in the SCRAM-SHA-256 specification,
are ignored. SET SESSION AUTHORIZATION provides more or less the same
functionality, anyway.

If a user doesn't exist, perform a "mock" authentication, by constructing
an authentic-looking challenge on the fly. The challenge is derived from
a new system-wide random value, "mock authentication nonce", which is
created at initdb, and stored in the control file. We go through these
motions, in order to not give away the information on whether the user
exists, to unauthenticated users.

Bumps PG_CONTROL_VERSION, because of the new field in control file.

Patch by Michael Paquier and Heikki Linnakangas, reviewed at different
stages by Robert Haas, Stephen Frost, David Steele, Aleksander Alekseev,
and many others.

Discussion: https://www.postgresql.org/message-id/CAB7nPqRbR3GmFYdedCAhzukfKrgBLTLtMvENOmPrVWREsZkF8g%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAB7nPqSMXU35g%3DW9X74HVeQp0uvgJxvYOuA4A-A3M%2B0wfEBv-w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/55192AFE.6080106@iki.fi
2017-03-07 14:25:40 +02:00
Simon Riggs 21d4e2e206 Reduce lock levels for table storage params related to planning
The following parameters are now updateable with ShareUpdateExclusiveLock
effective_io_concurrency
parallel_workers
seq_page_cost
random_page_cost
n_distinct
n_distinct_inherited

Simon Riggs and Fabrízio Mello
2017-03-06 16:04:31 +05:30
Robert Haas 5a73e17317 Improve error reporting for tuple-routing failures.
Currently, the whole row is shown without column names.  Instead,
adopt a style similar to _bt_check_unique() in ExecFindPartition()
and show the failing key: (key1, ...) = (val1, ...).

Amit Langote, per a complaint from Simon Riggs.  Reviewed by me;
I also adjusted the grammar in one of the comments.

Discussion: http://postgr.es/m/9f9dc7ae-14f0-4a25-5485-964d9bfc19bd@lab.ntt.co.jp
2017-03-03 09:09:52 +05:30
Robert Haas 21a3cf4128 hash: Refactor and clean up bucket split code.
As with commit 30df93f698 and commit
b0f18cb77f, the goal here is to move all
of the related page modifications to a single section of code, in
preparation for adding write-ahead logging.

Amit Kapila, with slight changes by me.  The larger patch series of
which this is a part has been reviewed and tested by Álvaro Herrera,
Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper Pedersen.
2017-03-01 14:43:38 +05:30
Magnus Hagander 016c990834 Fix incorrect variable datatype
Both datatypes map to the same underlying one which is why it still
worked, but we should use the correct type.

Author: Kyotaro HORIGUCHI
2017-02-28 12:20:35 +01:00
Tom Lane 9b88f27cb4 Allow index AMs to return either HeapTuple or IndexTuple format during IOS.
Previously, only IndexTuple format was supported for the output data of
an index-only scan.  This is fine for btree, which is just returning a
verbatim index tuple anyway.  It's not so fine for SP-GiST, which can
return reconstructed data that's much larger than a page.

To fix, extend the index AM API so that index-only scan data can be
returned in either HeapTuple or IndexTuple format.  There's other ways
we could have done it, but this way avoids an API break for index AMs
that aren't concerned with the issue, and it costs little except a couple
more fields in IndexScanDescs.

I changed both GiST and SP-GiST to use the HeapTuple method.  I'm not
very clear on whether GiST can reconstruct data that's too large for an
IndexTuple, but that seems possible, and it's not much of a code change to
fix.

Per a complaint from Vik Fearing.  Reviewed by Jason Li.

Discussion: https://postgr.es/m/49527f79-530d-0bfe-3dad-d183596afa92@2ndquadrant.fr
2017-02-27 17:20:34 -05:00
Robert Haas 30df93f698 hash: Refactor overflow page allocation.
As with commit b0f18cb77f, the goal
here is to move all of the related page modifications to a single
section of code, in preparation for adding write-ahead logging.

Amit Kapila, with slight changes by me.  The larger patch series
of which this is a part has been reviewed and tested by Álvaro
Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper
Pedersen, all of whom should also have been credited in the
previous commit message.
2017-02-27 22:59:55 +05:30
Robert Haas b0f18cb77f hash: Refactor bucket squeeze code.
In preparation for adding write-ahead logging to hash indexes,
refactor _hash_freeovflpage and _hash_squeezebucket so that all
related page modifications happen in a single section of code.  The
previous coding assumed that it would be fine to move tuples one at a
time, and also that the various operations involved in freeing an
overflow page didn't necessarily all need to be done together, all
of which is true if you don't care about write-ahead logging.

Amit Kapila, with slight changes by me.
2017-02-27 22:34:21 +05:30
Tom Lane 9e3755ecb2 Remove useless duplicate inclusions of system header files.
c.h #includes a number of core libc header files, such as <stdio.h>.
There's no point in re-including these after having read postgres.h,
postgres_fe.h, or c.h; so remove code that did so.

While at it, also fix some places that were ignoring our standard pattern
of "include postgres[_fe].h, then system header files, then other Postgres
header files".  While there's not any great magic in doing it that way
rather than system headers last, it's silly to have just a few files
deviating from the general pattern.  (But I didn't attempt to enforce this
globally, only in files I was touching anyway.)

I'd be the first to say that this is mostly compulsive neatnik-ism,
but over time it might save enough compile cycles to be useful.
2017-02-25 16:12:55 -05:00
Tom Lane c29aff959d Consistently declare timestamp variables as TimestampTz.
Twiddle the replication-related code so that its timestamp variables
are declared TimestampTz, rather than the uninformative "int64" that
was previously used for meant-to-be-always-integer timestamps.
This resolves the int64-vs-TimestampTz declaration inconsistencies
introduced by commit 7c030783a, though in the opposite direction to
what was originally suggested.

This required including datatype/timestamp.h in a couple more places
than before.  I decided it would be a good idea to slim down that
header by not having it pull in <float.h> etc, as those headers are
no longer at all relevant to its purpose.  Unsurprisingly, a small number
of .c files turn out to have been depending on those inclusions, so add
them back in the .c files as needed.

Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
Discussion: https://postgr.es/m/27694.1487456324@sss.pgh.pa.us
2017-02-23 15:57:08 -05:00
Tom Lane d28aafb6dd Remove pg_control's enableIntTimes field.
We don't need it any more.

pg_controldata continues to report that date/time type storage is
"64-bit integers", but that's now a hard-wired behavior not something
it sees in the data.  This avoids breaking pg_upgrade, and perhaps other
utilities that inspect pg_control this way.  Ditto for pg_resetwal.

I chose to remove the "bigint_timestamps" output column of
pg_control_init(), though, as that function hasn't been around long
and probably doesn't have ossified users.

Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
2017-02-23 12:23:12 -05:00
Robert Haas 5262f7a4fc Add optimizer and executor support for parallel index scans.
In combination with 569174f1be, which
taught the btree AM how to perform parallel index scans, this allows
parallel index scan plans on btree indexes.  This infrastructure
should be general enough to support parallel index scans for other
index AMs as well, if someone updates them to support parallel
scans.

Amit Kapila, reviewed and tested by Anastasia Lubennikova, Tushar
Ahuja, and Haribabu Kommi, and me.
2017-02-15 13:53:24 -05:00
Robert Haas 569174f1be btree: Support parallel index scans.
This isn't exposed to the optimizer or the executor yet; we'll add
support for those things in a separate patch.  But this puts the
basic mechanism in place: several processes can attach to a parallel
btree index scan, and each one will get a subset of the tuples that
would have been produced by a non-parallel scan.  Each index page
becomes the responsibility of a single worker, which then returns
all of the TIDs on that page.

Rahila Syed, Amit Kapila, Robert Haas, reviewed and tested by
Anastasia Lubennikova, Tushar Ahuja, and Haribabu Kommi.
2017-02-15 07:41:14 -05:00
Robert Haas 8da9a22636 Split index xlog headers from other private index headers.
The xlog-specific headers need to be included in both frontend code -
specifically, pg_waldump - and the backend, but the remainder of the
private headers for each index are only needed by the backend.  By
splitting the xlog stuff out into separate headers, pg_waldump pulls
in fewer backend headers, which is a good thing.

Patch by me, reviewed by Michael Paquier and Andres Freund, per a
complaint from Dilip Kumar.

Discussion: http://postgr.es/m/CA+TgmoZ=F=GkxV0YEv-A8tb+AEGy_Qa7GSiJ8deBKFATnzfEug@mail.gmail.com
2017-02-14 15:37:59 -05:00
Robert Haas fb47544d0c Minor fixes for WAL consistency checking.
Michael Paquier, reviewed and slightly revised by me.

Discussion: http://postgr.es/m/CAB7nPqRzCQb=vdfHvMtP0HMLBHU6z1aGdo4GJsUP-HP8jx+Pkw@mail.gmail.com
2017-02-14 12:41:01 -05:00
Robert Haas 3f01fd4ca0 Rename dtrace probes for ongoing xlog -> wal conversion.
xlog-switch becomes wal-switch, and xlog-insert becomes wal-insert.
2017-02-09 16:40:19 -05:00
Robert Haas 85c11324ca Rename user-facing tools with "xlog" in the name to say "wal".
This means pg_receivexlog because pg_receivewal, pg_resetxlog
becomes pg_resetwal, and pg_xlogdump becomes pg_waldump.
2017-02-09 16:23:46 -05:00
Robert Haas 806091c96f Remove all references to "xlog" from SQL-callable functions in pg_proc.
Commit f82ec32ac3 renamed the pg_xlog
directory to pg_wal.  To make things consistent, and because "xlog" is
terrible terminology for either "transaction log" or "write-ahead log"
rename all SQL-callable functions that contain "xlog" in the name to
instead contain "wal".  (Note that this may pose an upgrade hazard for
some users.)

Similarly, rename the xlog_position argument of the functions that
create slots to be called wal_position.

Discussion: https://www.postgresql.org/message-id/CA+Tgmob=YmA=H3DbW1YuOXnFVgBheRmyDkWcD9M8f=5bGWYEoQ@mail.gmail.com
2017-02-09 15:10:09 -05:00
Robert Haas fc8219dc54 pageinspect: Fix hash_bitmap_info not to read the underlying page.
It did that to verify that the page was an overflow page rather than
anything else, but that means that checking the status of all the
overflow bits requires reading the entire index.  So don't do that.
The new code validates that the page is not a primary bucket page
or bitmap page by looking at the metapage, so that using this on
large numbers of pages can be reasonably efficient.

Ashutosh Sharma, per a complaint from me, and with further
modifications by me.
2017-02-09 14:34:34 -05:00
Tom Lane 86d911ec0f Allow index AMs to cache data across aminsert calls within a SQL command.
It's always been possible for index AMs to cache data across successive
amgettuple calls within a single SQL command: the IndexScanDesc.opaque
field is meant for precisely that.  However, no comparable facility
exists for amortizing setup work across successive aminsert calls.
This patch adds such a feature and teaches GIN, GIST, and BRIN to use it
to amortize catalog lookups they'd previously been doing on every call.
(The other standard index AMs keep everything they need in the relcache,
so there's little to improve there.)

For GIN, the overall improvement in a statement that inserts many rows
can be as much as 10%, though it seems a bit less for the other two.
In addition, this makes a really significant difference in runtime
for CLOBBER_CACHE_ALWAYS tests, since in those builds the repeated
catalog lookups are vastly more expensive.

The reason this has been hard up to now is that the aminsert function is
not passed any useful place to cache per-statement data.  What I chose to
do is to add suitable fields to struct IndexInfo and pass that to aminsert.
That's not widening the index AM API very much because IndexInfo is already
within the ken of ambuild; in fact, by passing the same info to aminsert
as to ambuild, this is really removing an inconsistency in the AM API.

Discussion: https://postgr.es/m/27568.1486508680@sss.pgh.pa.us
2017-02-09 11:52:12 -05:00
Robert Haas a507b86900 Add WAL consistency checking facility.
When the new GUC wal_consistency_checking is set to a non-empty value,
it triggers recording of additional full-page images, which are
compared on the standby against the results of applying the WAL record
(without regard to those full-page images).  Allowable differences
such as hints are masked out, and the resulting pages are compared;
any difference results in a FATAL error on the standby.

Kuntal Ghosh, based on earlier patches by Michael Paquier and Heikki
Linnakangas.  Extensively reviewed and revised by Michael Paquier and
by me, with additional reviews and comments from Amit Kapila, Álvaro
Herrera, Simon Riggs, and Peter Eisentraut.
2017-02-08 15:45:30 -05:00
Robert Haas 94708c0e8c Fix compiler warning.
Mithun Cy, per a report by Erik Rijkers
2017-02-07 15:09:14 -05:00
Robert Haas 293e24e507 Cache hash index's metapage in rel->rd_amcache.
This avoids a very significant amount of buffer manager traffic and
contention when scanning hash indexes, because it's no longer
necessary to lock and pin the metapage for every scan.  We do need
some way of figuring out when the cache is too stale to use any more,
so that when we lock the primary bucket page to which the cached
metapage points us, we can tell whether a split has occurred since we
cached the metapage data.  To do that, we use the hash_prevblkno field
in the primary bucket page, which would otherwise always be set to
InvalidBuffer.

This patch contains code so that it will continue working (although
less efficiently) with hash indexes built before this change, but
perhaps we should consider bumping the hash version and ripping out
the compatibility code.  That decision can be made later, though.

Mithun Cy, reviewed by Jesper Pedersen, Amit Kapila, and by me.
Before committing, I made a number of cosmetic changes to the last
posted version of the patch, adjusted _hash_getcachedmetap to be more
careful about order of operation, and made some necessary updates to
the pageinspect documentation and regression tests.
2017-02-07 12:35:45 -05:00
Heikki Linnakangas 181bdb90ba Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:33:58 +02:00
Robert Haas 38c363adf4 Improve grammar of message about two-phase state files.
When there's only one two-phase state file, there's also only one
long-running prepared transaction.  Adjust the message text
accordingly.

Nikhil Sontakke

Discussion: http://postgr.es/m/CAMGcDxcmR_DWZXXndGoPzVQx=B17A5=RviEA1qNaF=FWLy5Whw@mail.gmail.com
2017-02-03 17:16:54 -05:00
Robert Haas 08bf6e5295 pageinspect: Support hash indexes.
Patch by Jesper Pedersen and Ashutosh Sharma, with some error handling
improvements by me.  Tests from Peter Eisentraut.  Reviewed by Álvaro
Herrera, Michael Paquier, Jesper Pedersen, Jeff Janes, Peter
Eisentraut, Amit Kapila, Mithun Cy, and me.

Discussion: http://postgr.es/m/e2ac6c58-b93f-9dd9-f4e6-d6d30add7fdf@redhat.com
2017-02-02 14:19:32 -05:00
Andrew Dunstan f1169ab501 Don't count background workers against a user's connection limit.
Doing so doesn't seem to be within the purpose of the per user
connection limits, and has particularly unfortunate effects in
conjunction with parallel queries.

Backpatch to 9.6 where parallel queries were introduced.

David Rowley, reviewed by Robert Haas and Albe Laurenz.
2017-02-01 18:02:43 -05:00
Robert Haas bbd8550bce Refactor other replication commands to use DestRemoteSimple.
Commit a84069d935 added a new type of
DestReceiver to avoid duplicating the existing code for the SHOW
command, but it turns out we can leverage that new DestReceiver
type in a few more places, saving some code.

Michael Paquier, reviewed by Andres Freund and by me.

Discussion: http://postgr.es/m/CAB7nPqSdFOQC0evc0r1nJeQyGBqjBrR41MC4rcMqUUpoJaZbtQ%40mail.gmail.com
Discussion: http://postgr.es/m/CAB7nPqT2K4XFT1JgqufFBjsOc-NUKXg5qBDucHPMbk6Xi1kYaA@mail.gmail.com
2017-02-01 13:42:41 -05:00
Robert Haas 8a815e3fc3 Move comment about test slightly closer to test.
The addition of a TestForOldSnapshot() call here has made the
referent of this comment slightly less clear, so move the comment
to compensate.

Amit Kapila (as part of the parallel index scan patch)
2017-01-31 17:21:02 -05:00
Robert Haas 3838074f86 Be more aggressive in avoiding tuple conversion.
According to the comments in tupconvert.c, it's necessary to perform
tuple conversion when either table has OIDs, and this was previously
checked by ensuring that the tdtypeid value matched between the tables
in question.  However, that's overly stringent: we have access to
tdhasoid and can test directly whether OIDs are present, which lets us
avoid conversion in cases where the type OIDs are different but the
tuple descriptors are entirely the same (and neither has OIDs).  This
is useful to the partitioning code, which can thereby avoid converting
tuples when inserting into a partition whose columns appear in the
same order as the parent columns, the normal case.  It's possible
for the tuple routing code to avoid some additional overhead in this
case as well, so do that, too.

It's not clear whether it would be OK to skip this when both tables
have OIDs: do callers count on this to build a new tuple (losing the
previous OID) in such instances?  Until we figure it out, leave the
behavior in that case alone.

Amit Langote, reviewed by me.
2017-01-24 21:53:38 -05:00
Robert Haas d1ecd53947 Add a SHOW command to the replication command language.
This is useful infrastructure for an upcoming proposed patch to
allow the WAL segment size to be changed at initdb time; tools like
pg_basebackup need the ability to interrogate the server setting.
But it also doesn't seem like a bad thing to have independently of
that; it may find other uses in the future.

Robert Haas and Beena Emerson.  (The original patch here was by
Beena, but I rewrote it to such a degree that most of the code
being committed here is mine.)

Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
2017-01-24 17:04:12 -05:00
Robert Haas a84069d935 Add a new DestReceiver for printing tuples without catalog access.
If you create a DestReciver of type DestRemote and try to use it from
a replication connection that is not bound to a specific daabase, or
any other hypothetical type of backend that is not bound to a specific
database, it will fail because it doesn't have a pg_proc catalog to
look up properties of the types being printed.  In general, that's
an unavoidable problem, but we can hardwire the properties of a few
builtin types in order to support utility commands.  This new
DestReceiver of type DestRemoteSimple does just that.

Patch by me, reviewed by Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
2017-01-24 16:53:56 -05:00
Robert Haas 7b4ac19982 Extend index AM API for parallel index scans.
This patch doesn't actually make any index AM parallel-aware, but it
provides the necessary functions at the AM layer to do so.

Rahila Syed, Amit Kapila, Robert Haas
2017-01-24 16:42:58 -05:00
Robert Haas b1ecb9b3fc Fix interaction of partitioned tables with BulkInsertState.
When copying into a partitioned table, the target heap may change from
one tuple to next.  We must ask ReadBufferBI() to get a new buffer
every time such change occurs.  To do that, use new function
ReleaseBulkInsertStatePin().  This fixes the bug that tuples ended up
being inserted into the wrong partition, which occurred exactly
because the wrong buffer was used.

Amit Langote, per a suggestion from Robert Haas.  Some cosmetic
adjustments by me.

Reports by 高增琦 (Gao Zengqi), Venkata B Nagothi, and
Ragnar Ouchterlony.

Discussion: http://postgr.es/m/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com
Discussion: http://postgr.es/m/CAEyp7J9WiX0L3DoiNcRrY-9iyw%3DqP%2Bj%3DDLsAnNFF1xT2J1ggfQ%40mail.gmail.com
Discussion: http://postgr.es/m/16d73804-c9cd-14c5-463e-5caad563ff77%40agama.tv
Discussion: http://postgr.es/m/CA+TgmoaiZpDVUUN8LZ4jv1qFE_QyR+H9ec+79f5vNczYarg5Zg@mail.gmail.com
2017-01-24 08:50:16 -05:00
Peter Eisentraut f21a563d25 Move some things from builtins.h to new header files
This avoids that builtins.h has to include additional header files.
2017-01-20 20:29:53 -05:00
Peter Eisentraut 665d1fad99 Logical replication
- Add PUBLICATION catalogs and DDL
- Add SUBSCRIPTION catalog and DDL
- Define logical replication protocol and output plugin
- Add logical replication workers

From: Petr Jelinek <petr@2ndquadrant.com>
Reviewed-by: Steve Singer <steve@ssinger.info>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Erik Rijkers <er@xs4all.nl>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
2017-01-20 09:04:49 -05:00
Alvaro Herrera 8eace46d34 Fix race condition in reading commit timestamps
If a user requests the commit timestamp for a transaction old enough
that its data is concurrently being truncated away by vacuum at just the
right time, they would receive an ugly internal file-not-found error
message from slru.c rather than the expected NULL return value.

In a primary server, the window for the race is very small: the lookup
has to occur exactly between the two calls by vacuum, and there's not a
lot that happens between them (mostly just a multixact truncate).  In a
standby server, however, the window is larger because the truncation is
executed as soon as the WAL record for it is replayed, but the advance
of the oldest-Xid is not executed until the next checkpoint record.

To fix in the primary, simply reverse the order of operations in
vac_truncate_clog.  To fix in the standby, augment the WAL truncation
record so that the standby is aware of the new oldest-XID value and can
apply the update immediately.  WAL version bumped because of this.

No backpatch, because of the low importance of the bug and its rarity.

Author: Craig Ringer
Reviewed-By: Petr Jelínek, Peter Eisentraut
Discussion: https://postgr.es/m/CAMsr+YFhVtRQT1VAwC+WGbbxZZRzNou=N9Ed-FrCqkwQ8H8oJQ@mail.gmail.com
2017-01-19 18:24:17 -03:00
Robert Haas e37360d5df Improve comment in hashsearch.c.
Typo fix from Mithun Cy; other improvements by me.
2017-01-18 16:36:48 -05:00
Peter Eisentraut 352a24a1f9 Generate fmgr prototypes automatically
Gen_fmgrtab.pl creates a new file fmgrprotos.h, which contains
prototypes for all functions registered in pg_proc.h.  This avoids
having to manually maintain these prototypes across a random variety of
header files.  It also automatically enforces a correct function
signature, and since there are warnings about missing prototypes, it
will detect functions that are defined but not registered in
pg_proc.h (or otherwise used).

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2017-01-17 14:06:07 -05:00
Fujii Masao 974ece58bb Fix an assertion failure related to an exclusive backup.
Previously multiple sessions could execute pg_start_backup() and
pg_stop_backup() to start and stop an exclusive backup at the same time.
This could trigger the assertion failure of
"FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)".
This happend because, even while pg_start_backup() was starting
an exclusive backup, other session could run pg_stop_backup()
concurrently and mark the backup as not-in-progress unconditionally.

This patch introduces ExclusiveBackupState indicating the state of
an exclusive backup. This state is used to ensure that there is only
one session running pg_start_backup() or pg_stop_backup() at
the same time, to avoid the assertion failure.

Back-patch to all supported versions.

Author: Michael Paquier
Reviewed-By: Kyotaro Horiguchi and me
Reported-By: Andreas Seltenreich
Discussion: <87mvktojme.fsf@credativ.de>
2017-01-17 17:27:32 +09:00
Robert Haas e898437460 Improve coding in _hash_addovflpage.
Instead of relying on the page contents to know whether we have
advanced from the primary bucket page to an overflow page, track
that explicitly.

Amit Kapila, per a complaint by me.
2017-01-10 08:31:03 -05:00
Alvaro Herrera 3957b58b88 Fix ALTER TABLE / SET TYPE for irregular inheritance
If inherited tables don't have exactly the same schema, the USING clause
in an ALTER TABLE / SET DATA TYPE misbehaves when applied to the
children tables since commit 9550e8348b.  Starting with that commit,
the attribute numbers in the USING expression are fixed during parse
analysis.  This can lead to bogus errors being reported during
execution, such as:
   ERROR:  attribute 2 has wrong type
   DETAIL:  Table has type smallint, but query expects integer.

Since it wouldn't do to revert to the original coding, we now apply a
transformation to map the attribute numbers to the correct ones for each
child.

Reported by Justin Pryzby
Analysis by Tom Lane; patch by me.
Discussion: https://postgr.es/m/20170102225618.GA10071@telsasoft.com
2017-01-09 19:26:58 -03:00
Alvaro Herrera 7403561c0f BRIN revmap pages are not standard pages ...
... and therefore we ought not to tell XLogRegisterBuffer the opposite,
when writing XLog for a brin update that moves the index tuple to a
different page.  Otherwise, xlog insertion would try to "compress the
hole" when producing a full-page image for it; but since we don't update
pd_lower/upper, the hole covers the whole page.  On WAL replay, the
revmap page becomes empty and so the entire portion of the index is
useless and needs to be recomputed.

This is low-probability: a BRIN update only moves an index tuple to a
different page when the summary tuple is larger than the existing one,
which doesn't happen with fixed-width datatypes.  Also, the revmap
page must be first after a checkpoint.

Report and patch: Kuntal Ghosh
Bug is alleged to have detected by a WAL-consistency-checking tool.
Discussion: https://postgr.es/m/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com

I posted a test case demonstrating the problem, but I'm refraining from
adding it to the test suite; if the WAL consistency tool makes it in,
that will be a better way to catch this from regressing.  (We should
definitely have someting that causes not-same-page updates, though.)
2017-01-09 18:19:29 -03:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane 54386f3578 Remove triggerable Assert in hashname().
hashname() asserted that the key string it is given is shorter than
NAMEDATALEN.  That should surely always be true if the input is in fact a
regular value of type "name".  However, for reasons of coding convenience,
we allow plain old C strings to be treated as "name" values in many places.
Some SQL functions accept arbitrary "text" inputs, convert them to C
strings, and pass them otherwise-untransformed to syscache lookups for name
columns, allowing an overlength input value to trigger hashname's Assert.

This would be a DOS problem, except that it only happens in assert-enabled
builds which aren't recommended for production.  In a production build,
you'll just get a name lookup error, since regardless of the hash value
computed by hashname, the later equality comparison checks can't match.
Likewise, if the catalog lookup is done by seqscan or indexscan searches,
there will just be a lookup error, since the name comparison functions
don't contain any similar length checks, and will see an overlength input
as unequal to any stored entry.

After discussion we concluded that we should simply remove this Assert.
It's inessential to hashname's own functionality, and having such an
assertion in only some paths for name lookup is more of a foot-gun than
a useful check.  There may or may not be a case for the affected callers
to do something other than let the name lookup fail, but we'll consider
that separately; in any case we probably don't want to change such
behavior in the back branches.

Per report from Tushar Ahuja.  Back-patch to all supported branches.

Report: https://postgr.es/m/7d0809ee-6f25-c9d6-8e74-5b2967830d49@enterprisedb.com
Discussion: https://postgr.es/m/17691.1482523168@sss.pgh.pa.us
2016-12-26 14:58:02 -05:00
Robert Haas 7819ba1ef6 Remove _hash_chgbufaccess().
This is basically for the same reasons I got rid of _hash_wrtbuf()
in commit 25216c98938495fd741bf585dcbef45b3a9ffd40: it's not
convenient to have a function which encapsulates MarkBufferDirty(),
especially as we move towards having hash indexes be WAL-logged.

Patch by me, reviewed (but not entirely endorsed) by Amit Kapila.
2016-12-23 07:14:37 -05:00
Andres Freund 6ef2eba3f5 Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
    https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
    https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 11:31:50 -08:00
Robert Haas 097e41439d Fix broken error check in _hash_doinsert.
You can't just cast a HashMetaPage to a Page, because the meta page
data is stored after the page header, not at offset 0.  Fortunately,
this didn't break anything because it happens to find hashm_bsize
at the offset at which it expects to find pd_pagesize_version, and
the values are close enough to the same that this works out.

Still, it's a bug, so back-patch to all supported versions.

Mithun Cy, revised a bit by me.
2016-12-22 13:59:01 -05:00
Robert Haas dd728826c5 Fix locking problem in _hash_squeezebucket() / _hash_freeovflpage().
A bucket squeeze operation needs to lock each page of the bucket
before releasing the prior page, but the previous coding fumbled the
locking when freeing an overflow page during a bucket squeeze
operation.  Commit 6d46f4783e
introduced this bug.

Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after
an initial trouble report by Jeff Janes.  Reviewed by me.  I also
fixed a problem with a comment.
2016-12-19 12:31:50 -05:00
Robert Haas 3761fe3c20 Simplify LWLock tranche machinery by removing array_base/array_stride.
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche.  So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached.  For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset.  This is arguably more useful,
and certainly a lot cheaper.  Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.

The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".

The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query.  There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
use simultaneously.  So if we didn't get rid of this mechanism we'd
have to make it even more complicated.  By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.

Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
2016-12-16 11:29:23 -05:00
Robert Haas 6a4fe1127c Fix more hash index bugs around marking buffers dirty.
In _hash_freeovflpage(), if we're freeing the overflow page that
immediate follows the page to which tuples are being moved (the
confusingly-named "write buffer"), don't forget to mark that
page dirty after updating its hasho_nextblkno.

In _hash_squeezebucket(), it's not necessary to mark the primary
bucket page dirty if there are no overflow pages, because there's
nothing to squeeze in that case.

Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after
an initial trouble report by Jeff Janes.
2016-12-16 09:55:20 -05:00
Robert Haas 25216c9893 Remove _hash_wrtbuf() in favor of calling MarkBufferDirty().
The whole concept of _hash_wrtbuf() is that we need to know at the
time we're releasing the buffer lock (and pin) whether we dirtied the
buffer, but this is easy to get wrong.  This patch actually fixes one
non-obvious bug of that form: hashbucketcleanup forgot to signal
_hash_squeezebucket, which gets the primary bucket page already
locked, as to whether it had already dirtied the page.  Calling
MarkBufferDirty() at the places where we dirty the buffer is more
intuitive and lets us simplify the code in various places as well.

On top of all that, the ultimate goal here is to make hash indexes
WAL-logged, and as the comments to _hash_wrtbuf() note, it should
go away when that happens.  Making it go away a little earlier than
that seems like a good preparatory step.

Report by Jeff Janes.  Diagnosis by Amit Kapila, Kuntal Ghosh,
and Dilip Kumar.  Patch by me, after studying an alternative patch
submitted by Amit Kapila.

Discussion: http://postgr.es/m/CAA4eK1Kf6tOY0oVz_SEdngiNFkeXrA3xUSDPPORQvsWVPdKqnA@mail.gmail.com
2016-12-16 09:37:28 -05:00
Robert Haas 501c7b94bc Fix bug in hashbulkdelete.
Commit 6d46f4783e failed to account for
the possibility that hashbulkdelete() might encounter a bucket that
has been split since it began scanning the bucket array.  Repair.

Extracted from a larger pathc by Amit Kapila; I rewrote the comment.
2016-12-13 12:16:02 -05:00
Robert Haas 3856cf9607 Remove should_free arguments to tuplesort routines.
Since commit e94568ecc1, the answer is
always "false", and we do not need to complicate the API by arranging
to return a constant value.

Peter Geoghegan

Discussion: http://postgr.es/m/CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
2016-12-12 15:57:35 -05:00
Robert Haas fa0f466d53 Log the creation of an init fork unconditionally.
Previously, it was thought that this only needed to be done for the
benefit of possible standbys, so wal_level = minimal skipped it.
But that's not safe, because during crash recovery we might replay
XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record which recursively
removes the directory that contains the new init fork.  So log it
always.

The user-visible effect of this bug is that if you create a database
or tablespace, then create an unlogged table, then crash without
checkpointing, then restart, accessing the table will fail, because
the it won't have been properly reset.  This commit fixes that.

Michael Paquier, per a report from Konstantin Knizhnik.  Wording of
the comments per a suggestion from me.
2016-12-08 14:12:08 -05:00
Robert Haas f0e44751d7 Implement table partitioning.
Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own.  The children are called
partitions and contain all of the actual data.  Each partition has an
implicit partitioning constraint.  Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed.  Partitions
can't have extra columns and may not allow nulls unless the parent
does.  Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.

Currently, tables can be range-partitioned or list-partitioned.  List
partitioning is limited to a single column, but range partitioning can
involve multiple columns.  A partitioning "column" can be an
expression.

Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations.  The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.

Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others.  Minor revisions by me.
2016-12-07 13:17:55 -05:00
Robert Haas 2f4193c350 Fix race introduced by 6d46f4783e.
It's possible for the metapage contents to change after we release
the lock, so we must read them before releasing the lock.

Amit Kapila.  Submitted in response to a trouble report from
Andreas Seltenreich, though it is not certain this fixes the
problem.
2016-12-05 11:43:37 -05:00
Fujii Masao 5dc851afde Fix incorrect output from gin_desc().
Previously gin_desc() displayed incorrect output "unknown action 0"
for XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE records with
valid actions. The cause of this problem was that gin_desc() wrongly
used XLogRecGetData() to extract data from those records.
Since they were registered by XLogRegisterBufData(), gin_desc() should
have used XLogRecGetBlockData(), instead, like gin_redo().
Also there were other differences about how to treat XLOG_GIN_INSERT
record between gin_desc() and gin_redo().

This commit fixes gin_desc() routine so that it treats those records
in the same way as gin_redo().

Batch-patch to 9.5 where WAL record format was revamped and
XLogRegisterBufData() was added.

Reported-By: Andres Freund
Reviewed-By: Tom Lane
Discussion: <20160509194645.7lewnpw647zegx2m@alap3.anarazel.de>
2016-12-05 20:29:41 +09:00
Robert Haas b460f5d669 Add max_parallel_workers GUC.
Increase the default value of the existing max_worker_processes GUC
from 8 to 16, and add a new max_parallel_workers GUC with a maximum
of 8.  This way, even if the maximum amount of parallel query is
happening, there is still room for background workers that do other
things, as originally envisioned when max_worker_processes was added.

Julien Rouhaud, reviewed by Amit Kapila and by revised by me.
2016-12-02 07:42:58 -05:00
Alvaro Herrera fa2fa99552 Permit dump/reload of not-too-large >1GB tuples
Our documentation states that our maximum field size is 1 GB, and that
our maximum row size of 1.6 TB.  However, while this might be attainable
in theory with enough contortions, it is not workable in practice; for
starters, pg_dump fails to dump tables containing rows larger than 1 GB,
even if individual columns are well below the limit; and even if one
does manage to manufacture a dump file containing a row that large, the
server refuses to load it anyway.

This commit enables dumping and reloading of such tuples, provided two
conditions are met:

1. no single column is larger than 1 GB (in output size -- for bytea
   this includes the formatting overhead)
2. the whole row is not larger than 2 GB

There are three related changes to enable this:

a. StringInfo's API now has two additional functions that allow creating
a string that grows beyond the typical 1GB limit (and "long" string).
ABI compatibility is maintained.  We still limit these strings to 2 GB,
though, for reasons explained below.

b. COPY now uses long StringInfos, so that pg_dump doesn't choke
trying to emit rows longer than 1GB.

c. heap_form_tuple now uses the MCXT_ALLOW_HUGE flag in its allocation
for the input tuple, which means that large tuples are accepted on
input.  Note that at this point we do not apply any further limit to the
input tuple size.

The main reason to limit to 2 GB is that the FE/BE protocol uses 32 bit
length words to describe each row; and because the documentation is
ambiguous on its signedness and libpq does consider it signed, we cannot
use the highest-order bit.  Additionally, the StringInfo API uses "int"
(which is 4 bytes wide in most platforms) in many places, so we'd need
to change that API too in order to improve, which has lots of fallout.

Backpatch to 9.5, which is the oldest that has
MemoryContextAllocExtended, a necessary piece of infrastructure.  We
could apply to 9.4 with very minimal additional effort, but any further
than that would require backpatching "huge" allocations too.

This is the largest set of changes we could find that can be
back-patched without breaking compatibility with existing systems.
Fixing a bigger set of problems (for example, dumping tuples bigger than
2GB, or dumping fields bigger than 1GB) would require changing the FE/BE
protocol and/or changing the StringInfo API in an ABI-incompatible way,
neither of which would be back-patchable.

Authors: Daniel Vérité, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql
2016-12-02 00:34:01 -03:00
Robert Haas 6d46f4783e Improve hash index bucket split behavior.
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency.  Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.

In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel.  There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.

This patch also removes the unworldly assumption that a split will
never be interrupted.  With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion.  While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.

Amit Kapila.  I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's.  Also
reviewed by Jesper Pedersen, Jeff Janes, and others.

Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
2016-11-30 15:39:21 -05:00
Tom Lane dbdfd114f3 Bring some clarity to the defaults for the xxx_flush_after parameters.
Instead of confusingly stating platform-dependent defaults for these
parameters in the comments in postgresql.conf.sample (with the main
entry being a lie on Linux), teach initdb to install the correct
platform-dependent value in postgresql.conf, similarly to the way
we handle other platform-dependent defaults.  This won't do anything
for existing 9.6 installations, but since it's effectively only a
documentation improvement, that seems OK.

Since this requires initdb to have access to the default values,
move the #define's for those to pg_config_manual.h; the original
placement in bufmgr.h is unworkable because that file can't be
included by frontend programs.

Adjust the default value for wal_writer_flush_after so that it is 1MB
regardless of XLOG_BLCKSZ, conforming to what is stated in both the
SGML docs and postgresql.conf.  (We could alternatively make it scale
with XLOG_BLCKSZ, but I'm not sure I see the point.)

Copy-edit related SGML documentation.

Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra.

Discussion: <30ebc6e3-8358-09cf-44a8-578252938424@2ndquadrant.com>
2016-11-25 18:36:10 -05:00
Alvaro Herrera 4aaddf2f00 Fix commit_ts for FrozenXid and BootstrapXid
Previously, requesting commit timestamp for transactions
FrozenTransactionId and BootstrapTransactionId resulted in an error.
But since those values can validly appear in committed tuples' Xmin,
this behavior is unhelpful and error prone: each caller would have to
special-case those values before requesting timestamp data for an Xid.
We already have a perfectly good interface for returning "the Xid you
requested is too old for us to have commit TS data for it", so let's use
that instead.

Backpatch to 9.5, where commit timestamps appeared.

Author: Craig Ringer
Discussion: https://www.postgresql.org/message-id/CAMsr+YFM5Q=+ry3mKvWEqRTxrB0iU3qUSRnS28nz6FJYtBwhJg@mail.gmail.com
2016-11-24 15:39:55 -03:00
Robert Haas e343dfa42b Remove barrier.h
A new thing also called a "barrier" is proposed, but whether we decide
to take that patch or not, this file seems to have outlived its
usefulness.

Thomas Munro
2016-11-22 20:28:24 -05:00
Robert Haas e8ac886c24 Support condition variables.
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true.  In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time.  However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set.  Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.

Robert Haas and Thomas Munro
2016-11-22 14:27:11 -05:00
Robert Haas a43f1939d5 Remove or reduce verbosity of some debug messages.
The debug messages that merely print StartTransactionCommand,
CommitTransactionCommand, ProcessUtilty, or ProcessQuery with no
additional details seem to be useless.  Get rid of them.

The transaction status messages produced by ShowTransactionState are
occasionally useful, but they are extremely verbose, producing
multiple lines of log output every time they fire, which can happens
multiple times per transaction.  So, reduce the level to DEBUG5; avoid
emitting an extra line just to explain which debug point is at issue;
and tighten up the rest of the message so it doesn't use quite so much
horizontal space.

With these changes, it's possible to run a somewhat busy system with a
log level even as high as DEBUG4, whereas previously anything above
DEBUG2 would flood the log with output that probably wasn't really all
that useful.
2016-11-17 17:05:16 -05:00
Alvaro Herrera f65b94f639 Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.

This is a back-patch of commits 687f2cd7a0, 3e4b7d8798, b602842613
by Simon Riggs, to branches 9.4 and 9.5.  No further backpatch is
possible because this depends on catalog scans being MVCC.  I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
2016-11-17 13:31:30 -03:00
Tom Lane 9257f07872 Replace uses of SPI_modifytuple that intend to allocate in current context.
Invent a new function heap_modify_tuple_by_cols() that is functionally
equivalent to SPI_modifytuple except that it always allocates its result
by simple palloc.  I chose however to make the API details a bit more
like heap_modify_tuple: pass a tupdesc rather than a Relation, and use
bool convention for the isnull array.

Use this function in place of SPI_modifytuple at all call sites where the
intended behavior is to allocate in current context.  (There actually are
only two call sites left that depend on the old behavior, which makes me
wonder if we should just drop this function rather than keep it.)

This new function is easier to use than heap_modify_tuple() for purposes
of replacing a single column (or, really, any fixed number of columns).
There are a number of places where it would simplify the code to change
over, but I resisted that temptation for the moment ... everywhere except
in plpgsql's exec_assign_value(); changing that might offer some small
performance benefit, so I did it.

This is on the way to removing SPI_push/SPI_pop, but it seems like
good code cleanup in its own right.

Discussion: <9633.1478552022@sss.pgh.pa.us>
2016-11-08 15:36:44 -05:00
Robert Haas f0e72a25b0 Improve handling of dead tuples in hash indexes.
When squeezing a bucket during vacuum, it's not necessary to retain
any tuples already marked as dead, so ignore them when deciding which
tuples must be moved in order to empty a bucket page.  Similarly, when
splitting a bucket, relocating dead tuples to the new bucket is a
waste of effort; instead, just ignore them.

Amit Kapila, reviewed by me.  Testing help provided by Ashutosh
Sharma.
2016-11-08 10:52:51 -05:00
Tom Lane 5485c99e7f Fix silly nil-pointer-dereference bug introduced in commit d5f6f13f8.
Don't fetch record->xl_info before we've verified that record isn't
NULL.  Per Coverity.

Michael Paquier
2016-11-06 11:29:40 -05:00
Tom Lane d5f6f13f8e Be more consistent about masking xl_info with ~XLR_INFO_MASK.
Generally, WAL resource managers are only supposed to examine the
top 4 bits of a WAL record's xl_info; the rest are reserved for
the WAL mechanism itself.  A few places were not consistent about
doing this with respect to XLOG_CHECKPOINT and XLOG_SWITCH records.
There's no bug currently, since no additional bits ever get set in
these specific record types, but that might not be true forever.
Let's follow the generic coding rule here too.

Michael Paquier
2016-11-04 13:26:49 -04:00
Robert Haas 33839b5ffb Fix leftover reference to background writer performing checkpoints.
This was changed in PostgreSQL 9.2, but somehow this comment never
got updated.
2016-10-28 09:09:00 -04:00
Robert Haas f267c1c244 Fix possible pg_basebackup failure on standby with "include WAL".
If a restartpoint flushed no dirty buffers, it could fail to update
the minimum recovery point, leading to a minimum recovery point prior
to the starting REDO location.  perform_base_backup() would interpret
that as meaning that no WAL files at all needed to be included in the
backup, failing an internal sanity check.  To fix, have restartpoints
always update the minimum recovery point to just after the checkpoint
record itself, so that the file (or files) containing the checkpoint
record will always be included in the backup.

Code by Amit Kapila, per a design suggestion by me, with some
additional work on the code comment by me.  Test case by Michael
Paquier.  Report by Kyotaro Horiguchi.
2016-10-27 11:19:51 -04:00
Alvaro Herrera 00f15338b2 Preserve commit timestamps across clean restart
An oversight in setting the boundaries of known commit timestamps during
startup caused old commit timestamps to become inaccessible after a
server restart.

Author and reporter: Julien Rouhaud
Review, test code: Craig Ringer
2016-10-24 09:45:48 -03:00
Robert Haas 919c811ca1 Fix comment formatting. 2016-10-21 12:04:21 -04:00
Robert Haas f82ec32ac3 Rename "pg_xlog" directory to "pg_wal".
"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believe that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences.  So, rename the directory to "pg_wal".

This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.

We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal".  However, that's still under discussion, so let's do just this
much for now.

Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com

Michael Paquier
2016-10-20 11:32:18 -04:00
Heikki Linnakangas 917dc7d239 Fix WAL-logging of FSM and VM truncation.
When a relation is truncated, it is important that the FSM is truncated as
well. Otherwise, after recovery, the FSM can return a page that has been
truncated away, leading to errors like:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

We were using MarkBufferDirtyHint() to dirty the buffer holding the last
remaining page of the FSM, but during recovery, that might in fact not
dirty the page, and the FSM update might be lost.

To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
requires us to do WAL-logging ourselves, to protect from a torn page, if
checksumming is enabled.

Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
when checksumming is enabled.

Analysis by Pavan Deolasee.

Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
2016-10-19 14:26:05 +03:00
Tom Lane 5c80642aa8 Remove unnecessary int2vector-specific hash function and equality operator.
These functions were originally added in commit d8cedf67a to support
use of int2vector columns as catcache lookup keys.  However, there are
no catcaches that use such columns.  (Indeed I now think it must always
have been dead code: a catcache with such a key column would need an
underlying unique index on the column, but we've never had an int2vector
btree opclass.)

Getting rid of the int2vector-specific operator and function does not
lose any functionality, because operations on int2vectors will now fall
back to the generic anyarray support.  This avoids a wart that a btree
index on an int2vector column (made using anyarray_ops) would fail to
match equality searches, because int2vectoreq wasn't a member of the
opclass.  We don't really care much about that, since int2vector is not
meant as a type for users to use, but it's silly to have extra code and
less functionality.

If we ever do want a catcache to be indexed by an int2vector column,
we'd need to put back full btree and hash opclasses for int2vector,
comparable to the support for oidvector.  (The anyarray code can't be
used at such a low level, because it needs to do catcache lookups.)
But we'll deal with that if/when the need arises.

Also worth noting is that removal of the hash int2vector_ops opclass will
break any user-created hash indexes on int2vector columns.  While hash
anyarray_ops would serve the same purpose, it would probably not compute
the same hash values and thus wouldn't be on-disk-compatible.  Given that
int2vector isn't a user-facing type and we're planning other incompatible
changes in hash indexes for v10 anyway, this doesn't seem like something
to worry about, but it's probably worth mentioning here.

Amit Langote

Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
2016-10-12 14:54:08 -04:00
Tom Lane 2f1eaf87e8 Drop server support for FE/BE protocol version 1.0.
While this isn't a lot of code, it's been essentially untestable for
a very long time, because libpq doesn't support anything older than
protocol 2.0, and has not since release 6.3.  There's no reason to
believe any other client-side code still uses that protocol, either.

Discussion: <2661.1475849167@sss.pgh.pa.us>
2016-10-11 12:19:18 -04:00
Robert Haas 6f3bd98ebf Extend framework from commit 53be0b1ad to report latch waits.
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an
additional wait_event_info parameter; legal values are defined in
pgstat.h.  This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.

Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.

Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity.  We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.

Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.
2016-10-04 11:01:42 -04:00
Tom Lane fdc9186f7e Replace the built-in GIN array opclasses with a single polymorphic opclass.
We had thirty different GIN array opclasses sharing the same operators and
support functions.  That still didn't cover all the built-in types, nor
did it cover arrays of extension-added types.  What we want is a single
polymorphic opclass for "anyarray".  There were two missing features needed
to make this possible:

1. We have to be able to declare the index storage type as ANYELEMENT
when the opclass is declared to index ANYARRAY.  This just takes a few
more lines in index_create().  Although this currently seems of use only
for GIN, there's no reason to make index_create() restrict it to that.

2. We have to be able to identify the proper GIN compare function for
the index storage type.  This patch proceeds by making the compare function
optional in GIN opclass definitions, and specifying that the default btree
comparison function for the index storage type will be looked up when the
opclass omits it.  Again, that seems pretty generically useful.

Since the comparison function lookup is done in initGinState(), making
use of the second feature adds an additional cache lookup to GIN index
access setup.  It seems unlikely that that would be very noticeable given
the other costs involved, but maybe at some point we should consider
making GinState data persist longer than it now does --- we could keep it
in the index relcache entry, perhaps.

Rather fortuitously, we don't seem to need to do anything to get this
change to play nice with dump/reload or pg_upgrade scenarios: the new
opclass definition is automatically selected to replace existing index
definitions, and the on-disk data remains compatible.  Also, if a user has
created a custom opclass definition for a non-builtin type, this doesn't
break that, since CREATE INDEX will prefer an exact match to opcintype
over a match to ANYARRAY.  However, if there's anyone out there with
handwritten DDL that explicitly specifies _bool_ops or one of the other
replaced opclass names, they'll need to adjust that.

Tom Lane, reviewed by Enrique Meneses

Discussion: <14436.1470940379@sss.pgh.pa.us>
2016-09-26 14:52:44 -04:00
Tom Lane 959ea7fa76 Remove useless code.
Apparent copy-and-pasteo in standby_desc_invalidations() had two
entries for msg->id == SHAREDINVALRELMAP_ID.

Aleksander Alekseev

Discussion: <20160923090814.GB1238@e733>
2016-09-23 10:44:50 -04:00
Peter Eisentraut ebdf5bf7d1 Delay updating control file to "in production"
Move the updating of the control file to "in production" status until
the point where WAL writes are allowed.  Before, there could be a
significant gap between the control file update and write transactions
actually being allowed.  This makes it more reliable to use the control
status to verify the end of a promotion.

From: Michael Paquier <michael.paquier@gmail.com>
2016-09-21 12:00:00 -04:00
Heikki Linnakangas 45310221a9 Fix outdated comments, GIST search queue is not an RBTree anymore.
The GiST search queue is implemented as a pairing heap rather than as
Red-Black Tree, since 9.5 (commit e7032610). I neglected these comments
in that commit.
2016-09-20 11:38:25 +03:00
Robert Haas 445a38aba2 Have heapam.h include lockdefs.h rather than lock.h.
lockdefs.h was only split from lock.h relatively recently, and
represents a minimal subset of the old lock.h.  heapam.h only needs
that smaller subset, so adjust it to include only that.  This requires
some corresponding adjustments elsewhere.

Peter Geoghegan
2016-09-13 09:21:35 -04:00
Tom Lane 24992c6db9 Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
The full generality of deleting an arbitrary number of tuples is no longer
needed, so let's save some code and cycles by replacing the original coding
with an implementation based on PageIndexTupleDelete.

We can always get back the old code from git if we need it again for new
callers (though I don't care for its willingness to mess with line pointers
it wasn't told to mess with).

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 19:00:59 -04:00
Tom Lane b1328d78f8 Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.
PageIndexTupleOverwrite performs approximately the same function as
PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem
targeting the same item pointer offset.  But in the case where the new
tuple is the same size as the old, it avoids shuffling other data around on
the page, because the new tuple is placed where the old one was rather than
being appended to the end of the page.  This has been shown to provide a
substantial speedup for some GiST use-cases.

Also, this change allows some API simplifications: we can get rid of
the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for
PageAddItemExtended, since that was used only to cover a corner case
for BRIN that's better expressed by using PageIndexTupleOverwrite.

Note that this patch causes a rather subtle WAL incompatibility: the
physical page content change represented by certain WAL records is now
different than it was before, because while the tuples have the same
itempointer line numbers, the tuples themselves are in different places.
I have not bumped the WAL version number because I think it doesn't matter
unless you are trying to do bitwise comparisons of original and replayed
pages, and in any case we're early in a devel cycle and there will probably
be more WAL changes before v10 gets out the door.

There is probably room to make use of PageIndexTupleOverwrite in SP-GiST
and GIN too, but that is left for a future patch.

Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit
by me

Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>
2016-09-09 18:02:36 -04:00
Alvaro Herrera 5c609a742f Fix locking a tuple updated by an aborted (sub)transaction
When heap_lock_tuple decides to follow the update chain, it tried to
also lock any version of the tuple that was created by an update that
was subsequently rolled back.  This is pointless, since for all intents
and purposes that tuple exists no more; and moreover it causes
misbehavior, as reported independently by Marko Tiikkaja and Marti
Raudsepp: some SELECT FOR UPDATE/SHARE queries may fail to return
the tuples, and assertion-enabled builds crash.

Fix by having heap_lock_updated_tuple test the xmin and return success
immediately if the tuple was created by an aborted transaction.

The condition where tuples become invisible occurs when an updated tuple
chain is followed by heap_lock_updated_tuple, which reports the problem
as HeapTupleSelfUpdated to its caller heap_lock_tuple, which in turn
propagates that code outwards possibly leading the calling code
(ExecLockRows) to believe that the tuple exists no longer.

Backpatch to 9.3.  Only on 9.5 and newer this leads to a visible
failure, because of commit 27846f02c176; before that, heap_lock_tuple
skips the whole dance when the tuple is already locked by the same
transaction, because of the ancient HeapTupleSatisfiesUpdate behavior.
Still, the buggy condition may also exist in more convoluted scenarios
involving concurrent transactions, so it seems safer to fix the bug in
the old branches too.

Discussion:
	https://www.postgresql.org/message-id/CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
	https://www.postgresql.org/message-id/48d3eade-98d3-8b9a-477e-1a8dc32a724d@joh.to
2016-09-09 15:54:29 -03:00
Simon Riggs ec253de1fd Fix corruption of 2PC recovery with subxacts
Reading 2PC state files during recovery was borked, causing corruptions during
recovery. Effect limited to servers with 2PC, subtransactions and
recovery/replication.

Stas Kelvich, reviewed by Michael Paquier and Pavan Deolasee
2016-09-09 11:55:12 +01:00
Simon Riggs 67c6bd1ca3 Fix minor memory leak in Standby startup
StandbyRecoverPreparedTransactions() leaked the buffer
used for two phase state file. This was leaked once
at startup and at every shutdown checkpoint seen.

Backpatch to 9.6

Stas Kelvich
2016-09-08 10:32:58 +01:00
Peter Eisentraut 49eb0fd097 Add location field to DefElem
Add a location field to the DefElem struct, used to parse many utility
commands.  Update various error messages to supply error position
information.

To propogate the error position information in a more systematic way,
create a ParseState in standard_ProcessUtility() and pass that to
interested functions implementing the utility commands.  This seems
better than passing the query string and then reassembling a parse state
ad hoc, which violates the encapsulation of the ParseState type.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2016-09-06 12:00:00 -04:00
Tom Lane 60893786d5 Fix corrupt GIN_SEGMENT_ADDITEMS WAL records on big-endian hardware.
computeLeafRecompressWALData() tried to produce a uint16 WAL log field by
memcpy'ing the first two bytes of an int-sized variable.  That accidentally
works on little-endian hardware, but not at all on big-endian.  Replay then
thinks it's looking at an ADDITEMS action with zero entries, and reads the
first two bytes of the first TID therein as the next segno/action,
typically leading to "unexpected GIN leaf action" errors during replay.
Even if replay failed to crash, the resulting GIN index page would surely
be incorrect.  To fix, just declare the variable as uint16 instead.

Per bug #14295 from Spencer Thomason (much thanks to Spencer for turning
his problem into a self-contained test case).  This likely also explains
a previous report of the same symptom from Bernd Helmle.

Back-patch to 9.4 where the problem was introduced (by commit 14d02f0bb).

Discussion: <20160826072658.15676.7628@wrigleys.postgresql.org>
Possible-Report: <2DA7350F7296B2A142272901@eje.land.credativ.lan>
2016-09-03 13:28:53 -04:00
Simon Riggs 35250b6ad7 New recovery target recovery_target_lsn
Michael Paquier
2016-09-03 17:48:01 +01:00
Heikki Linnakangas 9f85784cae Support multiple iterators in the Red-Black Tree implementation.
While we don't need multiple iterators at the moment, the interface is
nicer and less dangerous this way.

Aleksander Alekseev, with some changes by me.
2016-09-02 08:39:39 +03:00
Tom Lane 0e0f43d6fd Prevent starting a standalone backend with standby_mode on.
This can't really work because standby_mode expects there to be more
WAL arriving, which there will not ever be because there's no WAL
receiver process to fetch it.  Moreover, if standby_mode is on then
hot standby might also be turned on, causing even more strangeness
because that expects read-only sessions to be executing in parallel.
Bernd Helmle reported a case where btree_xlog_delete_get_latestRemovedXid
got confused, but rather than band-aiding individual problems it seems
best to prevent getting anywhere near this state in the first place.
Back-patch to all supported branches.

In passing, also fix some omissions of errcodes in other ereport's in
readRecoveryCommandFile().

Michael Paquier (errcode hacking by me)

Discussion: <00F0B2CEF6D0CEF8A90119D4@eje.credativ.lan>
2016-08-31 08:52:13 -04:00
Alvaro Herrera 8e1e3f958f Split hash.h → hash_xlog.h
Since the hash AM is going to be revamped to have WAL, this is a good
opportunity to clean up the include file a little bit to avoid including
a lot of extra stuff in the future.

Author: Amit Kapila
2016-08-29 18:55:49 -03:00
Fujii Masao bd082231ed Fix typos in comments. 2016-08-29 16:06:40 +09:00
Fujii Masao bab7823a49 Fix pg_xlogdump so that it handles cross-page XLP_FIRST_IS_CONTRECORD record.
Previously pg_xlogdump failed to dump the contents of the WAL file
if the file starts with the continuation WAL record which spans
more than one pages. Since pg_xlogdump assumed that the continuation
record always fits on a page, it could not find the valid WAL record to
start reading from in that case.

This patch changes pg_xlogdump so that it can handle a continuation
WAL record which crosses a page boundary and find the valid record
to start reading from.

Back-patch to 9.3 where pg_xlogdump was introduced.

Author: Pavan Deolasee
Reviewed-By: Michael Paquier and Craig Ringer
Discussion: CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com
2016-08-29 14:34:58 +09:00
Tom Lane ea268cdc9a Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters.  While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea.  Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.

While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.

In passing, change TopMemoryContext to use the default allocation
parameters.  Formerly it could only be extended 8K at a time.  That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there.  There seems no good reason
not to let it use growing blocks like most other contexts.

Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain.  The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.

Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 17:50:38 -04:00
Tom Lane 78dcd027e8 Fix potential memory leakage from HandleParallelMessages().
HandleParallelMessages leaked memory into the caller's context.  Since it's
called from ProcessInterrupts, there is basically zero certainty as to what
CurrentMemoryContext is, which means we could be leaking into long-lived
contexts.  Over the processing of many worker messages that would grow to
be a problem.  Things could be even worse than just a leak, if we happened
to service the interrupt while ErrorContext is current: elog.c thinks it
can reset that on its own whim, possibly yanking storage out from under
HandleParallelMessages.

Give HandleParallelMessages its own dedicated context instead, which we can
reset during each call to ensure there's no accumulation of wasted memory.

Discussion: <16610.1472222135@sss.pgh.pa.us>
2016-08-26 15:04:05 -04:00
Tom Lane fbf28b6b52 Fix logic for adding "parallel worker" context line to worker errors.
The previous coding here was capable of adding a "parallel worker" context
line to errors that were not, in fact, returned from a parallel worker.
Instead of using an errcontext callback to add that annotation, just paste
it onto the message by hand; this looks uglier but is more reliable.

Discussion: <19757.1472151987@sss.pgh.pa.us>
2016-08-26 10:07:28 -04:00
Tom Lane ae4760d667 Fix small query-lifespan memory leak in bulk updates.
When there is an identifiable REPLICA IDENTITY index on the target table,
heap_update leaks the id_attrs bitmapset.  That's not many bytes, but it
adds up over enough rows, since the code typically runs in a query-lifespan
context.  Bug introduced in commit e55704d8b, which did a rather poor job
of cloning the existing use-pattern for RelationGetIndexAttrBitmap().

Per bug #14293 from Zhou Digoal.  Back-patch to 9.4 where the bug was
introduced.

Report: <20160824114320.15676.45171@wrigleys.postgresql.org>
2016-08-24 22:20:25 -04:00
Tom Lane 71e006f031 Suppress compiler warnings in non-cassert builds.
With Asserts off, these variables are set but never used, resulting
in warnings from pickier compilers.  Fix that with our standard solution.
Per report from Jeff Janes.
2016-08-23 23:21:10 -04:00
Tom Lane d2ddee63b4 Improve SP-GiST opclass API to better support unlabeled nodes.
Previously, the spgSplitTuple action could only create a new upper tuple
containing a single labeled node.  This made it useless for opclasses
that prefer to work with fixed sets of nodes (labeled or otherwise),
which meant that restrictive prefixes could not be used with such
node definitions.  Change the output field set for the choose() method
to allow it to specify any valid node set for the new upper tuple,
and to specify which of these nodes to place the modified lower tuple in.

In addition to its primary use for fixed node sets, this feature could
allow existing opclasses that use variable node sets to skip a separate
spgAddNode action when splitting a tuple, by setting up the node needed
for the incoming value as part of the spgSplitTuple action.  However, care
would have to be taken to add the extra node only when it would not make
the tuple bigger than before.  (spgAddNode can enlarge the tuple,
spgSplitTuple can't.)

This is a prerequisite for an upcoming SP-GiST inet opclass, but is
being committed separately to increase the visibility of the API change.

In passing, improve the documentation about the traverse-values feature
that was added by commit ccd6eb49a.

Emre Hasegeli, with cosmetic adjustments and documentation rework by me

Discussion: <CAE2gYzxtth9qatW_OAqdOjykS0bxq7AYHLuyAQLPgT7H9ZU0Cw@mail.gmail.com>
2016-08-23 12:10:34 -04:00
Andres Freund 07ef035129 Fix deletion of speculatively inserted TOAST on conflict
INSERT ..  ON CONFLICT runs a pre-check of the possible conflicting
constraints before performing the actual speculative insertion.  In case
the inserted tuple included TOASTed columns the ON CONFLICT condition
would be handled correctly in case the conflict was caught by the
pre-check, but if two transactions entered the speculative insertion
phase at the same time, one would have to re-try, and the code for
aborting a speculative insertion did not handle deleting the
speculatively inserted TOAST datums correctly.

TOAST deletion would fail with "ERROR: attempted to delete invisible
tuple" as we attempted to remove the TOAST tuples using
simple_heap_delete which reasoned that the given tuples should not be
visible to the command that wrote them.

This commit updates the heap_abort_speculative() function which aborts
the conflicting tuple to use itself, via toast_delete, for deleting
associated TOAST datums.  Like before, the inserted toast rows are not
marked as being speculative.

This commit also adds a isolationtester spec test, exercising the
relevant code path. Unfortunately 9.5 cannot handle two waiting
sessions, and thus cannot execute this test.

Reported-By: Viren Negi, Oskari Saarenmaa
Author: Oskari Saarenmaa, edited a bit by me
Bug: #14150
Discussion: <20160519123338.12513.20271@wrigleys.postgresql.org>
Backpatch: 9.5, where ON CONFLICT was introduced
2016-08-17 17:03:36 -07:00
Peter Eisentraut f0fe1c8f70 Fix typos
From: Alexander Law <exclusion@gmail.com>
2016-08-16 14:52:29 -04:00
Tom Lane b5bce6c1ec Final pgindent + perltidy run for 9.6. 2016-08-15 13:42:51 -04:00
Tom Lane ed0097e4f9 Add SQL-accessible functions for inspecting index AM properties.
Per discussion, we should provide such functions to replace the lost
ability to discover AM properties by inspecting pg_am (cf commit
65c5fcd35).  The added functionality is also meant to displace any code
that was looking directly at pg_index.indoption, since we'd rather not
believe that the bit meanings in that field are part of any client API
contract.

As future-proofing, define the SQL API to not assume that properties that
are currently AM-wide or index-wide will remain so unless they logically
must be; instead, expose them only when inquiring about a specific index
or even specific index column.  Also provide the ability for an index
AM to override the behavior.

In passing, document pg_am.amtype, overlooked in commit 473b93287.

Andrew Gierth, with kibitzing by me and others

Discussion: <87mvl5on7n.fsf@news-spur.riddles.org.uk>
2016-08-13 18:31:14 -04:00
Tom Lane e89526d4f3 In B-tree page deletion, clean up properly after page deletion failure.
In _bt_unlink_halfdead_page(), we might fail to find an immediate left
sibling of the target page, perhaps because of corruption of the page
sibling links.  The code intends to cope with this by just abandoning
the deletion attempt; but what actually happens is that it fails outright
due to releasing the same buffer lock twice.  (And error recovery masks
a second problem, which is possible leakage of a pin on another page.)
Seems to have been introduced by careless refactoring in commit efada2b8e.
Since there are multiple cases to consider, let's make releasing the buffer
lock in the failure case the responsibility of _bt_unlink_halfdead_page()
not its caller.

Also, avoid fetching the leaf page's left-link again after we've dropped
lock on the page.  This is probably harmless, but it's not exactly good
coding practice.

Per report from Kyotaro Horiguchi.  Back-patch to 9.4 where the faulty code
was introduced.

Discussion: <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
2016-08-06 14:28:37 -04:00
Robert Haas 81c766b3fd Change InitToastSnapshot to a macro.
tqual.h is included in some front-end compiles, and a static inline
breaks on buildfarm member castoroides.  Since the macro is never
referenced, it should dodge that problem, although this doesn't
seem like the cleanest way of hiding things from front-end compiles.

Report and review by Tom Lane; patch by me.
2016-08-05 11:58:03 -04:00
Andres Freund e7caacf733 Fix hard to hit race condition in heapam's tuple locking code.
As mentioned in its commit message, eca0f1db left open a race condition,
where a page could be marked all-visible, after the code checked
PageIsAllVisible() to pin the VM, but before the page is locked.  Plug
that hole.

Reviewed-By: Robert Haas, Andres Freund
Author: Amit Kapila
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-08-04 20:07:16 -07:00
Robert Haas 3e2f3c2e42 Prevent "snapshot too old" from trying to return pruned TOAST tuples.
Previously, we tested for MVCC snapshots to see whether they were too
old, but not TOAST snapshots, which can lead to complaints about missing
TOAST chunks if those chunks are subject to early pruning.  Ideally,
the threshold lsn and timestamp for a TOAST snapshot would be that of
the corresponding MVCC snapshot, but since we have no way of deciding
which MVCC snapshot was used to fetch the TOAST pointer, use the oldest
active or registered snapshot instead.

Reported by Andres Freund, who also sketched out what the fix should
look like.  Patch by me, reviewed by Amit Kapila.
2016-08-03 16:50:01 -04:00
Tom Lane b6a97b91ff Block interrupts during HandleParallelMessages().
As noted by Alvaro, there are CHECK_FOR_INTERRUPTS() calls in the shm_mq.c
functions called by HandleParallelMessages().  I believe they're all
unreachable since we always pass nowait = true, but it doesn't seem like
a great idea to assume that no such call will ever be reachable from
HandleParallelMessages().  If that did happen, there would be a risk of a
recursive call to HandleParallelMessages(), which it does not appear to be
designed for --- for example, there's nothing that would prevent
out-of-order processing of received messages.  And certainly such cases
cannot easily be tested.  So let's prevent it by holding off interrupts for
the duration of the function.  Back-patch to 9.5 which contains identical
code.

Discussion: <14869.1470083848@sss.pgh.pa.us>
2016-08-02 16:39:16 -04:00
Tom Lane a5fe473ad7 Minor cleanup for access/transam/parallel.c.
ParallelMessagePending *must* be marked volatile, because it's set
by a signal handler.  On the other hand, it's pointless for
HandleParallelMessageInterrupt to save/restore errno; that must be,
and is, done at the outer level of the SIGUSR1 signal handler.

Calling CHECK_FOR_INTERRUPTS() inside HandleParallelMessages, which itself
is called from CHECK_FOR_INTERRUPTS(), seems both useless and hazardous.
The comment claiming that this is needed to handle the error queue going
away is certainly misguided, in any case.

Improve a couple of error message texts, and use
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE to report loss of parallel worker
connection, since that's what's used in e.g. tqueue.c.  (Maybe it would be
worth inventing a dedicated ERRCODE for this type of failure?  But I do not
think ERRCODE_INTERNAL_ERROR is appropriate.)

Minor stylistic cleanups.
2016-08-01 16:12:01 -04:00
Peter Eisentraut ef5d4a3cfa Message style improvements 2016-07-28 16:34:44 -04:00
Robert Haas 1091402b5a Remove unused structure member.
Michael Paquier
2016-07-21 11:53:44 -04:00
Andres Freund eca0f1db14 Clear all-frozen visibilitymap status when locking tuples.
Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag.  Tuple locking related code has not done so.

To allow selectively resetting all-frozen - to avoid pessimizing
heap_lock_tuple - allow to selectively reset the all-frozen with
visibilitymap_clear(). To avoid having to use
visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical
section, have visibilitymap_clear() return whether any bits have been
reset.

There's a remaining issue (denoted by XXX): After the PageIsAllVisible()
check in heap_lock_tuple() and heap_lock_updated_tuple_rec() the page
status could theoretically change. Practically that currently seems
impossible, because updaters will hold a page level pin already.  Due to
the next beta coming up, it seems better to get the required WAL magic
bump done before resolving this issue.

The added flags field fields to xl_heap_lock and xl_heap_lock_updated
require bumping the WAL magic. Since there's already been a catversion
bump since the last beta, that's not an issue.

Reviewed-By: Robert Haas, Amit Kapila and Andres Freund
Author: Masahiko Sawada, heavily revised by Andres Freund
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-07-18 02:01:13 -07:00
Tom Lane 9563d5b5e4 Add regression test case exercising the sorting path for hash index build.
We've broken this code path at least twice in the past, so it's prudent
to have a test case that covers it.  To allow exercising the code path
without creating a very large (and slow to run) test case, redefine the
sort threshold to be bounded by maintenance_work_mem as well as the number
of available buffers.  While at it, fix an ancient oversight that when
building a temp index, the number of available buffers is not NBuffers but
NLocBuffer.  Also, if assertions are enabled, apply a direct test that the
sort actually does return the tuples in the expected order.

Peter Geoghegan

Patch: <CAM3SWZTBAo4hjbBd780+MrOKiKp_TMo1N3A0Rw9_im8gbD7fQA@mail.gmail.com>
2016-07-16 15:30:15 -04:00
Andres Freund bfa2ab56bb Fix torn-page, unlogged xid and further risks from heap_update().
When heap_update needs to look for a page for the new tuple version,
because the current one doesn't have sufficient free space, or when
columns have to be processed by the tuple toaster, it has to release the
lock on the old page during that. Otherwise there'd be lock ordering and
lock nesting issues.

To avoid concurrent sessions from trying to update / delete / lock the
tuple while the page's content lock is released, the tuple's xmax is set
to the current session's xid.

That unfortunately was done without any WAL logging, thereby violating
the rule that no XIDs may appear on disk, without an according WAL
record.  If the database were to crash / fail over when the page level
lock is released, and some activity lead to the page being written out
to disk, the xid could end up being reused; potentially leading to the
row becoming invisible.

There might be additional risks by not having t_ctid point at the tuple
itself, without having set the appropriate lock infomask fields.

To fix, compute the appropriate xmax/infomask combination for locking
the tuple, and perform WAL logging using the existing XLOG_HEAP_LOCK
record. That allows the fix to be backpatched.

This issue has existed for a long time. There appears to have been
partial attempts at preventing dangers, but these never have fully been
implemented, and were removed a long time ago, in
11919160 (cf. HEAP_XMAX_UNLOGGED).

In master / 9.6, there's an additional issue, namely that the
visibilitymap's freeze bit isn't reset at that point yet. Since that's a
new issue, introduced only in a892234f83, that'll be fixed in a
separate commit.

Author: Masahiko Sawada and Andres Freund
Reported-By: Different aspects by Thomas Munro, Noah Misch, and others
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: 9.1/all supported versions
2016-07-15 17:49:48 -07:00
Andres Freund a4d357bfbd Make HEAP_LOCK/HEAP2_LOCK_UPDATED replay reset HEAP_XMAX_INVALID.
0ac5ad5 started to compress infomask bits in WAL records. Unfortunately
the replay routines for XLOG_HEAP_LOCK/XLOG_HEAP2_LOCK_UPDATED forgot to
reset the HEAP_XMAX_INVALID (and some other) hint bits.

Luckily that's not problematic in the majority of cases, because after a
crash/on a standby row locks aren't meaningful. Unfortunately that does
not hold true in the presence of prepared transactions. This means that
after a crash, or after promotion, row level locks held by a prepared,
but not yet committed, prepared transaction might not be enforced.

Discussion: 20160715192319.ubfuzim4zv3rqnxv@alap3.anarazel.de
Backpatch: 9.3, the oldest branch on which 0ac5ad5 is present.
2016-07-15 14:45:37 -07:00
Alvaro Herrera 533e9c6b06 Avoid serializability errors when locking a tuple with a committed update
When key-share locking a tuple that has been not-key-updated, and the
update is a committed transaction, in some cases we raised
serializability errors:
    ERROR:  could not serialize access due to concurrent update

Because the key-share doesn't conflict with the update, the error is
unnecessary and inconsistent with the case that the update hasn't
committed yet.  This causes problems for some usage patterns, even if it
can be claimed that it's sufficient to retry the aborted transaction:
given a steady stream of updating transactions and a long locking
transaction, the long transaction can be starved indefinitely despite
multiple retries.

To fix, we recognize that HeapTupleSatisfiesUpdate can return
HeapTupleUpdated when an updating transaction has committed, and that we
need to deal with that case exactly as if it were a non-committed
update: verify whether the two operations conflict, and if not, carry on
normally.  If they do conflict, however, there is a difference: in the
HeapTupleBeingUpdated case we can just sleep until the concurrent
transaction is gone, while in the HeapTupleUpdated case this is not
possible and we must raise an error instead.

Per trouble report from Olivier Dony.

In addition to a couple of test cases that verify the changed behavior,
I added a test case to verify the behavior that remains unchanged,
namely that errors are raised when a update that modifies the key is
used.  That must still generate serializability errors.  One
pre-existing test case changes behavior; per discussion, the new
behavior is actually the desired one.

Discussion: https://www.postgresql.org/message-id/560AA479.4080807@odoo.com
  https://www.postgresql.org/message-id/20151014164844.3019.25750@wrigleys.postgresql.org

Backpatch to 9.3, where the problem appeared.
2016-07-15 14:17:20 -04:00
Tom Lane 1acf757255 Fix GiST index build for NaN values in geometric types.
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values.  This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs.  The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons.  Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs.  And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.

To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN.  Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.

I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data.  It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.

Per bug #14238 from Guang-Dih Lei.  Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.

Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
2016-07-14 18:45:59 -04:00
Magnus Hagander 87d84d67bb Fix start WAL filename for concurrent backups from standby
On a standby, ThisTimelineID is always 0, so we would generate a
filename in timeline 0 even for other timelines. Instead, use starttli
which we have retreived from the controlfile.

Report by: Francesco Canovai in bug #14230
Author: Marco Nenciarini
Reviewed by: Michael Paquier and Amit Kapila
2016-07-11 12:02:31 +02:00
Robert Haas 10c0558ffe Fix several mistakes around parallel workers and client_encoding.
Previously, workers sent data to the leader using the client encoding.
That mostly worked, but the leader the converted the data back to the
server encoding.  Since not all encoding conversions are reversible,
that could provoke failures.  Fix by using the database encoding for
all communication between worker and leader.

Also, while temporary changes to GUC settings, as from the SET clause
of a function, are in general OK for parallel query, changing
client_encoding this way inside of a parallel worker is not OK.
Previously, that would have confused the leader; with these changes,
it would not confuse the leader, but it wouldn't do anything either.
So refuse such changes in parallel workers.

Also, the previous code naively assumed that when it received a
NotifyResonse from the worker, it could pass that directly back to the
user.  But now that worker-to-leader communication always uses the
database encoding, that's clearly no longer correct - though,
actually, the old way was always broken for V2 clients.  So
disassemble and reconstitute the message instead.

Issues reported by Peter Eisentraut.  Patch by me, reviewed by
Peter Eisentraut.
2016-06-30 18:35:32 -04:00
Alvaro Herrera b78364df18 Remove unused arguments in two GiST subroutines
These arguments became unused in commit 2c03216d83.  Noticed while
skimming code for unrelated development.

This is cosmetic, so no backpatch.
2016-06-28 16:01:13 -04:00
Alvaro Herrera e3ad3ffa68 Fix handling of multixacts predating pg_upgrade
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore.  In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
    ERROR:  MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails.  Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.

It's possible to get from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.

To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding.  This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean.  Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.

To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact.  (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).

This was broken all along, but the error-facing message appeared first
because of commit 8e9a16ab8f and was partially fixed in a25c2b7c4d.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4d but should cover all cases.

Bug analysis by Andrew Gierth and Álvaro Herrera.

A number of public reports match this bug:
  https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net
  https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com
  https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk
  https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com
  https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
2016-06-24 18:29:28 -04:00
Tom Lane 8cf739de85 Fix building of large (bigger than shared_buffers) hash indexes.
When the index is predicted to need more than NBuffers buckets,
CREATE INDEX attempts to sort the index entries by hash key before
insertion, so as to reduce thrashing.  This code path got broken by
commit 9f03ca9151, which overlooked that _hash_form_tuple() is not
just an alias for index_form_tuple().  The index got built anyway, but
with garbage data, so that searches for pre-existing tuples always
failed.  Fix by refactoring to separate construction of the indexable
data from calling index_form_tuple().

Per bug #14210 from Daniel Newman.  Back-patch to 9.5 where the
bug was introduced.

Report: <20160623162507.17237.39471@wrigleys.postgresql.org>
2016-06-24 16:57:36 -04:00
Alvaro Herrera 1ca4a1b5d2 Finish up XLOG_HINT renaming
Commit b8fd1a09f3 renamed XLOG_HINT to XLOG_FPI, but neglected two
places.

Backpatch to 9.3, like that commit.
2016-06-17 18:05:55 -04:00
Robert Haas 71d05a2c7b pg_visibility: Add pg_truncate_visibility_map function.
This requires some core changes as well so that we can properly
WAL-log the truncation.  Specifically, it changes the format of the
XLOG_SMGR_TRUNCATE WAL record, so bump XLOG_PAGE_MAGIC.

Patch by me, reviewed but not fully endorsed by Andres Freund.
2016-06-17 17:37:30 -04:00
Robert Haas 9c188a8454 Fix typo.
Thomas Munro
2016-06-17 12:55:30 -04:00
Robert Haas 292794f82b Remove PID from 'parallel worker' context message.
Discussion: <bfd204ab-ab1a-792a-b345-0274a09a4b5f@2ndquadrant.com>
2016-06-17 09:26:17 -04:00
Tom Lane bfb937427b Fix fuzzy thinking in ReinitializeParallelDSM().
The fact that no workers were successfully launched in the previous
iteration does not excuse us from setting up properly to try again.
This appears to explain crashes I saw in parallel regression testing
due to error_mqh being NULL when it shouldn't be.

Minor other cosmetic fixes too.
2016-06-16 15:20:29 -04:00
Robert Haas 38e9f90a22 Fix lazy_scan_heap so that it won't mark pages all-frozen too soon.
Commit a892234f83 added a new bit per
page to the visibility map fork indicating whether the page is
all-frozen, but incorrectly assumed that if lazy_scan_heap chose to
freeze a tuple then that tuple would not need to later be frozen
again. This turns out to be false, because xmin and xmax (and
conceivably xvac, if dealing with tuples from very old releases) could
be frozen at separate times.

Thanks to Andres Freund for help in uncovering and tracking down this
issue.
2016-06-15 14:30:06 -04:00
Tom Lane cae1c788b9 Improve the situation for parallel query versus temp relations.
Transmit the leader's temp-namespace state to workers.  This is important
because without it, the workers do not really have the same search path
as the leader.  For example, there is no good reason (and no extant code
either) to prevent a worker from executing a temp function that the
leader created previously; but as things stood it would fail to find the
temp function, and then either fail or execute the wrong function entirely.

We still prohibit a worker from creating a temp namespace on its own.
In effect, a worker can only see the session's temp namespace if the leader
had created it before starting the worker, which seems like the right
semantics.

Also, transmit the leader's BackendId to workers, and arrange for workers
to use that when determining the physical file path of a temp relation
belonging to their session.  While the original intent was to prevent such
accesses entirely, there were a number of holes in that, notably in places
like dbsize.c which assume they can safely access temp rels of other
sessions anyway.  We might as well get this right, as a small down payment
on someday allowing workers to access the leader's temp tables.  (With
this change, directly using "MyBackendId" as a relation or buffer backend
ID is deprecated; you should use BackendIdForTempRelations() instead.
I left a couple of such uses alone though, as they're not going to be
reachable in parallel workers until we do something about localbuf.c.)

Move the thou-shalt-not-access-thy-leader's-temp-tables prohibition down
into localbuf.c, which is where it actually matters, instead of having it
in relation_open().  This amounts to recognizing that access to temp
tables' catalog entries is perfectly safe in a worker, it's only the data
in local buffers that is problematic.

Having done all that, we can get rid of the test in has_parallel_hazard()
that says that use of a temp table's rowtype is unsafe in parallel workers.
That test was unduly expensive, and if we really did need such a
prohibition, that was not even close to being a bulletproof guard for it.
(For example, any user-defined function executed in a parallel worker
might have attempted such access.)
2016-06-09 20:16:11 -04:00
Robert Haas 4bc424b968 pgindent run for 9.6 2016-06-09 18:02:36 -04:00
Robert Haas c9ce4a1c61 Eliminate "parallel degree" terminology.
This terminology provoked widespread complaints.  So, instead, rename
the GUC max_parallel_degree to max_parallel_workers_per_gather
(leaving room for a possible future GUC max_parallel_workers that acts
as a system-wide limit), and rename the parallel_degree reloption to
parallel_workers.  Rename structure members to match.

These changes create a dump/restore hazard for users of PostgreSQL
9.6beta1 who have set the reloption (or applied the GUC using ALTER
USER or ALTER DATABASE).
2016-06-09 10:00:26 -04:00
Peter Eisentraut 5c6d2a5e7c Message style and wording fixes 2016-06-07 14:18:55 -04:00
Robert Haas c6dbf1fe79 Stop the executor if no more tuples can be sent from worker to leader.
If a Gather node has read as many tuples as it needs (for example, due
to Limit) it may detach the queue connecting it to the worker before
reading all of the worker's tuples.  Rather than let the worker
continue to generate and send all of the results, have it stop after
sending the next tuple.

More could be done here to stop the worker even quicker, but this is
about as well as we can hope to do for 9.6.

This is in response to a problem report from Andreas Seltenreich.
Commit 44339b892a should be actually be
sufficient to fix that example even without this change, but it seems
better to do this, too, since we might otherwise waste quite a large
amount of effort in one or more workers.

Discussion: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com

Amit Kapila
2016-06-06 14:52:58 -04:00
Robert Haas 932b97a011 Fix typo.
Jim Nasby
2016-06-06 07:58:50 -04:00
Greg Stark e1623c3959 Fix various common mispellings.
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I relented from some of the most
common like "thru" for that reason. The rest don't seem numerous
enough to cause problems.

Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
2016-06-03 16:08:45 +01:00
Robert Haas fdfaccfa79 Cosmetic improvements to freeze map code.
Per post-commit review comments from Andres Freund, improve variable
names, comments, and in one place, slightly improve the code structure.

Masahiko Sawada
2016-06-03 08:43:41 -04:00
Kevin Grittner 7392eed7c2 Fix btree mark/restore bug.
Commit 2ed5b87f96 introduced a bug in
mark/restore, in an attempt to optimize repeated restores to the
same page.  This caused an assertion failure during a merge join
which fed directly from an index scan, although the impact would
not be limited to that case.  Revert the bad chunk of code from
that commit.

While investigating this bug it was discovered that a particular
"paranoia" set of the mark position field would not prevent bad
behavior; it would just make it harder to diagnose.  Change that
into an assertion, which will draw attention to any future problem
in that area more directly.

Backpatch to 9.5, where the bug was introduced.

Bug #14169 reported by Shinta Koyanagi.
Preliminary analysis by Tom Lane identified which commit caused
the bug.
2016-06-02 12:23:01 -05:00
Alvaro Herrera 975ad4e602 Fix PageAddItem BRIN bug
BRIN was relying on the ability to remove a tuple from an index page,
then putting another tuple in the same line pointer.  But PageAddItem
refuses to add a tuple beyond the first free item past the last used
item, and in particular, it rejects an attempt to add an item to an
empty page anywhere other than the first line pointer.  PageAddItem
issues a WARNING and indicates to the caller that it failed, which in
turn causes the BRIN calling code to issue a PANIC, so the whole
sequence looks like this:
	WARNING:  specified item offset is too large
	PANIC:  failed to add BRIN tuple

To fix, create a new function PageAddItemExtended which is like
PageAddItem except that the two boolean arguments become a flags bitmap;
the "overwrite" and "is_heap" boolean flags in PageAddItem become
PAI_OVERWITE and PAI_IS_HEAP flags in the new function, and a new flag
PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN.
PageAddItem() retains its original signature, for compatibility with
third-party modules (other callers in core code are not modified,
either).

Also, in the belt-and-suspenders spirit, I added a new sanity check in
brinGetTupleForHeapBlock to raise an error if an TID found in the revmap
is not marked as live by the page header.  This causes it to react with
"ERROR: corrupted BRIN index" to the bug at hand, rather than a hard
crash.

Backpatch to 9.5.

Bug reported by Andreas Seltenreich as detected by his handy sqlsmith
fuzzer.
Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
2016-05-30 14:47:22 -04:00
Tom Lane 1e0d6512e5 Fix BTREE_BUILD_STATS build.
Commit 65c5fcd353 broke this by removing a
header include directive that is conditionally required.  Add that back
to nbtree.c, with annotation to keep pgrminclude from re-breaking it.

Peter Geoghegan

Report: <CAM3SWZTNjHFYW_UG8bu0BnogqQ2HfsTgkzXLueuUhfTcYbu5HA@mail.gmail.com>
2016-05-23 19:41:11 -04:00
Teodor Sigaev 7c979c95a3 Allocate all page images at once in generic wal interface
That reduces number of allocation.

Per gripe from Michael Paquier and Tom Lane suggestion.
2016-05-17 22:09:22 +03:00
Teodor Sigaev 7c8345f67f Correctly align page's images in generic wal API
Page image should be MAXALIGN'ed because existing code could directly align
pointers in page instead of align offset from beginning of page.

Found during play with indexes as extenstion, Alexander Korotkov and me
2016-05-17 00:01:35 +03:00
Alvaro Herrera cca2a27860 Fix bogus comments
Some comments mentioned XLogReplayBuffer, but there's no such function:
that was an interim name for a function that got renamed to
XLogReadBufferForRedo, before commit 2c03216d83 was pushed.
2016-05-12 16:07:07 -03:00
Alvaro Herrera bdb9e3dc1d Fix obsolete comment 2016-05-12 15:36:51 -03:00
Tom Lane 9eb7a0ac6b Fix poorly-worded log message.
Euler Taveira
2016-05-08 01:37:07 -04:00
Robert Haas c7ea68ff8d Limit maximum parallel degree to 1024.
This new limit affects both the max_parallel_degree GUC and the
parallel_degree reloption.  There may some day be a use case for using
more than 1024 CPUs for a single query, but that's surely not the case
right now.  Not only do not very many people have that many CPUs, but
the code hasn't been tested at that kind of scale and is very unlikely
to perform well, or even work at all, without a lot more work.  The
issue addressed by commit 06bd458cb8 is
probably just one problem of many.

The idea of a more reasonable limit here was suggested by Tom Lane;
the value of 1024 was suggested by Amit Kapila.
2016-05-06 14:50:54 -04:00
Robert Haas 06bd458cb8 Use mul_size when multiplying by the number of parallel workers.
That way, if the result overflows size_t, you'll get an error instead
of undefined behavior, which seems like a plus.  This also has the
effect of casting the number of workers from int to Size, which is
better because it's harder to overflow int than size_t.

Dilip Kumar reported this issue and provided a patch upon which this
patch is based, but his version did use mul_size.
2016-05-06 14:32:58 -04:00
Kevin Grittner 2cc41acd8f Fix hash index vs "snapshot too old" problemms
Hash indexes are not WAL-logged, and so do not maintain the LSN of
index pages.  Since the "snapshot too old" feature counts on
detecting error conditions using the LSN of a table and all indexes
on it, this makes it impossible to safely do early vacuuming on any
table with a hash index, so add this to the tests for whether the
xid used to vacuum a table can be adjusted based on
old_snapshot_threshold.

While at it, add a paragraph to the docs for old_snapshot_threshold
which specifically mentions this and other aspects of the feature
which may otherwise surprise users.

Problem reported and patch reviewed by Amit Kapila
2016-05-06 07:47:12 -05:00
Alvaro Herrera c1543a81a7 Revert timeline following in replication slots
This reverts commits f07d18b6e9, 82c83b3372, 3a3b309041, and
24c5f1a103.

This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute.  There are
discussions on larger changes in this area for the next release.
2016-05-04 17:32:22 -03:00
Teodor Sigaev f8467f7da8 Prevent to use magic constants
Use macroses for definition amstrategies/amsupport fields instead of
hardcoded values.

Author: Nikolay Shaplov with addition for contrib/bloom
2016-04-28 16:39:25 +03:00
Teodor Sigaev e2c79e14d9 Prevent multiple cleanup process for pending list in GIN.
Previously, ginInsertCleanup could exit early if it detects that someone else
is cleaning up the pending list, without waiting for that someone else to
finish the job. But in this case vacuum could miss tuples to be deleted.

Cleanup process now locks metapage with a help of heavyweight
LockPage(ExclusiveLock), and it guarantees that there is no another cleanup
process at the same time. Lock is taken differently depending on caller of
cleanup process: any vacuums and gin_clean_pending_list() will be blocked
until lock becomes available, ordinary insert uses conditional lock to
prevent indefinite waiting on lock.

Insert into pending list doesn't use this lock, so insertion isn't blocked.

Also, patch adds stopping of cleanup process when at-start-cleanup-tail is
reached in order to prevent infinite cleanup in case of massive insertion. But
it will stop only for automatic maintenance tasks like autovacuum.

Patch introduces choice of limit of memory to use: autovacuum_work_mem,
maintenance_work_mem or work_mem depending on call path.

Patch for previous releases should be reworked due to changes between 9.6 and
previous ones in this area.

Discover and diagnostics by Jeff Janes and Tomas Vondra

Patch by me with some ideas of Jeff Janes
2016-04-28 16:21:42 +03:00
Andres Freund c6ff84b06a Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.

Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.

The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.

Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.

XXX:

Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
    CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
    CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-26 20:21:54 -07:00
Tom Lane bde361fef5 Fix memory leak and other bugs in ginPlaceToPage() & subroutines.
Commit 36a35c550a turned the interface between ginPlaceToPage and
its subroutines in gindatapage.c and ginentrypage.c into a royal mess:
page-update critical sections were started in one place and finished in
another place not even in the same file, and the very same subroutine
might return having started a critical section or not.  Subsequent patches
band-aided over some of the problems with this design by making things
even messier.

One user-visible resulting problem is memory leaks caused by the need for
the subroutines to allocate storage that would survive until ginPlaceToPage
calls XLogInsert (as reported by Julien Rouhaud).  This would not typically
be noticeable during retail index updates.  It could be visible in a GIN
index build, in the form of memory consumption swelling to several times
the commanded maintenance_work_mem.

Another rather nasty problem is that in the internal-page-splitting code
path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before
entering the critical section that it's supposed to be cleared in; a
failure in between would leave the index in a corrupt state.  There were
also assorted coding-rule violations with little immediate consequence but
possible long-term hazards, such as beginning an XLogInsert sequence before
entering a critical section, or calling elog(DEBUG) inside a critical
section.

To fix, redefine the API between ginPlaceToPage() and its subroutines
by splitting the subroutines into two parts.  The "beginPlaceToPage"
subroutine does what can be done outside a critical section, including
full computation of the result pages into temporary storage when we're
going to split the target page.  The "execPlaceToPage" subroutine is called
within a critical section established by ginPlaceToPage(), and it handles
the actual page update in the non-split code path.  The critical section,
as well as the XLOG insertion call sequence, are both now always started
and finished in ginPlaceToPage().  Also, make ginPlaceToPage() create and
work in a short-lived memory context to eliminate the leakage problem.
(Since a short-lived memory context had been getting created in the most
common code path in the subroutines, this shouldn't cause any noticeable
performance penalty; we're just moving the overhead up one call level.)

In passing, fix a bunch of comments that had gone unmaintained throughout
all this klugery.

Report: <571276DD.5050303@dalibo.com>
2016-04-20 14:25:15 -04:00
Kevin Grittner a343e223a5 Revert no-op changes to BufferGetPage()
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature.  Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming).  The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.

This change should have little or no effect on generated executable
code.

Motivated by the back-patching pain of Tom Lane and Robert Haas
2016-04-20 08:31:19 -05:00
Tom Lane f0e766bd7f Fix memory leak in GIN index scans.
The code had a query-lifespan memory leak when encountering GIN entries
that have posting lists (rather than posting trees, ie, there are a
relatively small number of heap tuples containing this index key value).
With a suitable data distribution this could add up to a lot of leakage.
Problem seems to have been introduced by commit 36a35c550, so back-patch
to 9.4.

Julien Rouhaud
2016-04-15 00:02:26 -04:00
Tom Lane 5713f03973 Improve API of GenericXLogRegister().
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future.  Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:42:06 -04:00
Tom Lane bdf7db8192 In generic WAL application and replay, ensure page "hole" is always zero.
The previous coding could allow the contents of the "hole" between pd_lower
and pd_upper to diverge during replay from what it had been when the update
was originally applied.  This would pose a problem if checksums were in
use, and in any case would complicate forensic comparisons between master
and slave servers.  So force the "hole" to contain zeroes, both at initial
application of a generically-logged action, and at replay.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:14:00 -04:00
Stephen Frost cd13471f2e Correct copyright for newly added genericdesc.c
It's 2016 these days (no, not entirely sure how we got here either).

Pointed out by Amit Langote
2016-04-12 08:45:09 -04:00
Tom Lane 660d5fb856 Further minor improvement in generic_xlog.c: always say REGBUF_STANDARD.
Since we're requiring pages handled by generic_xlog.c to be standard
format, specify REGBUF_STANDARD when doing a full-page image, so that
xloginsert.c can compress out the "hole" between pd_lower and pd_upper.
Given the current API in which this path will be taken only for a newly
initialized page, the hole is likely to be particularly large in such
cases, so that this oversight could easily be performance-significant.
I don't notice any particular change in the runtime of contrib/bloom's
regression test, though.
2016-04-10 00:24:28 -04:00
Tom Lane 68689c66ef Micro-optimize GenericXLogFinish().
Make the inner comparison loops of computeDelta() as tight as possible by
pulling considerations of valid and invalid ranges out of the inner loops,
and extending a match or non-match detection as far as possible before
deciding what to do next.  To keep this tractable, give up the possibility
of merging fragments across the pd_lower to pd_upper gap.  The fraction of
pages where that could happen (ie, there are 4 or fewer bytes in the gap,
*and* data changes immediately adjacent to it on both sides) is too small
to be worth spending cycles on.

Also, avoid two BLCKSZ-length memcpy()s by computing the delta before
moving data into the target buffer, instead of after.  This doesn't save
nearly as many cycles as being tenser about computeDelta(), but it still
seems worth doing.

On my machine, this patch cuts a full 40% off the runtime of
contrib/bloom's regression test.
2016-04-09 19:30:56 -04:00
Tom Lane 08e785436f Get rid of GenericXLogUnregister().
This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.

Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.

While at it, do some wordsmithing on the SGML docs for generic WAL.
2016-04-09 16:39:30 -04:00
Tom Lane db03cf375d Code review/prettification for generic_xlog.c.
Improve commentary, use more specific names for the delta fields,
const-ify pointer arguments where possible, avoid assuming that
initializing only the first element of a local array will guarantee
that the remaining elements end up as we need them.  (I think that
code in generic_redo actually worked, but only because InvalidBuffer
is zero; this is a particularly ugly way of depending on that ...)
2016-04-09 15:02:19 -04:00
Tom Lane 2dd318d277 Run pgindent on generic_xlog.c.
This code desperately needs some micro-optimization, and I'd like it
to be formatted a bit more nicely while I work on it.
2016-04-09 13:33:33 -04:00
Kevin Grittner 848ef42bb8 Add the "snapshot too old" feature
This feature is controlled by a new old_snapshot_threshold GUC.  A
value of -1 disables the feature, and that is the default.  The
value of 0 is just intended for testing.  Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect.  The xmin associated with a transaction ID does
still protect dead tuples.  A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.

This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
2016-04-08 14:36:30 -05:00
Kevin Grittner 8b65cf4c5e Modify BufferGetPage() to prepare for "snapshot too old" feature
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in.  It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point.  This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third.  The follow-on patch will change the places where the test
needs to be made.
2016-04-08 14:30:10 -05:00
Teodor Sigaev 8b99edefca Revert CREATE INDEX ... INCLUDING ...
It's not ready yet, revert two commits
690c543550 - unstable test output
386e3d7609 - patch itself
2016-04-08 21:52:13 +03:00
Teodor Sigaev 386e3d7609 CREATE INDEX ... INCLUDING (column[, ...])
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.

Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
2016-04-08 19:45:59 +03:00
Andres Freund 5364b357fb Increase maximum number of clog buffers.
Benchmarking has shown that the current number of clog buffers limits
scalability. We've previously increased the number in 33aaa139, but
that's not sufficient with a large number of clients.

We've benchmarked the cost of increasing the limit by benchmarking worst
case scenarios; testing showed that 128 buffers don't cause a
regression, even in contrived scenarios, whereas 256 does

There are a number of more complex patches flying around to address
various clog scalability problems, but this is simple enough that we can
get it into 9.6; and is beneficial even after those patches have been
applied.

It is a bit unsatisfactory to increase this in small steps every few
releases, but a better solution seems to require a rewrite of slru.c;
not something done quickly.

Author: Amit Kapila and Andres Freund
Discussion: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com
2016-04-08 08:25:59 -07:00
Robert Haas 25fe8b5f1a Add a 'parallel_degree' reloption.
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.

Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me.  Some further hacking by me.
2016-04-08 11:14:56 -04:00
Robert Haas 719c84c1be Extend relations multiple blocks at a time to improve scalability.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate.  Experimentation shows the extending the relation
multiple blocks at a time improves scalability.

Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
2016-04-08 02:04:46 -04:00
Kevin Grittner fcff8a5751 Detect SSI conflicts before reporting constraint violations
While prior to this patch the user-visible effect on the database
of any set of successfully committed serializable transactions was
always consistent with some one-at-a-time order of execution of
those transactions, the presence of declarative constraints could
allow errors to occur which were not possible in any such ordering,
and developers had no good workarounds to prevent user-facing
errors where they were not necessary or desired.  This patch adds
a check for serialization failure ahead of duplicate key checking
so that if a developer explicitly (redundantly) checks for the
pre-existing value they will get the desired serialization failure
where the problem is caused by a concurrent serializable
transaction; otherwise they will get a duplicate key error.

While it would be better if the reads performed by the constraints
could count as part of the work of the transaction for
serialization failure checking, and we will hopefully get there
some day, this patch allows a clean and reliable way for developers
to work around the issue.  In many cases existing code will already
be doing the right thing for this to "just work".

Author: Thomas Munro, with minor editing of docs by me
Reviewed-by: Marko Tiikkaja, Kevin Grittner
2016-04-07 11:12:35 -05:00
Stephen Frost 1574783b4c Use GRANT system to manage access to sensitive functions
Now that pg_dump will properly dump out any ACL changes made to
functions which exist in pg_catalog, switch to using the GRANT system
to manage access to those functions.

This means removing 'if (!superuser()) ereport()' checks from the
functions themselves and then REVOKEing EXECUTE right from 'public' for
these functions in system_views.sql.

Reviews by Alexander Korotkov, Jose Luis Tallon
2016-04-06 21:45:32 -04:00
Simon Riggs 3fe3511d05 Generic Messages for Logical Decoding
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.

Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.

(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)

Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
2016-04-06 10:05:41 +01:00
Magnus Hagander 7117685461 Implement backup API functions for non-exclusive backups
Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.

Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.

The old syntax of the functions remain and work exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.

Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.

Reviewed by David Steele and Amit Kapila
2016-04-05 20:03:49 +02:00
Alvaro Herrera 890614d2b3 Display WAL pointer in rm_redo error callback
This makes it easier to identify the source of a recovery problem
in case of a bug or data corruption.
2016-04-04 18:12:12 -03:00
Alvaro Herrera c9ff752a85 Silence compiler warning
Reported by Peter Eisentraut to occur on 32bit systems
2016-04-04 17:07:23 -03:00
Simon Riggs 3e4b7d8798 Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the pin scan that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin
scan is skipped only in replay; the VACUUM code path on master is not
touched here and WAL is identical.

The current commit applies in all cases, effectively replacing commit
687f2cd7a0.
2016-04-03 17:46:09 +01:00
Teodor Sigaev 65578341af Add Generic WAL interface
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.

Bump XLOG_PAGE_MAGIC because of new record type.

Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
	minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
	Michael Paquier
2016-04-01 12:21:48 +03:00
Alvaro Herrera 24c5f1a103 Enable logical slots to follow timeline switches
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline.  This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30 20:07:05 -03:00
Alvaro Herrera 3b02ea4f07 XLogReader general code cleanup
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.

Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
2016-03-30 18:56:13 -03:00
Teodor Sigaev ccd6eb49a4 Introduce traversalValue for SP-GiST scan
During scan sometimes it would be very helpful to know some information about
parent node or all 	ancestor nodes. Right now reconstructedValue could be used
but it's not a right usage of it (range opclass uses that).

traversalValue is arbitrary piece of memory in separate MemoryContext while
reconstructedVale should have the same type as indexed column.

Subsequent patches for range opclass and quad4d tree will use it.

Author: Alexander Lebedev, Teodor Sigaev
2016-03-30 18:29:28 +03:00
Robert Haas 314cbfc5da Add new replication mode synchronous_commit = 'remote_apply'.
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk.  That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.

To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby.  But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.

Thomas Munro, reviewed by Michael Paquier and by me.  Some additional
tweaks by me.
2016-03-29 21:29:49 -04:00
Andres Freund 1a7a43672b Don't use !! but != 0/NULL to force boolean evaluation.
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.

Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
2016-03-27 18:10:19 +02:00
Alvaro Herrera 473b932870 Support CREATE ACCESS METHOD
This enables external code to create access methods.  This is useful so
that extensions can add their own access methods which can be formally
tracked for dependencies, so that DROP operates correctly.  Also, having
explicit support makes pg_dump work correctly.

Currently only index AMs are supported, but we expect different types to
be added in the future.

Authors: Alexander Korotkov, Petr Jelínek
Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby
Commitfest-URL: https://commitfest.postgresql.org/9/353/
Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com
2016-03-23 23:01:35 -03:00
Peter Eisentraut b555ed8102 Merge wal_level "archive" and "hot_standby" into new name "replica"
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
2016-03-18 23:56:03 +01:00
Robert Haas 3aff33aa68 Fix typos.
Oskari Saarenmaa
2016-03-15 18:06:11 -04:00
Alvaro Herrera 5bcc413f80 Fix typos in comments 2016-03-15 17:57:17 -03:00
Tom Lane ab4ff2889d Fix memory leak in repeated GIN index searches.
Commit d88976cfa1 removed this code from ginFreeScanKeys():
-		if (entry->list)
-			pfree(entry->list);
evidently in the belief that that ItemPointer array is allocated in the
keyCtx and so would be reclaimed by the following MemoryContextReset.
Unfortunately, it isn't and it won't.  It'd likely be a good idea for
that to become so, but as a simple and back-patchable fix in the
meantime, restore this code to ginFreeScanKeys().

Also, add a similar pfree to where startScanEntry() is about to zero out
entry->list.  I am not sure if there are any code paths where this
change prevents a leak today, but it seems like cheap future-proofing.

In passing, make the initial allocation of so->entries[] use palloc
not palloc0.  The code doesn't depend on unused entries being zero;
if it did, the array-enlargement code in ginFillScanEntry() would be
wrong.  So using palloc0 initially can only serve to confuse readers
about what the invariant is.

Per report from Felipe de Jesús Molina Bravo, via Jaime Casanova in
<CAJGNTeMR1ndMU2Thpr8GPDUfiHTV7idELJRFusA5UXUGY1y-eA@mail.gmail.com>
2016-03-13 16:44:31 -04:00
Robert Haas 481c76abf4 Fix a typo, and remove unnecessary pgstat_report_wait_end().
Per Amit Kapila.
2016-03-11 07:34:00 -05:00
Robert Haas 53be0b1add Provide much better wait information in pg_stat_activity.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting.  Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.

Amit Kapila, Ildus Kurbangaliev, reviewed by me.  Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
2016-03-10 12:44:09 -05:00
Simon Riggs e0694cf9c7 Reduce size of two phase file header
Previously 2PC header was fixed at 200 bytes, which in most cases wasted
WAL space for a workload using 2PC heavily.

Pavan Deolasee, reviewed by Petr Jelinek
2016-03-10 12:51:46 +00:00
Simon Riggs fcb4bfddb6 Reduce lock level for altering fillfactor
Fabrízio de Royes Mello and Simon Riggs
2016-03-10 12:07:33 +00:00
Andres Freund 1d4a0ab19a Avoid unlikely data-loss scenarios due to rename() without fsync.
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.

Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability.  The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.

Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
Tom Lane a298a1e06f Fix incorrect handling of NULL index entries in indexed ROW() comparisons.
An index search using a row comparison such as ROW(a, b) > ROW('x', 'y')
would stop upon reaching a NULL entry in the "b" column, ignoring the
fact that there might be non-NULL "b" values associated with later values
of "a".  This happens because _bt_mark_scankey_required() marks the
subsidiary scankey for "b" as required, which is just wrong: it's for
a column after the one with the first inequality key (namely "a"), and
thus can't be considered a required match.

This bit of brain fade dates back to the very beginnings of our support
for indexed ROW() comparisons, in 2006.  Kind of astonishing that no one
came across it before Glen Takahashi, in bug #14010.

Back-patch to all supported versions.

Note: the given test case doesn't actually fail in unpatched 9.1, evidently
because the fix for bug #6278 (i.e., stopping at nulls in either scan
direction) is required to make it fail.  I'm sure I could devise a case
that fails in 9.1 as well, perhaps with something involving making a cursor
back up; but it doesn't seem worth the trouble.
2016-03-09 14:51:22 -05:00
Robert Haas b6fb6471f6 Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.

As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else.  A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting.  But this gets the basic framework in place.

Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 12:08:58 -05:00
Robert Haas 734f86d50d Add new flags argument for xl_heap_visible to heap2_desc.
Masahiko Sawada
2016-03-08 13:28:22 -05:00
Robert Haas 77a1d1e798 Department of second thoughts: remove PD_ALL_FROZEN.
Commit a892234f83 added a second bit per
page to the visibility map, which still seems like a good idea, but it
also added a second page-level bit alongside PD_ALL_VISIBLE to track
whether the visibility map bit was set.  That no longer seems like a
clever plan, because we don't really need that bit for anything.  We
always clear both bits when the page is modified anyway.

Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.
2016-03-08 08:46:48 -05:00
Fujii Masao d34794f7d5 Ignore recovery_min_apply_delay until recovery has reached consistent state
Previously recovery_min_apply_delay was applied even before recovery
had reached consistency. This could cause us to wait a long time
unexpectedly for read-only connections to be allowed. It's problematic
because the standby was useless during that wait time.

This patch changes recovery_min_apply_delay so that it's applied once
the database has reached the consistent state. That is, even if the delay
is set, the standby tries to replay WAL records as fast as possible until
it has reached consistency.

Author: Michael Paquier
Reviewed-By: Julien Rouhaud
Reported-By: Greg Clough
Backpatch: 9.4, where recovery_min_apply_delay was added
Bug: #13770
Discussion: http://www.postgresql.org/message-id/20151111155006.2644.84564@wrigleys.postgresql.org
2016-03-06 02:29:04 +09:00
Robert Haas 6fcde8a5c8 Minor improvements to transaction manager README.
A simple SELECT is handled by PortalRunSelect, not ProcessQuery.  Also,
the previous indentation was unclear: change it so that a deeper level
of indentation indicates that the outer function calls the inner one.

Stas Kelvich
2016-03-04 14:12:28 -05:00
Robert Haas df4685fb0c Minor optimizations based on ParallelContext having nworkers_launched.
Originally, we didn't have nworkers_launched, so code that used parallel
contexts had to be preprared for the possibility that not all of the
workers requested actually got launched.  But now we can count on knowing
the number of workers that were successfully launched, which can shave
off a few cycles and simplify some code slightly.

Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter
Geoghegan.
2016-03-04 12:59:10 -05:00
Simon Riggs c7111d11b1 Revert buggy optimization of index scans
606c0123d6 attempted to reduce cost of index scans using > and <
strategies, though got that completely wrong in a few complex cases.

Revert whole patch until we find a safe optimization.
2016-03-03 09:53:43 +00:00
Robert Haas a892234f83 Change the format of the VM fork to add a second bit per page.
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.

A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations.  A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be.  This commit does not attempt
that optimization, although that optimization is the goal here.  It
seems better to get the basic infrastructure in place first.

Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format.  That,
too, will be handled in a follow-on patch.

Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
2016-03-01 21:49:41 -05:00
Simon Riggs 481725c0ba Correct StartupSUBTRANS for page wraparound
StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans
data structure, which in some cases could lead to errors in startup for Hot
Standby.
This patch wraps the pageids correctly, avoiding any such errors.
Identified by exhaustive crash testing by Jeff Janes.

Jeff Janes
2016-02-19 08:31:12 +00:00
Andres Freund 7975c5e0a9 Allow the WAL writer to flush WAL at a reduced rate.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al.  But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4.  The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.

This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.

To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.

Discussion: 20160118163908.GW10941@awork2.anarazel.de
2016-02-16 00:56:34 +01:00
Joe Conway 59a884e985 Change delimiter used for display of NextXID
NextXID has been rendered in the form of a pg_lsn even though it
really is not. This can cause confusion, so change the format from
%u/%u to %u:%u, per discussion on hackers.

Complaint by me, patch by me and Bruce, reviewed by Michael Paquier
and Alvaro. Applied to HEAD only.

Author: Joe Conway, Bruce Momjian
Reviewed-by: Michael Paquier, Alvaro Herrera
Backpatch-through: master
2016-02-12 14:23:59 -08:00
Robert Haas 63461a63f9 Make builtin lwlock tranche names consistent.
Previously, we had a mix of styles.

Amit Kapila
2016-02-12 08:07:11 -05:00
Tom Lane d18643c4a6 Shift the responsibility for emitting "database system is shut down".
Historically this message has been emitted at the end of ShutdownXLOG().
That's not an insane place for it in a standalone backend, but in the
postmaster environment we've grown a fair amount of stuff that happens
later, including archiver/walsender shutdown, stats collector shutdown,
etc.  Recent buildfarm experimentation showed that on slower machines
there could be many seconds' delay between finishing ShutdownXLOG() and
actual postmaster exit.  That's fairly confusing, both for testing
purposes and for DBAs.  Hence, move the code that prints this message
into UnlinkLockFiles(), so that it comes out just after we remove the
postmaster's pidfile.  That is a more appropriate definition of "is shut
down" from the point of view of "pg_ctl stop", for example.  In general,
removing the pidfile should be the last externally-visible action of
either a postmaster or a standalone backend; compare commit
d73d14c271 for instance.  So this seems
like a reasonably future-proof approach.
2016-02-11 14:14:22 -05:00
Tom Lane c5e9b77127 Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier."
This reverts commit 3971f64843 and a
couple of followon debugging commits; I think we've learned what we can
from them.
2016-02-10 16:01:04 -05:00
Tom Lane 7351e18286 Add more chattiness in server shutdown.
Early returns from the buildfarm show that there's a bit of a gap in the
logging I added in 3971f64843b02e4a: the portion of CreateCheckPoint()
after CheckPointGuts() can take a fair amount of time.  Add a few more
log messages in that section of code.  This too shall be reverted later.
2016-02-09 11:21:46 -05:00
Tom Lane 3971f64843 Temporarily make pg_ctl and server shutdown a whole lot chattier.
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints.  Maybe they
are just really overloaded, but maybe something else is going on.  Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.

No attempt has been made to make this pretty.  I'm not even totally sure
if it will build on Windows, but we'll soon find out.
2016-02-08 18:43:11 -05:00
Robert Haas 7c944bd903 Introduce a new GUC force_parallel_mode for testing purposes.
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe.  For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process.  When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.

Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.

Robert Haas, with help from Amit Kapila and Rushabh Lathia.
2016-02-07 11:41:33 -05:00
Robert Haas a1c1af2a1f Introduce group locking to prevent parallel processes from deadlocking.
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes.  There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach.  Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.

Robert Haas, with help from Amit Kapila.  Thanks to Noah Misch and
Andres Freund for extensive discussion of possible issues with this
approach.
2016-02-07 10:16:13 -05:00
Teodor Sigaev f25d07d99f Fix lossy KNN GiST when ordering operator returns non-float8 value.
KNN GiST with recheck flag should return to executor the same type as ordering
operator, GiST detects this type by looking to return type of function which
implements ordering operator. But occasionally detecting code works after
replacing ordering operator function to distance support function.
Distance support function always returns float8, so, detecting code get float8
instead of actual return type of ordering operator.

Built-in opclasses don't have ordering operator which doesn't return
non-float8 value, so, tests are impossible here, at least now.

Backpatch to 9.5 where lozzy KNN was introduced.

Author: Alexander Korotkov
Report by: Artur Zakirov
2016-02-02 15:20:33 +03:00
Robert Haas 7191ce8bea Make all built-in lwlock tranche IDs fixed.
This makes the values more stable, which seems like a good thing for
anybody who needs to look at at them.

Alexander Korotkov and Amit Kapila
2016-02-02 06:45:55 -05:00
Heikki Linnakangas 61ce1e8f15 Fix misspelled function name in comment. 2016-02-01 10:10:24 +02:00
Peter Eisentraut 9217bf3961 Fix whitespace 2016-01-30 15:58:20 -05:00
Fujii Masao 7f46eaf035 Add gin_clean_pending_list function to clean up GIN pending list
This function cleans up the pending list of the GIN index by
moving entries in it to the main GIN data structure in bulk.
It returns the number of pages cleaned up from the pending list.

This function is useful, for example, when the pending list
needs to be cleaned up *quickly* to improve the performance of
the search using GIN index. VACUUM can do the same thing, too,
but it may take days to run on a large table.

Jeff Janes,
reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me.

Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com
2016-01-28 12:57:52 +09:00
Tom Lane cc988fbb0b Improve ResourceOwners' behavior for large numbers of owned objects.
The original coding was quite fast so long as objects were always
released in reverse order of addition; otherwise, it degenerated into
O(N^2) behavior due to searching for the array element to delete.
Improve matters by switching to hashed storage when the number of
objects of a given type exceeds 64.  (The cutover point is open to
discussion, of course, but some simple performance testing suggests
that hashing has enough overhead to be a loser below there.)

Also, refactor resowner.c so that we don't need N copies of the array
management code.  Since all the resource IDs the code currently needs
to deal with are either pointers or integers, it seems sufficient to
create a one-size-fits-all infrastructure in which everything is
converted to a Datum for storage.

Aleksander Alekseev, reviewed by Stas Kelvich, further fixes by me
2016-01-26 15:20:30 -05:00
Tom Lane d9b9289c83 Suppress compiler warning.
Given the limited range of i, these shifts should not cause any
problem, but that apparently doesn't stop some compilers from
whining about them.

David Rowley
2016-01-21 21:14:07 -05:00
Tom Lane be44ed27b8 Improve index AMs' opclass validation procedures.
The amvalidate functions added in commit 65c5fcd353 were on the
crude side.  Improve them in a few ways:

* Perform signature checking for operators and support functions.

* Apply more thorough checks for missing operators and functions,
where possible.

* Instead of reporting problems as ERRORs, report most problems as INFO
messages and make the amvalidate function return FALSE.  This allows
more than one problem to be discovered per run.

* Report object names rather than OIDs, and work a bit harder on making
the messages understandable.

Also, remove a few more opr_sanity regression test queries that are
now superseded by the amvalidate checks.
2016-01-21 19:47:15 -05:00
Fujii Masao 38710a374e Remove unused argument from ginInsertCleanup()
It's an oversight in commit dc943ad.
2016-01-22 01:22:56 +09:00
Simon Riggs c80b31d557 Refactor headers to split out standby defs
Jeff Janes
2016-01-20 18:51:34 -08:00
Simon Riggs 978b2f65aa Speedup 2PC by skipping two phase state files in normal path
2PC state info is written only to WAL at PREPARE, then read back from WAL at
COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr
checkpoint cycle will be written to disk in the same form as previously. Crash
recovery path is not altered. Measured performance gains of 50-100% for short
2PC transactions by completely avoiding writing files and fsyncing. Other
optimizations still available, further patches in related areas expected.

Stas Kelvich and heavily edited by Simon Riggs

Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas,
a concrete example of how Postgres-XC has fed back ideas into PostgreSQL.

Reviewed by Michael Paquier, Jeff Janes and Andres Freund
Performance testing by Jesper Pedersen
2016-01-20 18:40:44 -08:00
Simon Riggs 422a55a687 Refactor to create generic WAL page read callback
Previously we didn’t have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.

As requested by Michael Paquier, Alvaro Herrera and Andres Freund
2016-01-20 17:18:58 -08:00
Tom Lane 9ff60273e3 Fix assorted inconsistencies in GiST opclass support function declarations.
The conventions specified by the GiST SGML documentation were widely
ignored.  For example, the strategy-number argument for "consistent" and
"distance" functions is specified to be a smallint, but most of the
built-in support functions declared it as an integer, and for that matter
the core code passed it using Int32GetDatum not Int16GetDatum.  None of
that makes any real difference at runtime, but it's quite confusing for
newcomers to the code, and it makes it very hard to write an amvalidate()
function that checks support function signatures.  So let's try to instill
some consistency here.

Another similar issue is that the "query" argument is not of a single
well-defined type, but could have different types depending on the strategy
(corresponding to search operators with different righthand-side argument
types).  Some of the functions threw up their hands and declared the query
argument as being of "internal" type, which surely isn't right ("any" would
have been more appropriate); but the majority position seemed to be to
declare it as being of the indexed data type, corresponding to a search
operator with both input types the same.  So I've specified a convention
that that's what to do always.

Also, the result of the "union" support function actually must be of the
index's storage type, but the documentation suggested declaring it to
return "internal", and some of the functions followed that.  Standardize
on telling the truth, instead.

Similarly, standardize on declaring the "same" function's inputs as
being of the storage type, not "internal".

Also, somebody had forgotten to add the "recheck" argument to both
the documentation of the "distance" support function and all of their
SQL declarations, even though the C code was happily using that argument.
Clean that up too.

Fix up some other omissions in the docs too, such as documenting that
union's second input argument is vestigial.

So far as the errors in core function declarations go, we can just fix
pg_proc.h and bump catversion.  Adjusting the erroneous declarations in
contrib modules is more debatable: in principle any change in those
scripts should involve an extension version bump, which is a pain.
However, since these changes are purely cosmetic and make no functional
difference, I think we can get away without doing that.
2016-01-19 12:04:36 -05:00
Tom Lane 65c5fcd353 Restructure index access method API to hide most of it at the C level.
This patch reduces pg_am to just two columns, a name and a handler
function.  All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function.  This is similar to
the designs we've adopted for FDWs and tablesample methods.  There
are multiple advantages.  For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.

A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL.  We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.

Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
2016-01-17 19:36:59 -05:00
Simon Riggs e63bb4549a Add new user fn pg_current_xlog_flush_location()
Tomas Vondra, reviewed by Michael Paquier and Amit Kapila
Minor edits by me
2016-01-12 07:54:52 +00:00
Simon Riggs 1e29e6324c Maintain local LogwrtResult consistently
Teach GetFlushRecPtr() to update LogwrtResult cache as performed by all other
functions in xlog.c
2016-01-12 07:33:20 +00:00
Simon Riggs b602842613 Revoke change to rmgr desc of btree vacuum
Per discussion with Andres Freund
2016-01-09 18:31:08 +00:00
Simon Riggs 687f2cd7a0 Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.

The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. Later patch will
address this.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
2016-01-09 10:10:08 +00:00
Tom Lane 7157fe80f4 Fix overly-strict assertions in spgtextproc.c.
spg_text_inner_consistent is capable of reconstructing an empty string
to pass down to the next index level; this happens if we have an empty
string coming in, no prefix, and a dummy node label.  (In practice, what
is needed to trigger that is insertion of a whole bunch of empty-string
values.)  Then, we will arrive at the next level with in->level == 0
and a non-NULL (but zero length) in->reconstructedValue, which is valid
but the Assert tests weren't expecting it.

Per report from Andreas Seltenreich.  This has no impact in non-Assert
builds, so should not be a problem in production, but back-patch to
all affected branches anyway.

In passing, remove a couple of useless variable initializations and
shorten the code by not duplicating DatumGetPointer() calls.
2016-01-02 16:24:50 -05:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Noah Misch 3cd1ba147e Fix comments about WAL rule "write xlog before data" versus pg_multixact.
Recovery does not achieve its goal of zeroing all pg_multixact entries
whose accompanying WAL records never reached disk.  Remove that claim
and justify its expendability.  Detail the need for TrimMultiXact(),
which has little in common with the TrimCLOG() rationale.  Merge two
tightly-related comments.  Stop presenting pg_multixact as specific to
heap_lock_tuple(); PostgreSQL 9.3 extended its use to heap_update().

Noticed while investigating a report from Andres Freund.
2016-01-01 01:46:46 -05:00
Joe Conway 241448b23a Rename (new|old)estCommitTs to (new|old)estCommitTsXid
The variables newestCommitTs and oldestCommitTs sound as if they are
timestamps, but in fact they are the transaction Ids that correspond
to the newest and oldest timestamps rather than the actual timestamps.
Rename these variables to reflect that they are actually xids: to wit
newestCommitTsXid and oldestCommitTsXid respectively. Also modify
related code in a similar fashion, particularly the user facing output
emitted by pg_controldata and pg_resetxlog.

Complaint and patch by me, review by Tom Lane and Alvaro Herrera.
Backpatch to 9.5 where these variables were first introduced.
2015-12-28 12:34:11 -08:00
Tom Lane 3d2b31e30e Fix brin_summarize_new_values() to check index type and ownership.
brin_summarize_new_values() did not check that the passed OID was for
an index at all, much less that it was a BRIN index, and would fail in
obscure ways if it wasn't (possibly damaging data first?).  It also
lacked any permissions test; by analogy to VACUUM, we should only allow
the table's owner to summarize.

Noted by Jeff Janes, fix by Michael Paquier and me
2015-12-26 12:56:09 -05:00
Robert Haas 9a51698bae Fix typo in comment.
Amit Langote
2015-12-18 12:03:15 -05:00
Robert Haas 3fed417452 Provide a way to predefine LWLock tranche IDs.
It's a bit cumbersome to use LWLockNewTrancheId(), because the returned
value needs to be shared between backends so that each backend can call
LWLockRegisterTranche() with the correct ID.  So, for built-in tranches,
use a hard-coded value instead.

This is motivated by an upcoming patch adding further built-in tranches.

Andres Freund and Robert Haas
2015-12-15 11:48:19 -05:00
Andres Freund cca705a5d9 Fix bug in SetOffsetVacuumLimit() triggered by find_multixact_start() failure.
Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would
install 0 into MultiXactState->offsetStopLimit if it previously succeeded.
Luckily, there are no known cases where find_multixact_start() will return
an error in 9.5 and above. But if it were to happen, for example due to
filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId()
could continue allocating mxids even if close to a wraparound, or it could
erroneously stop allocating mxids, even if no wraparound is looming.  The
wrong value would be corrected the next time SetOffsetVacuumLimit() is
called, or by a restart.

Reported-By: Noah Misch, although this is not his preferred fix
Discussion: 20151210140450.GA22278@alap3.anarazel.de
Backpatch: 9.5, where the bug was introduced as part of 4f627f
2015-12-14 11:34:16 +01:00
Alvaro Herrera 69e7235c93 Fix commit timestamp initialization
This module needs explicit initialization in order to replay WAL records
in recovery, but we had broken this recently following changes to make
other (stranger) scenarios work correctly.  To fix, rework the
initialization sequence so that it always takes place before WAL replay
commences for both master and standby.

I could have gone for a more localized fix that just added a "startup"
call for the master server, but it seemed better to restructure the
existing callers as well so that the whole thing made more sense.  As a
drawback, there is more control logic in xlog.c now than previously, but
doing otherwise meant passing down the ControlFile flag, which seemed
uglier as a whole.

This also meant adding a check to not re-execute ActivateCommitTs if it
had already been called.

Reported by Fujii Masao.

Backpatch to 9.5.
2015-12-11 14:30:43 -03:00
Peter Eisentraut a351705d8a Improve some messages 2015-12-10 22:05:27 -05:00
Andres Freund e3f4cfc7aa Fix bug leading to restoring unlogged relations from empty files.
At the end of crash recovery, unlogged relations are reset to the empty
state, using their init fork as the template. The init fork is copied to
the main fork without going through shared buffers. Unfortunately WAL
replay so far has not necessarily flushed writes from shared buffers to
disk at that point. In normal crash recovery, and before the
introduction of 'fast promotions' in fd4ced523 / 9.3, the
END_OF_RECOVERY checkpoint flushes the buffers out in time. But with
fast promotions that's not the case anymore.

To fix, force WAL writes targeting the init fork to be flushed
immediately (using the new FlushOneBuffer() function). In 9.5+ that
flush can centrally be triggered from the code dealing with restoring
full page writes (XLogReadBufferForRedoExtended), in earlier releases
that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay
function.

Backpatch to 9.1, even if this currently is only known to trigger in
9.3+. Flushing earlier is more robust, and it is advantageous to keep
the branches similar.

Typical symptoms of this bug are errors like
'ERROR:  index "..." contains unexpected zero page at block 0'
shortly after promoting a node.

Reported-By: Thom Brown
Author: Andres Freund and Michael Paquier
Discussion: 20150326175024.GJ451@alap3.anarazel.de
Backpatch: 9.1-
2015-12-10 16:29:26 +01:00
Alvaro Herrera 820ddb2c2f Further tweak commit_timestamp behavior
As pointed out by Fujii Masao, we weren't quite there on a standby
behaving sanely: first because we were failing to acquire the correct
state in the case where no XLOG_PARAMETER_CHANGE message was sent
(because a checkpoint had already happened after the setting was changed
in the master, and then the standby was restarted); and second because
promoting the standby with the feature enabled failed to activate it if
the master had the feature disabled.

This patch fixes both those misbehaviors hopefully without
re-introducing any old problems.

Also change the hint emitted in a standby together with the error
message about the feature being disabled, to make it point out that the
place to chance the setting is the master.  Otherwise, if the setting is
already enabled in the standby, it is very confusing to have it say that
the setting must be enabled ...

Authors: Álvaro Herrera, Petr Jelínek.
Backpatch to 9.5.
2015-12-03 19:22:31 -03:00
Robert Haas 74d0d5f3eb Fix typo in comment.
Amit Langote
2015-11-19 16:45:39 -05:00
Andres Freund d3c8ac114f Remove function names from some elog() calls in heapam.c.
At least one of the names was, due to a function renaming late in the
development of ON CONFLICT, wrong. Since including function names in
error messages is against the message style guide anyway, remove them
from the messages.

Discussion: CAM3SWZT8paz=usgMVHm0XOETkQvzjRtAUthATnmaHQQY0obnGw@mail.gmail.com
Backpatch: 9.5, where ON CONFLICT was introduced
2015-11-19 01:37:58 +01:00
Peter Eisentraut 5db837d3f2 Message improvements 2015-11-16 21:39:23 -05:00
Robert Haas fe702a7b3f Move each SLRU's lwlocks to a separate tranche.
This makes it significantly easier to identify these lwlocks in
LWLOCK_STATS or Trace_lwlocks output.  It's also arguably better
from a modularity standpoint, since lwlock.c no longer needs to
know anything about the LWLock needs of the higher-level SLRU
facility.

Ildus Kurbangaliev, reviewd by Álvaro Herrera and by me.
2015-11-12 14:59:09 -05:00
Robert Haas 64b2e7ad91 Pass extra data to bgworkers, and use this to fix parallel contexts.
Up until now, the total amount of data that could be passed to a
background worker at startup was one datum, which can be a small as
4 bytes on some systems.  That's enough to pass a dsm_handle or an
array index, but not much else.  Add a bgw_extra flag to the
BackgroundWorker struct, allowing up to 128 bytes to be passed to
a new worker on any platform.

Use this to fix a problem I recently discovered with the parallel
context machinery added in 9.5: the master assigns each worker an
array index, and each worker subsequently assigns itself an array
index, and there's nothing to guarantee that the two sets of indexes
match, leading to chaos.

Normally, I would not back-patch the change to add bgw_extra, since it
is basically a feature addition.  However, since 9.5 is still in beta
and there seems to be no other sensible way to repair the broken
parallel context machinery, back-patch to 9.5.  Existing background
worker code can ignore the bgw_extra field without a problem, but
might need to be recompiled since the structure size has changed.

Report and patch by me.  Review by Amit Kapila.
2015-11-05 12:13:56 -05:00
Kevin Grittner 585e2a3b1a Fix serialization anomalies due to race conditions on INSERT.
On insert the CheckForSerializableConflictIn() test was performed
before the page(s) which were going to be modified had been locked
(with an exclusive buffer content lock).  If another process
acquired a relation SIReadLock on the heap and scanned to a page on
which an insert was going to occur before the page was so locked,
a rw-conflict would be missed, which could allow a serialization
anomaly to be missed.  The window between the check and the page
lock was small, so the bug was generally not noticed unless there
was high concurrency with multiple processes inserting into the
same table.

This was reported by Peter Bailis as bug #11732, by Sean Chittenden
as bug #13667, and by others.

The race condition was eliminated in heap_insert() by moving the
check down below the acquisition of the buffer lock, which had been
the very next statement.  Because of the loop locking and unlocking
multiple buffers in heap_multi_insert() a check was added after all
inserts were completed.  The check before the start of the inserts
was left because it might avoid a large amount of work to detect a
serialization anomaly before performing the all of the inserts and
the related WAL logging.

While investigating this bug, other SSI bugs which were even harder
to hit in practice were noticed and fixed, an unnecessary check
(covered by another check, so redundant) was removed from
heap_update(), and comments were improved.

Back-patch to all supported branches.

Kevin Grittner and Thomas Munro
2015-10-31 14:43:34 -05:00
Robert Haas 3a1f8611f2 Update parallel executor support to reuse the same DSM.
Commit b0b0d84b3d purported to make it
possible to relaunch workers using the same parallel context, but it had
an unpleasant race condition: we might reinitialize after the workers
have sent their last control message but before they have dettached the
DSM, leaving to crashes.  Repair by introducing a new ParallelContext
operation, ReinitializeParallelDSM.

Adjust execParallel.c to use this new support, so that we can rescan a
Gather node by relaunching workers but without needing to recreate the
DSM.

Amit Kapila, with some adjustments by me.  Extracted from latest parallel
sequential scan patch.
2015-10-30 10:44:54 +01:00
Peter Eisentraut a8d585c091 Message style improvements
Message style, plurals, quoting, spelling, consistency with similar
messages
2015-10-28 20:38:36 -04:00
Alvaro Herrera 21a4e4a4c9 Fix BRIN free space computations
A bug in the original free space computation made it possible to
return a page which wasn't actually able to fit the item.  Since the
insertion code isn't prepared to deal with PageAddItem failing, a PANIC
resulted ("failed to add BRIN tuple [to new page]").  Add a macro to
encapsulate the correct computation, and use it in
brin_getinsertbuffer's callers before calling that routine, to raise an
early error.

I became aware of the possiblity of a problem in this area while working
on ccc4c07499.  There's no archived discussion about it, but it's
easy to reproduce a problem in the unpatched code with something like

CREATE TABLE t (a text);
CREATE INDEX ti ON t USING brin (a) WITH (pages_per_range=1);

for length in `seq 8000 8196`
do
	psql -f - <<EOF
TRUNCATE TABLE t;
INSERT INTO t VALUES ('z'), (repeat('a', $length));
EOF
done

Backpatch to 9.5, where BRIN was introduced.
2015-10-27 18:17:55 -03:00
Alvaro Herrera 531d21b75f Cleanup commit timestamp module activaction, again
Further tweak commit_ts.c so that on a standby the state is completely
consistent with what that in the master, rather than behaving
differently in the cases that the settings differ.  Now in standby and
master the module should always be active or inactive in lockstep.

Author: Petr Jelínek, with some further tweaks by Álvaro Herrera.

Backpatch to 9.5, where commit timestamps were introduced.

Discussion: http://www.postgresql.org/message-id/5622BF9D.2010409@2ndquadrant.com
2015-10-27 15:06:50 -03:00
Robert Haas 31ba62ce32 Fix typos in comments.
CharSyam
2015-10-22 14:52:23 -04:00
Robert Haas ee7ca559fc Add a C API for parallel heap scans.
Using this API, one backend can set up a ParallelHeapScanDesc to
which multiple backends can then attach.  Each tuple in the relation
will be returned to exactly one of the scanning backends.  Only
forward scans are supported, and rescans must be carefully
coordinated.

This is not exposed to the planner or executor yet.

The original version of this code was written by me.  Amit Kapila
reviewed it, tested it, and improved it, including adding support for
synchronized scans, per review comments from Jeff Davis.  Extensive
testing of this and related patches was performed by Haribabu Kommi.
Final cleanup of this patch by me.
2015-10-16 17:33:18 -04:00
Robert Haas b0b0d84b3d Allow a parallel context to relaunch workers.
This may allow some callers to avoid the overhead involved in tearing
down a parallel context and then setting up a new one, which means
releasing the DSM and then allocating and populating a new one.  I
suspect we'll want to revise the Gather node to make use of this new
capability, but even if not it may be useful elsewhere and requires
very little additional code.
2015-10-16 17:18:05 -04:00
Robert Haas a53c06a13e Prohibit parallel query when the isolation level is serializable.
In order for this to be safe, the code which hands true serializability
will need to taught that the SIRead locks taken by a parallel worker
pertain to the same transaction as those taken by the parallel leader.
Some further changes may be needed as well.  Until the necessary
adaptations are made, don't generate parallel plans in serializable
mode, and if a previously-generated parallel plan is used after
serializable mode has been activated, run it serially.

This fixes a bug in commit 7aea8e4f2d.
2015-10-16 11:58:27 -04:00
Robert Haas 82b37765c7 Fix a problem with parallel workers being unable to restore role.
check_role() tries to verify that the user has permission to become the
requested role, but this is inappropriate in a parallel worker, which
needs to exactly recreate the master's authorization settings.  So skip
the check in that case.

This fixes a bug in commit 924bcf4f16.
2015-10-16 11:37:19 -04:00
Robert Haas 6de6d96d97 Invalidate caches after cranking up a parallel worker transaction.
Starting a parallel worker transaction changes our notion of which XIDs
are in-progress or committed, and our notion of the current command
counter ID.  Therefore, our view of these caches prior to starting
this transaction may no longer valid.  Defend against that by clearing
them.

This fixes a bug in commit 924bcf4f16.
2015-10-16 11:31:23 -04:00
Robert Haas 94b4f7e2a6 Tighten up application of parallel mode checks.
Commit 924bcf4f16 failed to enforce
parallel mode checks during the commit of a parallel worker, because
we exited parallel mode prior to ending the transaction so that we
could pop the active snapshot.  Re-establish parallel mode during
parallel worker commit.  Without this, it's far too easy for unsafe
actions during the pre-commit sequence to crash the server instead of
hitting the error checks as intended.

Just to be extra paranoid, adjust a couple of the sanity checks in
xact.c to check not only IsInParallelMode() but also
IsParallelWorker().
2015-10-16 09:59:57 -04:00
Robert Haas 423ec0877f Transfer current command counter ID to parallel workers.
Commit 924bcf4f16 correctly forbade
parallel workers to modify the command counter while in parallel mode,
but it inexplicably neglected to actually transfer the current command
counter from leader to workers.  This can result in the workers seeing
a different set of tuples from the leader, which is bad.  Repair.
2015-10-16 09:58:42 -04:00
Robert Haas 2ad5c27bb5 Don't send protocol messages to a shm_mq that no longer exists.
Commit 2bd9e412f9 introduced a mechanism
for relaying protocol messages from a background worker to another
backend via a shm_mq.  However, there was no provision for shutting
down the communication channel.  Therefore, a protocol message sent
late in the shutdown sequence, such as a DEBUG message resulting from
cranking up log_min_messages, could crash the server.  To fix, install
an on_dsm_detach callback that disables sending messages to the shm_mq
when the associated DSM is detached.
2015-10-16 09:42:33 -04:00
Andres Freund 2596d705bd Re-Align *_freeze_max_age reloption limits with corresponding GUC limits.
In 020235a575 I lowered the autovacuum_*freeze_max_age minimums to
allow for easier testing of wraparounds. I did not touch the
corresponding per-table limits. While those don't matter for the purpose
of wraparound, it seems more consistent to lower them as well.

It's noteworthy that the previous reloption lower limit for
autovacuum_multixact_freeze_max_age was too high by one magnitude, even
before 020235a575.

Discussion: 26377.1443105453@sss.pgh.pa.us
Backpatch: back to 9.0 (in parts), like the prior patch
2015-10-05 11:53:43 +02:00
Alvaro Herrera e06b2e1d2e Don't disable commit_ts in standby if enabled locally
Bug noticed by Fujii Masao
2015-10-02 12:49:01 -03:00
Peter Eisentraut 87c2b517ac Fix message punctuation according to style guide 2015-10-01 21:39:56 -04:00
Alvaro Herrera f12e814b88 Fix commit_ts for standby
Module initialization was still not completely correct after commit
6b61955135, per crash report from Takashi Ohnishi.  To fix, instead of
trying to monkey around with the value of the GUC setting directly, add
a separate boolean flag that enables the feature on a standby, but only
for the startup (recovery) process, when it sees that its master server
has the feature enabled.
Discussion: http://www.postgresql.org/message-id/ca44c6c7f9314868bdc521aea4f77cbf@MP-MSGSS-MBX004.msg.nttdata.co.jp

Also change the deactivation routine to delete all segment files rather
than leaving the last one around.  (This doesn't need separate
WAL-logging, because on recovery we execute the same deactivation
routine anyway.)

In passing, clean up the code structure somewhat, particularly so that
xlog.c doesn't know so much about when to activate/deactivate the
feature.

Thanks to Fujii Masao for testing and Petr Jelínek for off-list discussion.

Back-patch to 9.5, where commit_ts was introduced.
2015-10-01 15:06:55 -03:00
Robert Haas 227d57f358 Don't dump core when destroying an unused ParallelContext.
If a transaction or subtransaction creates a ParallelContext but ends
without calling InitializeParallelDSM, the previous code would
seg fault.  Fix that.
2015-09-30 18:36:31 -04:00
Alvaro Herrera 6b61955135 Code review for transaction commit timestamps
There are three main changes here:

1. No longer cause a start failure in a standby if the feature is
disabled in postgresql.conf but enabled in the master.  This reverts one
part of commit 4f3924d9cd43; what we keep is the ability of the standby
to activate/deactivate the module (which includes creating and removing
segments as appropriate) during replay of such actions in the master.

2. Replay WAL records affecting commitTS even if the feature is
disabled.  This means the standby will always have the same state as the
master after replay.

3. Have COMMIT PREPARE record the transaction commit time as well.  We
were previously only applying it in the normal transaction commit path.

Author: Petr Jelínek
Discussion: http://www.postgresql.org/message-id/CAHGQGwHereDzzzmfxEBYcVQu3oZv6vZcgu1TPeERWbDc+gQ06g@mail.gmail.com
Discussion: http://www.postgresql.org/message-id/CAHGQGwFuzfO4JscM9LCAmCDCxp_MfLvN4QdB+xWsS-FijbjTYQ@mail.gmail.com

Additionally, I cleaned up nearby code related to replication origins,
which I found a bit hard to follow, and fixed a couple of typos.

Backpatch to 9.5, where this code was introduced.

Per bug reports from Fujii Masao and subsequent discussion.
2015-09-29 14:40:56 -03:00
Alvaro Herrera 17f5831c81 Fix "sesssion" typo
It was introduced alongside replication origins, by commit
5aa2350426, so backpatch to 9.5.

Pointed out by Fujii Masao
2015-09-28 19:13:42 -03:00
Andres Freund aa29c1ccd9 Remove legacy multixact truncation support.
In 9.5 and master there is no need to support legacy truncation. This is
just committed separately to make it easier to backpatch the WAL logged
multixact truncation to 9.3 and 9.4 if we later decide to do so.

I bumped master's magic from 0xD086 to 0xD088 and 9.5's from 0xD085 to
0xD087 to avoid 9.5 reusing a value that has been in use on master while
keeping the numbers increasing between major versions.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Backpatch: 9.5
2015-09-26 19:04:25 +02:00
Andres Freund 4f627f8973 Rework the way multixact truncations work.
The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others it requires to do computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated, but
offset not.

We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for multixact truncations.

This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
   to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
   we can just use the master's values.
3) simplify a fair amount of logic to keep in memory limits straight,
   this has gotten much easier

During the course of fixing this a bunch of additional bugs had to be
fixed:
1) Data was not purged from memory the member's SLRU before deleting
   segments. This happened to be hard or impossible to hit due to the
   interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
   that doesn't work for offsets that haven't yet been flushed to
   disk. Add code to flush the SLRUs to fix. Not pretty, but it feels
   slightly safer to only make decisions based on actual on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
   and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
   of emergency vacuuming. The problem remains in
   pg_get_multixact_members(), but that's quite harmless.

For now this is going to only get applied to 9.5+, leaving the issues in
the older branches in place. It is quite possible that we need to
backpatch at a later point though.

For the case this gets backpatched we need to handle that an updated
standby may be replaying WAL from a not-yet upgraded primary. We have to
recognize that situation and use "old style" truncation (i.e. looking at
the SLRUs) during WAL replay. In contrast to before, this now happens in
the startup process, when replaying a checkpoint record, instead of the
checkpointer. Doing truncation in the restartpoint is incorrect, they
can happen much later than the original checkpoint, thereby leading to
wraparound.  To avoid "multixact_redo: unknown op code 48" errors
standbys would have to be upgraded before primaries.

A later patch will bump the WAL page magic, and remove the legacy
truncation codepaths. Legacy truncation support is just included to make
a possible future backpatch easier.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro
Backpatch: 9.5 for now
2015-09-26 19:04:25 +02:00
Teodor Sigaev dc943ad952 Allow autoanalyze to add pages deleted from pending list to FSM
Commit e956808328 introduces adding pages
to FSM for ordinary insert, but autoanalyze was able just cleanup
pending list without adding to FSM.

Also fix double call of IndexFreeSpaceMapVacuum() during ginvacuumcleanup()

Report from Fujii Masao
Patch by me
Review by Jeff Janes
2015-09-23 15:33:51 +03:00
Peter Eisentraut 4a1e15e4a9 Add missing serial comma 2015-09-18 22:41:42 -04:00
Teodor Sigaev 22f519c92a Fix bug introduced by microvacuum for GiST
Commit 013ebc0a7b introduces microvacuum for
GiST, deletetion of tuple marked LP_DEAD uses IndexPageMultiDelete while
recovery code uses IndexPageTupleDelete in loop. This causes a difference
in offset numbers of tuples to delete. Patch introduces usage of
IndexPageMultiDelete in GiST except gistplacetopage() where only one tuple is
deleted at once. That also slightly improve performance, because
IndexPageMultiDelete is more effective.

Patch changes WAL format, so bump wal page magic.

Bug report from Jeff Janes
Diagnostic and patch by Anastasia Lubennikova and me
2015-09-17 14:22:37 +03:00
Fujii Masao 10fbb79f1a Improve log messages related to tablespace_map file
This patch changes the log message which is logged when the server
successfully renames backup_label file to *.old but fails to rename
tablespace_map file during the shutdown. Previously the WARNING
message "online backup mode was not canceled" was logged in that case.
However this message is confusing because the backup mode is treated
as canceled whenever backup_label is successfully renamed. So this
commit makes the server log the message "online backup mode canceled"
in that case.

Also this commit changes errdetail messages so that they follow the
error message style guide.

Back-patch to 9.5 where tablespace_map file is introduced.

Original patch by Amit Kapila, heavily modified by me.
2015-09-15 23:21:51 +09:00
Alvaro Herrera 5cd6538345 Add missing ReleaseBuffer call in BRIN revmap code
I think this particular branch is actually dead, but the analysis to
prove that is not trivial, so instead take the weasel way.

Reported by Jinyu Zhang
Backpatch to 9.5, where BRIN was introduced.
2015-09-11 15:29:46 -03:00
Teodor Sigaev 223936e226 Fix oversight in 013ebc0a7b commit
Declaration of varibale inside ÓÝ×Õ
2015-09-09 19:21:16 +03:00
Teodor Sigaev 013ebc0a7b Microvacuum for GIST
Mark index tuple as dead if it's pointed by kill_prior_tuple during
ordinary (search) scan and remove it during insert process if there is no
enough space for new tuple to insert. This improves select performance
because index will not return tuple marked as dead and improves insert
performance because it reduces number of page split.

Anastasia Lubennikova <a.lubennikova@postgrespro.ru> with
 minor editorialization by me
2015-09-09 18:43:37 +03:00
Fujii Masao 96f6a0cb41 Remove files signaling a standby promotion request at postmaster startup
This commit makes postmaster forcibly remove the files signaling
a standby promotion request. Otherwise, the existence of those files
can trigger a promotion too early, whether a user wants that or not.

This removal of files is usually unnecessary because they can exist
only during a few moments during a standby promotion. However
there is a race condition: if pg_ctl promote is executed and creates
the files during a promotion, the files can stay around even after
the server is brought up to new master. Then, if new standby starts
by using the backup taken from that master, the files can exist
at the server startup and should be removed in order to avoid
an unexpected promotion.

Back-patch to 9.1 where promote signal file was introduced.

Problem reported by Feike Steenbergen.
Original patch by Michael Paquier, modified by me.

Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org
2015-09-09 22:51:44 +09:00
Alvaro Herrera 1aba62ec63 Allow per-tablespace effective_io_concurrency
Per discussion, nowadays it is possible to have tablespaces that have
wildly different I/O characteristics from others.  Setting different
effective_io_concurrency parameters for those has been measured to
improve performance.

Author: Julien Rouhaud
Reviewed by: Andres Freund
2015-09-08 12:51:42 -03:00
Teodor Sigaev e26692248a Make GIN's cleanup pending list process interruptable
Cleanup process could be called by ordinary insert/update and could take a lot
of time. Add vacuum_delay_point() to make this process interruptable. Under
vacuum this call will also throttle a vacuum process to decrease system load,
called from insert/update it will not throttle, and that reduces a latency.

Backpatch for all supported branches.

Jeff Janes <jeff.janes@gmail.com>
2015-09-07 17:16:29 +03:00
Teodor Sigaev e956808328 Add pages deleted from pending list to FSM
Add pages deleted from GIN's pending list during cleanup to free space map
immediately. Clean up process could be initiated by ordinary insert but adding
page to FSM might occur only at vacuum. On some workload like never-vacuumed
insert-only tables it could cause a huge bloat.

Jeff Janes <jeff.janes@gmail.com>
2015-09-07 16:24:01 +03:00
Heikki Linnakangas c80b5f66c6 Fix misc typos.
Oskari Saarenmaa. Backpatch to stable branches where applicable.
2015-09-05 11:35:49 +03:00
Tatsuo Ishii c39f5674df Fix brin index summarizing while vacuuming.
If the number of heap blocks is not multiples of pages per range, the
summarizing produces wrong summary information for the last brin index
tuple while vacuuming.

Problem reported by Tatsuo Ishii and fixed by Amit Langote.

Discussion at "[HACKERS] BRIN INDEX value (message id :20150903.174935.1946402199422994347.t-ishii@sraoss.co.jp)
Backpatched to 9.5 in which brin index was added.
2015-09-05 09:19:25 +09:00
Tom Lane c5454f99c4 Fix subtransaction cleanup after an outer-subtransaction portal fails.
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort.  However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.

To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed.  (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)

In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact.  This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.

The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount.  That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache.  This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.

This has been broken since subtransactions were invented, so back-patch
to all supported branches.

Tom Lane and Michael Paquier
2015-09-04 13:37:14 -04:00
Fujii Masao 1ea5ce5c5f Document that max_worker_processes must be high enough in standby.
The setting values of some parameters including max_worker_processes
must be equal to or higher than the values on the master. However,
previously max_worker_processes was not listed as such parameter
in the document. So this commit adds it to that list.

Back-patch to 9.4 where max_worker_processes was added.
2015-09-03 22:30:16 +09:00
Teodor Sigaev 30bb26b5e0 Allow usage of huge maintenance_work_mem for GIN build.
Currently, in-memory posting list during GIN build process is limited 1GB
because of using repalloc. The patch replaces call of repalloc to repalloc_huge.
It increases limit of posting list from 180 millions
(1GB / sizeof(ItemPointerData)) to 4 billions limited by maxcount/count fields
in GinEntryAccumulator and subsequent calls. Check added.

Also, fix accounting of allocatedMemory during build to prevent integer
overflow with maintenance_work_mem > 4GB.

Robert Abraham <robert.abraham86@googlemail.com> with additions by me
2015-09-02 20:08:58 +03:00
Alvaro Herrera e68be16b0d Do not allow *timestamp to be passed as NULL
The code had bugs that would cause crashes if NULL was passed as that
argument (originally intended to mean not to bother returning its
value), and after inspection it turns out that nothing seems interested
in the case that *ts is NULL anyway.  Therefore, remove the partial
checks intended to support that case.

Author: Michael Paquier
though I didn't include a proposed Assert.

Backpatch to 9.5.
2015-08-21 14:36:54 -03:00
Andres Freund e95126cf04 Don't use function definitions looking like old-style ones.
This fixes a bunch of somewhat pedantic warnings with new
compilers. Since by far the majority of other functions definitions use
the (void) style it just seems to be consistent to do so as well in the
remaining few places.
2015-08-15 17:25:00 +02:00
Simon Riggs 47167b7907 Reduce lock levels for ALTER TABLE SET autovacuum storage options
Reduce lock levels down to ShareUpdateExclusiveLock for all autovacuum-related
relation options when setting them using ALTER TABLE.

Add infrastructure to allow varying lock levels for relation options in later
patches. Setting multiple options together uses the highest lock level required
for any option. Works for both main and toast tables.

Fabrízio Mello, reviewed by Michael Paquier, mild edit and additional regression
tests from myself
2015-08-14 14:19:28 +01:00
Alvaro Herrera fcbf455842 Fix unitialized variables
As complained by clang, reported by Andres Freund.  Brown paper bag bug
in ccc4c07499.

Add some comments, too.

Backpatch to 9.5, like that one.
2015-08-13 00:12:07 -03:00
Alvaro Herrera ccc4c07499 Close some holes in BRIN page assignment
In some corner cases, it is possible for the BRIN index relation to be
extended by brin_getinsertbuffer but the new page not be used
immediately for anything by its callers; when this happens, the page is
initialized and the FSM is updated (by brin_getinsertbuffer) with the
info about that page, but these actions are not WAL-logged.  A later
index insert/update can use the page, but since the page is already
initialized, the initialization itself is not WAL-logged then either.
Replay of this sequence of events causes recovery to fail altogether.

There is a related corner case within brin_getinsertbuffer itself, in
which we extend the relation to put a new index tuple there, but later
find out that we cannot do so, and do not return the buffer; the page
obtained from extension is not even initialized.  The resulting page is
lost forever.

To fix, shuffle the code so that initialization is not the
responsibility of brin_getinsertbuffer anymore, in normal cases;
instead, the initialization is done by its callers (brin_doinsert and
brin_doupdate) once they're certain that the page is going to be used.
When either those functions determine that the new page cannot be used,
before bailing out they initialize the page as an empty regular page,
enter it in FSM and WAL-log all this.  This way, the page is usable for
future index insertions, and WAL replay doesn't find trying to insert
tuples in pages whose initialization didn't make it to the WAL.  The
same strategy is used in brin_getinsertbuffer when it cannot return the
new page.

Additionally, add a new step to vacuuming so that all pages of the index
are scanned; whenever an uninitialized page is found, it is initialized
as empty and WAL-logged.  This closes the hole that the relation is
extended but the system crashes before anything is WAL-logged about it.
We also take this opportunity to update the FSM, in case it has gotten
out of date.

Thanks to Heikki Linnakangas for finding the problem that kicked some
additional analysis of BRIN page assignment code.

Backpatch to 9.5, where BRIN was introduced.

Discussion: https://www.postgresql.org/message-id/20150723204810.GY5596@postgresql.org
2015-08-12 14:20:38 -03:00
Andres Freund 18e8613564 Address points made in post-commit review of replication origins.
Amit reviewed the replication origins patch and made some good
points. Address them. This fixes typos in error messages, docs and
comments and adds a missing error check (although in a
should-never-happen scenario).

Discussion: CAA4eK1JqUBVeWWKwUmBPryFaje4190ug0y-OAUHWQ6tD83V4xg@mail.gmail.com
Backpatch: 9.5, where replication origins were introduced.
2015-08-07 15:09:05 +02:00
Robert Haas 0e141c0fbb Reduce ProcArrayLock contention by removing backends in batches.
When a write transaction commits, it must clear its XID advertised via
the ProcArray, which requires that we hold ProcArrayLock in exclusive
mode in order to prevent concurrent processes running GetSnapshotData
from seeing inconsistent results.  When many processes try to commit
at once, ProcArrayLock must change hands repeatedly, with each
concurrent process trying to commit waking up to acquire the lock in
turn.  To make things more efficient, when more than one backend is
trying to commit a write transaction at the same time, have just one
of them acquire ProcArrayLock in exclusive mode and clear the XIDs of
all processes in the group.  Benchmarking reveals that this is much
more efficient at very high client counts.

Amit Kapila, heavily revised by me, with some review also from Pavan
Deolasee.
2015-08-06 12:02:12 -04:00
Alvaro Herrera 2834855cb9 Fix BRIN to use SnapshotAny during summarization
For correctness of summarization results, it is critical that the
snapshot used during the summarization scan is able to see all tuples
that are live to all transactions -- including tuples inserted or
deleted by in-progress transactions.  Otherwise, it would be possible
for a transaction to insert a tuple, then idle for a long time while a
concurrent transaction executes summarization of the range: this would
result in the inserted value not being considered in the summary.
Previously we were trying to use a MVCC snapshot in conjunction with
adding a "placeholder" tuple in the index: the snapshot would see all
committed tuples, and the placeholder tuple would catch insertions by
any new inserters.  The hole is that prior insertions by transactions
that are still in progress by the time the MVCC snapshot was taken were
ignored.

Kevin Grittner reported this as a bogus error message during vacuum with
default transaction isolation mode set to repeatable read (because the
error report mentioned a function name not being invoked during), but
the problem is larger than that.

To fix, tweak IndexBuildHeapRangeScan to have a new mode that behaves
the way we need using SnapshotAny visibility rules.  This change
simplifies the BRIN code a bit, mainly by removing large comments that
were mistaken.  Instead, rely on the SnapshotAny semantics to provide
what it needs.  (The business about a placeholder tuple needs to remain:
that covers the case that a transaction inserts a a tuple in a page that
summarization already scanned.)

Discussion: https://www.postgresql.org/message-id/20150731175700.GX2441@postgresql.org

In passing, remove a couple of unused declarations from brin.h and
reword a comment to be proper English.  This part submitted by Kevin
Grittner.

Backpatch to 9.5, where BRIN was introduced.
2015-08-05 16:20:50 -03:00
Fujii Masao dd85acf0c4 Make recovery rename tablespace_map to *.old if backup_label is not present.
If tablespace_map file is present without backup_label file, there is
no use of such file.  There is no harm in retaining it, but it is better
to get rid of the map file so that we don't have any redundant file
in data directory and it will avoid any sort of confusion. It seems
prudent though to just rename the file out of the way rather than
delete it completely, also we ignore any error that occurs in rename
operation as even if map file is present without backup_label file,
it is harmless.

Back-patch to 9.5 where tablespace_map file was introduced.

Amit Kapila, reviewed by Robert Haas, Alvaro Herrera and me.
2015-08-03 23:04:41 +09:00
Tom Lane 09cecdf285 Fix a number of places that produced XX000 errors in the regression tests.
It's against project policy to use elog() for user-facing errors, or to
omit an errcode() selection for errors that aren't supposed to be "can't
happen" cases.  Fix all the violations of this policy that result in
ERRCODE_INTERNAL_ERROR log entries during the standard regression tests,
as errors that can reliably be triggered from SQL surely should be
considered user-facing.

I also looked through all the files touched by this commit and fixed
other nearby problems of the same ilk.  I do not claim to have fixed
all violations of the policy, just the ones in these files.

In a few places I also changed existing ERRCODE choices that didn't
seem particularly appropriate; mainly replacing ERRCODE_SYNTAX_ERROR
by something more specific.

Back-patch to 9.5, but no further; changing ERRCODE assignments in
stable branches doesn't seem like a good idea.
2015-08-02 23:49:19 -04:00
Heikki Linnakangas 358cde320b Fix race condition that lead to WALInsertLock deadlock with commit_delay.
If a call to WaitForXLogInsertionsToFinish() returned a value in the middle
of a page, and another backend then started to insert a record to the same
page, and then you called WaitXLogInsertionsToFinish() again, the second
call might return a smaller value than the first call. The problem was in
GetXLogBuffer(), which always updated the insertingAt value to the
beginning of the requested page, not the actual requested location. Because
of that, the second call might return a xlog pointer to the beginning of
the page, while the first one returned a later position on the same page.
XLogFlush() performs two calls to WaitXLogInsertionsToFinish() in
succession, and holds WALWriteLock on the second call, which can deadlock
if the second call to WaitXLogInsertionsToFinish() blocks.

Reported by Spiros Ioannou. Backpatch to 9.4, where the more scalable
WALInsertLock mechanism, and this bug, was introduced.
2015-08-02 20:08:10 +03:00
Andres Freund 7039760114 Fix issues around the "variable" support in the lwlock infrastructure.
The lwlock scalability work introduced two race conditions into the
lwlock variable support provided for xlog.c. First, and harmlessly on
most platforms, it set/read the variable without the spinlock in some
places. Secondly, due to the removal of the spinlock, it was possible
that a backend missed changes to the variable's state if it changed in
the wrong moment because checking the lock's state, the variable's state
and the queuing are not protected by a single spinlock acquisition
anymore.

To fix first move resetting the variable's from LWLockAcquireWithVar to
WALInsertLockRelease, via a new function LWLockReleaseClearVar. That
prevents issues around waiting for a variable's value to change when a
new locker has acquired the lock, but not yet set the value. Secondly
re-check that the variable hasn't changed after enqueing, that prevents
the issue that the lock has been released and already re-acquired by the
time the woken up backend checks for the lock's state.

Reported-By: Jeff Janes
Analyzed-By: Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: 5592DB35.2060401@iki.fi
Backpatch: 9.5, where the lwlock scalability went in
2015-08-02 18:41:23 +02:00
Alvaro Herrera c81276241b Fix broken assertion in BRIN code
The code was assuming that any NULL value in scan keys was due to IS
NULL or IS NOT NULL, but it turns out to be possible to get them with
other operators too, if they are used in contrived-enough ways.  Easiest
way out of the problem seems to check explicitely for the IS NOT NULL
flag, instead of assuming it must be set if the IS NULL flag is not set,
when a null scan key is found; if neither flag is set, follow the lead
of other index AMs and assume that all indexable operators must be
strict, and thus the query is never satisfiable.

Also, add a comment to try and lure some future hacker into improving
analysis of scan keys in brin.

Per report from Andreas Seltenreich; diagnosis by Tom Lane.
Backpatch to 9.5.

Discussion: http://www.postgresql.org/message-id/20646.1437919632@sss.pgh.pa.us
2015-07-30 15:07:19 -03:00
Joe Conway 7b4bfc87d5 Plug RLS related information leak in pg_stats view.
The pg_stats view is supposed to be restricted to only show rows
about tables the user can read. However, it sometimes can leak
information which could not otherwise be seen when row level security
is enabled. Fix that by not showing pg_stats rows to users that would
be subject to RLS on the table the row is related to. This is done
by creating/using the newly introduced SQL visible function,
row_security_active().

Along the way, clean up three call sites of check_enable_rls(). The second
argument of that function should only be specified as other than
InvalidOid when we are checking as a different user than the current one,
as in when querying through a view. These sites were passing GetUserId()
instead of InvalidOid, which can cause the function to return incorrect
results if the current user has the BYPASSRLS privilege and row_security
has been set to OFF.

Additionally fix a bug causing RI Trigger error messages to unintentionally
leak information when RLS is enabled, and other minor cleanup and
improvements. Also add WITH (security_barrier) to the definition of pg_stats.

Bumped CATVERSION due to new SQL functions and pg_stats view definition.

Back-patch to 9.5 where RLS was introduced. Reported by Yaroslav.
Patch by Joe Conway and Dean Rasheed with review and input by
Michael Paquier and Stephen Frost.
2015-07-28 13:21:22 -07:00
Heikki Linnakangas 5e65f45c6e Another attempt at fixing memory leak in xlogreader.
max_block_id is also reset between reading records.

Michael Paquier
2015-07-28 09:09:36 +03:00
Heikki Linnakangas 820d1ced1b Don't assume that PageIsEmpty() returns true on an all-zeros page.
It does currently, and I don't see us changing that any time soon, but we
don't make that assumption anywhere else.

Per Tom Lane's suggestion. Backpatch to 9.2, like the previous patch that
added this assumption.
2015-07-27 18:54:09 +03:00
Heikki Linnakangas 61a65c53bd Fix memory leak in xlogreader facility.
XLogReaderFree failed to free the per-block data buffers, when they
happened to not be used by the latest read WAL record.

Michael Paquier. Backpatch to 9.5, where the per-block buffers were added.
2015-07-27 18:29:31 +03:00
Heikki Linnakangas 334445179c Reuse all-zero pages in GIN.
In GIN, an all-zeros page would be leaked forever, and never reused. Just
add them to the FSM in vacuum, and they will be reinitialized when grabbed
from the FSM. On master and 9.5, attempting to access the page's opaque
struct also caused an assertion failure, although that was otherwise
harmless.

Reported by Jeff Janes. Backpatch to all supported versions.
2015-07-27 12:30:26 +03:00
Heikki Linnakangas 023430abf7 Fix handling of all-zero pages in SP-GiST vacuum.
SP-GiST initialized an all-zeros page at vacuum, but that was not
WAL-logged, which is not safe. You might get a torn page write, when it gets
flushed to disk, and end-up with a half-initialized index page. To fix,
leave it in the all-zeros state, and add it to the FSM. It will be
initialized when reused. Also don't set the page-deleted flag when recycling
an empty page. That was also not WAL-logged, and a torn write of that would
cause the page to have an invalid checksum.

Backpatch to 9.2, where SP-GiST indexes were added.
2015-07-27 12:28:21 +03:00
Heikki Linnakangas 65c384c5ab Avoid calling PageGetSpecialPointer() on an all-zeros page.
That was otherwise harmless, but tripped the new assertion in
PageGetSpecialPointer().

Reported by Amit Langote. Backpatch to 9.5, where the assertion was added.
2015-07-27 12:24:27 +03:00
Tom Lane d9476b8380 Dodge portability issue (apparent compiler bug) in new tablesample code.
Some of the older OS X critters in the buildfarm are failing regression,
with symptoms showing that a request for 100% sampling in BERNOULLI or
SYSTEM methods actually gets only around 50% of the table.  gdb revealed
that the computation of the "cutoff" number was producing 0x7FFFFFFF
rather than the expected 0x100000000.  Inspecting the assembly code,
it looks like gcc is trying to use lrint() instead of rint() and then
fumbling the conversion from long double to uint64.  This seems like a
clear compiler bug, but assigning the intermediate result into a plain
double variable works around it, so let's just do that.  (Another idea
would be to give up one bit of hash width so that we don't need to use
a uint64 cutoff, but let's see if this is enough.)
2015-07-25 19:42:32 -04:00
Tom Lane dd7a8f66ed Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM.  (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.)  Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type.  This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.

Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.

Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.

Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 14:39:00 -04:00
Heikki Linnakangas 766dcfb16c Fix off-by-one error in calculating subtrans/multixact truncation point.
If there were no subtransactions (or multixacts) active, we would calculate
the oldestxid == next xid. That's correct, but if next XID happens to be
on the next pg_subtrans (pg_multixact) page, the page does not exist yet,
and SimpleLruTruncate will produce an "apparent wraparound" warning. The
warning is harmless in this case, but looks very alarming to users.

Backpatch to all supported versions. Patch and analysis by Thomas Munro.
2015-07-23 01:29:59 +03:00
Tom Lane 434873806a Fix some oversights in BRIN patch.
Remove HeapScanDescData.rs_initblock, which wasn't being used for anything
in the final version of the patch.

Fix IndexBuildHeapScan so that it supports syncscan again; the patch
broke synchronous scanning for index builds by forcing rs_startblk
to zero even when the caller did not care about that and had asked
for syncscan.

Add some commentary and usage defenses to heap_setscanlimits().

Fix heapam so that asking for rs_numblocks == 0 does what you would
reasonably expect.  As coded it amounted to requesting a whole-table
scan, because those "--x <= 0" tests on an unsigned variable would
behave surprisingly.
2015-07-21 13:38:24 -04:00
Heikki Linnakangas eb11de8ff5 Sanity-check that a page zeroed by redo routine is marked with WILL_INIT.
There was already a sanity-check in the other direction: if a page was
marked with WILL_INIT, it had to be initialized by the redo routine. It's
not strictly necessary for correctness that a page is marked with WILL_INIT
if it's going to be initialized at redo, but it's a missed optimization if
nothing else.

Fix a few instances of this issue in SP-GiST, where a block in WAL record
was not marked with WILL_INIT, but was in fact always initialized at redo.
We were creating a full-page image of the page unnecessarily in those
cases.

Backpatch to 9.5, where the new WILL_INIT flag was added.
2015-07-20 22:34:01 +03:00
Alvaro Herrera 8d90736924 Improve BRIN documentation somewhat
This removes some info about support procedures being used, which was
obsoleted by commit db5f98ab4f, as well as add some more documentation
on how to create new opclasses using the Minmax infrastructure.
(Hopefully we can get something similar for Inclusion as well.)

In passing, fix some obsolete mentions of "mmtuples" in source code
comments.

Backpatch to 9.5, where BRIN was introduced.
2015-07-20 12:16:40 +02:00
Heikki Linnakangas 7261172430 Remove obsolete heap_formtuple/modifytuple/deformtuple functions.
These variants used the old-style 'n'/' ' NULL indicators. The new-style
functions have been available since version 8.1. That should be long enough
that if there is still any old external code using these functions, they
can just switch to the new functions without worrying about backwards
compatibility

Peter Geoghegan
2015-07-02 21:21:23 +03:00
Heikki Linnakangas f92d6a540a Use appendStringInfoString/Char et al where appropriate.
Patch by David Rowley. Backpatch to 9.5, as some of the calls were new in
9.5, and keeping the code in sync with master makes future backpatching
easier.
2015-07-02 12:36:03 +03:00
Fujii Masao 8217370864 Make XLogFileCopy() look the same as in 9.4.
XLogFileCopy() was changed heavily in commit de76884. However it was
partially reverted in commit 7abc685 and most of those changes to
XLogFileCopy() were no longer needed. Then commit 7cbee7c removed
those unnecessary code, but XLogFileCopy() looked different in master
and 9.4 though the contents are almost the same.

This patch makes XLogFileCopy() look the same in master and back-branches,
which makes back-patching easier, per discussion on pgsql-hackers.
Back-patch to 9.5.

Discussion: 55760844.7090703@iki.fi

Michael Paquier
2015-07-01 10:54:47 +09:00
Tom Lane 131926a52d Remove useless check for NULL subexpression.
Coverity rightly gripes that it's silly to have a test here when
the adjacent ExecEvalExpr() would choke on a NULL expression pointer.

Petr Jelinek
2015-06-30 12:53:54 -04:00
Heikki Linnakangas fdf28853ae Don't call PageGetSpecialPointer() on page until it's been initialized.
After calling XLogInitBufferForRedo(), the page might be all-zeros if it was
not in page cache already. btree_xlog_unlink_page initialized the page
correctly, but it called PageGetSpecialPointer before initializing it, which
would lead to a corrupt page at WAL replay, if the unlinked page is not in
page cache.

Backpatch to 9.4, the bug came with the rewrite of B-tree page deletion.
2015-06-30 13:41:30 +03:00
Heikki Linnakangas 47fe4d25d5 Initialize GIN metapage correctly when replaying metapage-update WAL record.
I broke this with my WAL format refactoring patch. Before that, the metapage
was read from disk, and modified in-place regardless of the LSN. That was
always a bit silly, as there's no need to read the old page version from
disk disk when we're overwriting it anyway. So that was changed in 9.5, but
I failed to add a GinInitPage call to initialize the page-headers correctly.
Usually you wouldn't notice, because the metapage is already in the page
cache and is not zeroed.

One way to reproduce this is to perform a VACUUM on an already vacuumed
table (so that the vacuum has no real work to do), immediately after a
checkpoint, and then perform an immediate shutdown. After recovery, the
page headers of the metapage will be incorrectly all-zeroes.

Reported by Jeff Janes
2015-06-30 00:06:00 +03:00
Heikki Linnakangas d661532e27 Also trigger restartpoints based on max_wal_size on standby.
When archive recovery and restartpoints were initially introduced,
checkpoint_segments was ignored on the grounds that the files restored from
archive don't consume any space in the recovery server. That was changed in
later releases, but even then it was arguably a feature rather than a bug,
as performing restartpoints as often as checkpoints during normal operation
might be excessive, but you might nevertheless not want to waste a lot of
space for pre-allocated WAL by setting checkpoint_segments to a high value.
But now that we have separate min_wal_size and max_wal_size settings, you
can bound WAL usage with max_wal_size, and still avoid consuming excessive
space usage by setting min_wal_size to a lower value, so that argument is
moot.

There are still some issues with actually limiting the space usage to
max_wal_size: restartpoints in recovery can only start after seeing the
checkpoint record, while a checkpoint starts flushing buffers as soon as
the redo-pointer is set. Restartpoint is paced to happen at the same
leisurily speed, determined by checkpoint_completion_target, as checkpoints,
but because they are started later, max_wal_size can be exceeded by upto
one checkpoint cycle's worth of WAL, depending on
checkpoint_completion_target. But that seems better than not trying at all,
and max_wal_size is a soft limit anyway.

The documentation already claimed that max_wal_size is obeyed in recovery,
so this just fixes the behaviour to match the docs. However, add some
weasel-words there to mention that max_wal_size may well be exceeded by
some amount in recovery.
2015-06-29 00:09:10 +03:00
Heikki Linnakangas a32c3ec893 Promote the assertion that XLogBeginInsert() is not called twice into ERROR.
Seems like cheap insurance for WAL bugs. A spurious call to
XLogBeginInsert() in itself would be fairly harmless, but if there is any
data registered and the insertion is not completed/cancelled properly, there
is a risk that the data ends up in a wrong WAL record.

Per Jeff Janes's suggestion.
2015-06-28 22:30:39 +03:00
Heikki Linnakangas a45c70acf3 Fix double-XLogBeginInsert call in GIN page splits.
If data checksums or wal_log_hints is on, and a GIN page is split, the code
to find a new, empty, block was called after having already called
XLogBeginInsert(). That causes an assertion failure or PANIC, if finding the
new block involves updating a FSM page that had not been modified since last
checkpoint, because that update is WAL-logged, which calls XLogBeginInsert
again. Nested XLogBeginInsert calls are not supported.

To fix, rearrange GIN code so that XLogBeginInsert is called later, after
finding the victim buffers.

Reported by Jeff Janes.
2015-06-28 22:16:21 +03:00
Simon Riggs 66fbcb0d2e Avoid hot standby cancels from VAC FREEZE
VACUUM FREEZE generated false cancelations of standby queries on an
otherwise idle master. Caused by an off-by-one error on cutoff_xid
which goes back to original commit.

Backpatch to all versions 9.0+

Analysis and report by Marco Nenciarini

Bug fix by Simon Riggs
2015-06-27 00:41:47 +01:00
Alvaro Herrera 4028222468 Fix BRIN xlog replay
There was a confusion about which block number to use when storing an
item's pointer in the revmap -- the revmap page's blkno was being used,
not the data page's blkno.

Spotted-by: Jeff Janes
2015-06-26 18:13:05 -03:00
Robert Haas 8f15f74a44 Be more conservative about removing tablespace "symlinks".
Don't apply rmtree(), which will gleefully remove an entire subtree,
and don't even apply unlink() unless it's symlink or a directory,
the only things that we expect to find.

Amit Kapila, with minor tweaks by me, per extensive discussions
involving Andrew Dunstan, Fujii Masao, and Heikki Linnakangas,
at least some of whom also reviewed the code.
2015-06-26 15:53:13 -04:00
Heikki Linnakangas 4b8e24b9ad Fix a couple of bugs with wal_log_hints.
1. Replay of the WAL record for setting a bit in the visibility map
contained an assertion that a full-page image of that record type can only
occur with checksums enabled. But it can also happen with wal_log_hints, so
remove the assertion. Unlike checksums, wal_log_hints can be changed on the
fly, so it would be complicated to figure out if it was enabled at the time
that the WAL record was generated.

2. wal_log_hints has the same effect on the locking needed to read the LSN
of a page as data checksums. BufferGetLSNAtomic() didn't get the memo.

Backpatch to 9.4, where wal_log_hints was added.
2015-06-26 12:38:24 +03:00
Andres Freund 667912aee6 Improve multixact emergency autovacuum logic.
Previously autovacuum was not necessarily triggered if space in the
members slru got tight. The first problem was that the signalling was
tied to values in the offsets slru, but members can advance much
faster. Thats especially a problem if old sessions had been around that
previously prevented the multixact horizon to increase. Secondly the
skipping logic doesn't work if the database was restarted after
autovacuum was triggered - that knowledge is not preserved across
restart. This is especially a problem because it's a common
panic-reaction to restart the database if it gets slow to
anti-wraparound vacuums.

Fix the first problem by separating the logic for members from
offsets. Trigger autovacuum whenever a multixact crosses a segment
boundary, as the current member offset increases in irregular values, so
we can't use a simple modulo logic as for offsets.  Add a stopgap for
the second problem, by signalling autovacuum whenver ERRORing out
because of boundaries.

Discussion: 20150608163707.GD20772@alap3.anarazel.de

Backpatch into 9.3, where it became more likely that multixacts wrap
around.
2015-06-21 18:57:28 +02:00
Andres Freund 90231cd518 Add missing check for wal_debug GUC.
9a20a9b2 added a new elog(), enabled when WAL_DEBUG is defined. The
other WAL_DEBUG dependant messages check for the wal_debug GUC, but this
one did not. While at it replace 'upto' with 'up to'.

Discussion: 20150610110253.GF3832@alap3.anarazel.de

Backpatch to 9.4, the first release containing 9a20a9b2.
2015-06-21 18:37:09 +02:00
Robert Haas ed16f73c57 Fix corner case in autovacuum-forcing logic for multixact wraparound.
Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(),
and that function looks only at the on-disk state, it's possible for it
to fail to find a page that exists in the in-memory SLRU that has not
been written yet.  If that happens, SetOffsetVacuumLimit() will
erroneously decide to force emergency autovacuuming immediately.

We should probably fix find_multixact_start() to consider the data
cached in memory as well as on the on-disk state, but that's no excuse
for SetOffsetVacuumLimit() to be stupid about the case where it can
no longer read the value after having previously succeeded in doing so.

Report by Andres Freund.
2015-06-19 11:28:30 -04:00
Alvaro Herrera 94232c909d Fix typos
tablesapce -> tablespace
there -> their

These were introduced in 72d422a52, so no need to backpatch.
2015-06-08 15:37:42 -03:00
Fujii Masao 7abc685974 Refactor WAL segment copying code.
* Remove unused argument "dstfname" and related code from XLogFileCopy().

* Previously XLogFileCopy() returned a pstrdup'd string so that
InstallXLogFileSegment() used it later. Since the pstrdup'd string was never
free'd, there could be a risk of memory leak. It was almost harmless because
the startup process exited just after calling XLogFileCopy(), it existed.
This commit changes XLogFileCopy() so that it directly calls
InstallXLogFileSegment() and doesn't call pstrdup() at all. Which fixes that
memory leak problem.

* Extend InstallXLogFileSegment() so that the caller can specify the log level.
Which allows us to emit an error when InstallXLogFileSegment() fails a disk
file access like link() and rename(). Previously it was always logged with
LOG level and additionally needed to be logged with ERROR when we wanted
to treat it as an error.

Michael Paquier
2015-06-09 03:03:24 +09:00
Andres Freund d1b958218a Allow HotStandbyActiveInReplay() to be called in single user mode.
HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL
replay to happen in the startup process, missing the single user case.

This buglet is fairly harmless as it only causes problems when single
user mode in an assertion enabled build is used to replay a btree vacuum
record.

Backpatch to 9.2. 061b079f was backpatched further, but the assertion
was not.
2015-06-08 14:09:27 +02:00
Robert Haas 068cfadf9e Cope with possible failure of the oldest MultiXact to exist.
Recent commits, mainly b69bf30b9b and
53bb309d2d, introduced mechanisms to
protect against wraparound of the MultiXact member space: the number
of multixacts that can exist at one time is limited to 2^32, but the
total number of members in those multixacts is also limited to 2^32,
and older code did not take care to enforce the second limit,
potentially allowing old data to be overwritten while it was still
needed.

Unfortunately, these new mechanisms failed to account for the fact
that the code paths in which they run might be executed during
recovery or while the cluster was in an inconsistent state.  Also,
they failed to account for the fact that users who used pg_upgrade
to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have
might oldestMultiXid = 1 in the control file despite the true value
being larger.

To fix these problems, first, avoid unnecessarily examining the
mmembers of MultiXacts when the cluster is not known to be consistent.
TruncateMultiXact has done this for a long time, and this patch does
not fix that.  But the new calls used to prevent member wraparound
are not needed until we reach normal running, so avoid calling them
earlier.  (SetMultiXactIdLimit is actually called before InRecovery
is set, so we can't rely on that; we invent our own multixact-specific
flag instead.)

Second, make failure to look up the members of a MultiXact a non-fatal
error.  Instead, if we're unable to determine the member offset at
which wraparound would occur, postpone arming the member wraparound
defenses until we are able to do so.  If we're unable to determine the
member offset that should force autovacuum, force it continuously
until we are able to do so.  If we're unable to deterine the member
offset at which we should truncate the members SLRU, log a message and
skip truncation.

An important consequence of these changes is that anyone who does have
a bogus oldestMultiXid = 1 value in pg_control will experience
immediate emergency autovacuuming when upgrading to a release that
contains this fix.  The release notes should highlight this fact.  If
a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1
in the control file, they may wish to vacuum any tables with
relminmxid = 1 prior to upgrading in order to avoid an immediate
emergency autovacuum after the upgrade.  This must be done with a
PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age
and vacuum_multixact_freeze_table_age set to 0.

This patch also adds an additional log message at each database server
startup, indicating either that protections against member wraparound
have been engaged, or that they have not.  In the latter case, once
autovacuum has advanced oldestMultiXid to a sane value, the message
indicating that the guards have been engaged will appear at the next
checkpoint.  A few additional messages have also been added at the DEBUG1
level so that the correct operation of this code can be properly audited.

Along the way, this patch fixes another, related bug in TruncateMultiXact
that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at
all, the truncation code looks up NextMultiXactId, which doesn't exist
yet.  This can lead to TruncateMultiXact removing every file in
pg_multixact/offsets instead of keeping one around, as it should.
This in turn will cause the database server to refuse to start
afterwards.

Patch by me.  Review by Álvaro Herrera, Andres Freund, Noah Misch, and
Thomas Munro.
2015-06-05 09:31:57 -04:00
Tom Lane d8179b001a Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f337 introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.

After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/.  The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.

This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.

I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().

The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.

Back-patch to all active branches, like the prior commit.

Abhijit Menon-Sen and Tom Lane
2015-05-28 17:33:03 -04:00
Tom Lane 79f2b5d583 Fix valgrind's "unaddressable bytes" whining about BRIN code.
brin_form_tuple calculated an exact tuple size, then palloc'd and
filled just that much.  Later, brin_doinsert or brin_doupdate would
MAXALIGN the tuple size and tell PageAddItem that that was the size
of the tuple to insert.  If the original tuple size wasn't a multiple
of MAXALIGN, the net result would be that PageAddItem would memcpy
a few more bytes than the palloc request had been for.

AFAICS, this is totally harmless in the real world: the error is a
read overrun not a write overrun, and palloc would certainly have
rounded the request up to a MAXALIGN multiple internally, so there's
no chance of the memcpy fetching off the end of memory.  Valgrind,
however, is picky to the byte level not the MAXALIGN level.

Fix it by pushing the MAXALIGN step back to brin_form_tuple.  (The other
possible source of tuples in this code, brin_form_placeholder_tuple,
was already producing a MAXALIGN'd result.)

In passing, be a bit more paranoid about internal allocations in
brin_form_tuple.
2015-05-25 21:56:19 -04:00
Alvaro Herrera cdbdc43827 Update README.tuplock
Multixact truncation is now handled differently, and this file hadn't
gotten the memo.

Per note from Amit Langote.  I didn't use his patch, though.

Also update the description of infomask bits, which weren't completely up
to date either.  This commit also propagates b01a4f6838 back to 9.3 and
9.4, which apparently I failed to do back then.
2015-05-25 15:09:05 -03:00
Tom Lane 2aa0476dc3 Manual cleanup of pgindent results.
Fix some places where pgindent did silly stuff, often because project
style wasn't followed to begin with.  (I've not touched the atomics
headers, though.)
2015-05-24 15:04:10 -04:00
Bruce Momjian 807b9e0dff pgindent run for 9.5 2015-05-23 21:35:49 -04:00
Tom Lane 72809480d6 Fix incorrect snprintf() limit.
Typo in commit 7cbee7c0a.  No practical effect since the buffer should
never actually be overrun, but various compilers and static analyzers will
whine about it.

Petr Jelinek
2015-05-23 16:05:52 -04:00
Tom Lane 821b821a24 Still more fixes for lossy-GiST-distance-functions patch.
Fix confusion in documentation, substantial memory leakage if float8 or
float4 are pass-by-reference, and assorted comments that were obsoleted
by commit 98edd617f3.
2015-05-23 15:22:25 -04:00
Heikki Linnakangas 7cbee7c0a1 At promotion, don't leave behind a partial segment on the old timeline.
With commit de768844, a copy of the partial segment was archived with the
.partial suffix, but the original file was still left in pg_xlog, so it
didn't actually solve the problems with archiving the partial segment that
it was supposed to solve. With this patch, the partial segment is renamed
rather than copied, so we only archive it with the .partial suffix.

Also be more robust in detecting if the last segment is already being
archived. Previously I used XLogArchiveIsBusy() for that, but that's not
quite right. With archive_mode='always', there might be a .ready file for
it, and we don't want to rename it to .partial in that case.

The old segment is needed until we're fully committed to the new timeline,
i.e. until we've written the end-of-recovery WAL record and updated the
min recovery point and timeline in the control file. So move the renaming
later in the startup sequence, after all that's been done.
2015-05-22 11:04:33 +03:00
Fujii Masao 85d0e661aa Make recovery_target_action = pause work.
Previously even if recovery_target_action was set to pause and
the recovery target was reached, the recovery could never be paused.
Because the setting of pause was *always* overridden with that of
shutdown unexpectedly. This override is valid and intentional
if hot_standby is not enabled because there is no way to resume
the paused recovery in this case and the setting of pause is
completely useless. But not if hot_standby is enabled.

This patch changes the code so that the setting of pause is overridden
with that of shutdown only when hot_standby is not enabled.

Bug reported by Andres Freund
2015-05-21 13:56:17 +09:00
Heikki Linnakangas fa60fb63e5 Fix more typos in comments.
Patch by CharSyam, plus a few more I spotted with grep.
2015-05-20 19:45:43 +03:00
Heikki Linnakangas 4fc72cc7bb Collection of typo fixes.
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.

Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.

Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
2015-05-20 16:56:22 +03:00