Commit Graph

7588 Commits

Author SHA1 Message Date
Peter Eisentraut 6fc547960d Rename C symbols for backend lo_ functions
Rename the C symbols for lo_* to be_lo_*, so they don't conflict with
libpq prototypes.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2017-01-17 12:35:30 -05:00
Peter Eisentraut 30b9a4495a Remove unnecessary include
Between 6eeb95f0f5 and
7b1c2a0f20, builtins.h contained
additional prototypes that have now been moved elsewhere, so we don't
need to include nodes/parsenodes.h anymore.

Fix some files that were relying on builtins.h implicitly pulling in
some unrelated stuff they needed.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2017-01-17 12:35:19 -05:00
Tom Lane d43a619c60 Fix check_srf_call_placement() to handle VALUES cases correctly.
INSERT ... VALUES with a single VALUES row is implemented quite differently
from the general VALUES case.  A user-visible implication of that is that
we accept SRFs in the single-row case, but not in the multi-row case.
That's a historical artifact no doubt, but in view of the lack of field
complaints, I'm not excited about fixing it right now.

However, check_srf_call_placement() needs to know about this, first because
it should throw an error in the unsupported case, and second because it
should set p_hasTargetSRFs in the single-row case (because we treat that
like a SELECT tlist).  That's an oversight in commit a4c35ea1c.

To fix, split EXPR_KIND_VALUES into two values.  So far as I can see,
this is the only place where we need to distinguish the two cases at
present; but there might be more later.

Patch by me, per report from Andres Freund.

Discussion: https://postgr.es/m/20170116081548.zg63zltblwimpfgp@alap3.anarazel.de
2017-01-16 15:23:11 -05:00
Tom Lane 0777f7a2e8 Fix matching of boolean index columns to sort ordering.
Normally, if we have a WHERE clause like "indexcol = constant",
the planner will figure out that that index column can be ignored
when determining whether the index has a desired sort ordering.
But this failed to work for boolean index columns, because a
condition like "boolcol = true" is canonicalized to just "boolcol"
which does not give rise to an EquivalenceClass.  Add a check to
allow the same type of deduction to be made in this case too.

Per a complaint from Dima Pavlov.  Arguably this is a bug, but given the
limited impact and the small number of complaints so far, I won't risk
destabilizing plans in stable branches by back-patching.

Patch by me, reviewed by Michael Paquier

Discussion: https://postgr.es/m/1788.1481605684@sss.pgh.pa.us
2017-01-15 14:09:35 -05:00
Tom Lane ab1f0c8225 Change representation of statement lists, and add statement location info.
This patch makes several changes that improve the consistency of
representation of lists of statements.  It's always been the case
that the output of parse analysis is a list of Query nodes, whatever
the types of the individual statements in the list.  This patch brings
similar consistency to the outputs of raw parsing and planning steps:

* The output of raw parsing is now always a list of RawStmt nodes;
the statement-type-dependent nodes are one level down from that.

* The output of pg_plan_queries() is now always a list of PlannedStmt
nodes, even for utility statements.  In the case of a utility statement,
"planning" just consists of wrapping a CMD_UTILITY PlannedStmt around
the utility node.  This list representation is now used in Portal and
CachedPlan plan lists, replacing the former convention of intermixing
PlannedStmts with bare utility-statement nodes.

Now, every list of statements has a consistent head-node type depending
on how far along it is in processing.  This allows changing many places
that formerly used generic "Node *" pointers to use a more specific
pointer type, thus reducing the number of IsA() tests and casts needed,
as well as improving code clarity.

Also, the post-parse-analysis representation of DECLARE CURSOR is changed
so that it looks more like EXPLAIN, PREPARE, etc.  That is, the contained
SELECT remains a child of the DeclareCursorStmt rather than getting flipped
around to be the other way.  It's now true for both Query and PlannedStmt
that utilityStmt is non-null if and only if commandType is CMD_UTILITY.
That allows simplifying a lot of places that were testing both fields.
(I think some of those were just defensive programming, but in many places,
it was actually necessary to avoid confusing DECLARE CURSOR with SELECT.)

Because PlannedStmt carries a canSetTag field, we're also able to get rid
of some ad-hoc rules about how to reconstruct canSetTag for a bare utility
statement; specifically, the assumption that a utility is canSetTag if and
only if it's the only one in its list.  While I see no near-term need for
relaxing that restriction, it's nice to get rid of the ad-hocery.

The API of ProcessUtility() is changed so that what it's passed is the
wrapper PlannedStmt not just the bare utility statement.  This will affect
all users of ProcessUtility_hook, but the changes are pretty trivial; see
the affected contrib modules for examples of the minimum change needed.
(Most compilers should give pointer-type-mismatch warnings for uncorrected
code.)

There's also a change in the API of ExplainOneQuery_hook, to pass through
cursorOptions instead of expecting hook functions to know what to pick.
This is needed because of the DECLARE CURSOR changes, but really should
have been done in 9.6; it's unlikely that any extant hook functions
know about using CURSOR_OPT_PARALLEL_OK.

Finally, teach gram.y to save statement boundary locations in RawStmt
nodes, and pass those through to Query and PlannedStmt nodes.  This allows
more intelligent handling of cases where a source query string contains
multiple statements.  This patch doesn't actually do anything with the
information, but a follow-on patch will.  (Passing this information through
cleanly is the true motivation for these changes; while I think this is all
good cleanup, it's unlikely we'd have bothered without this end goal.)

catversion bump because addition of location fields to struct Query
affects stored rules.

This patch is by me, but it owes a good deal to Fabien Coelho who did
a lot of preliminary work on the problem, and also reviewed the patch.

Discussion: https://postgr.es/m/alpine.DEB.2.20.1612200926310.29821@lancre
2017-01-14 16:02:35 -05:00
Robert Haas 0563a3a8b5 Fix a bug in how we generate partition constraints.
Move the code for doing parent attnos to child attnos mapping for Vars
in partition constraint expressions to a separate function
map_partition_varattnos() and call it from the appropriate places.
Doing it in get_qual_from_partbound(), as is now, would produce wrong
result in certain multi-level partitioning cases, because it only
considers the current pair of parent-child relations.  In certain
multi-level partitioning cases, attnums for the same key attribute(s)
might differ between various levels causing the same attribute to be
numbered differently in different instances of the Var corresponding
to a given attribute.

With this commit, in generate_partition_qual(), we first generate the
the whole partition constraint (considering all levels of partitioning)
and then do the mapping, so that Vars in the final expression are
numbered according the leaf relation (to which it is supposed to apply).

Amit Langote, reviewed by me.
2017-01-13 14:04:35 -05:00
Tom Lane e2117e4ab0 Fix field order in struct catcache.
Somebody failed to grasp the point of having the #ifdef CATCACHE_STATS
fields at the end of the struct.  Put that back the way it should be,
and add a comment making it more explicit why it should be that way.
2017-01-12 18:59:57 -05:00
Alvaro Herrera 3957b58b88 Fix ALTER TABLE / SET TYPE for irregular inheritance
If inherited tables don't have exactly the same schema, the USING clause
in an ALTER TABLE / SET DATA TYPE misbehaves when applied to the
children tables since commit 9550e8348b.  Starting with that commit,
the attribute numbers in the USING expression are fixed during parse
analysis.  This can lead to bogus errors being reported during
execution, such as:
   ERROR:  attribute 2 has wrong type
   DETAIL:  Table has type smallint, but query expects integer.

Since it wouldn't do to revert to the original coding, we now apply a
transformation to map the attribute numbers to the correct ones for each
child.

Reported by Justin Pryzby
Analysis by Tom Lane; patch by me.
Discussion: https://postgr.es/m/20170102225618.GA10071@telsasoft.com
2017-01-09 19:26:58 -03:00
Tom Lane 7c3abe3c92 Get rid of ParseState.p_value_substitute; use a columnref hook instead.
I noticed that p_value_substitute, which is a single-purpose kluge I added
in 2002 (commit b0422b215), could be replaced by having domainAddConstraint
install a parser hook that looks for the name "value".  The parser hook
code only dates back to 2009, so it's not surprising that we had to kluge
this in 2002, but we can do it more cleanly now.
2017-01-07 16:02:16 -05:00
Tom Lane 3c40594e6e Improve documentation of struct ParseState.
I got annoyed about how some fields of ParseState were documented in the
struct's block comment and some weren't; not all of the latter are trivial.
Fix that.  Also reorder a couple of fields that seem to have been placed
rather randomly, or maybe with an idea of avoiding padding space; but there
are never so many ParseStates in existence at one time that we ought to
value pad space over readability.
2017-01-07 15:34:28 -05:00
Robert Haas 175ff6598e Fix possible crash reading pg_stat_activity.
With the old code, a backend that read pg_stat_activity without ever
having executed a parallel query might see a backend in the midst of
executing one waiting on a DSA LWLock, resulting in a crash.  The
solution is for backends to register the tranche at startup time, not
the first time a parallel query is executed.

Report by Andreas Seltenreich.  Patch by me, reviewed by Thomas Munro.
2017-01-05 12:27:09 -05:00
Robert Haas 18fc5192a6 Remove unnecessary arguments from partitioning functions.
RelationGetPartitionQual() and generate_partition_qual() are always
called with recurse = true, so we don't need an argument for that.

Extracted by me from a larger patch by Amit Langote.
2017-01-04 14:56:37 -05:00
Robert Haas f1b4c771ea Fix reporting of constraint violations for table partitioning.
After a tuple is routed to a partition, it has been converted from the
root table's row type to the partition's row type.  ExecConstraints
needs to report the failure using the original tuple and the parent's
tuple descriptor rather than the ones for the selected partition.

Amit Langote
2017-01-04 14:36:34 -05:00
Tom Lane 698127a4a9 Prefer int-wide pg_atomic_flag over char-wide when using gcc intrinsics.
configure can only probe the existence of gcc intrinsics, not how well
they're implemented, and unfortunately the answer is sometimes "badly".
In particular we've found that multiple compilers fail to implement
char-width __sync_lock_test_and_set() correctly on PPC; and even a correct
implementation would necessarily be pretty inefficient, since that hardware
has only a word-wide primitive to work with.

Given the knowledge we've accumulated in s_lock.h, it appears that it's
best to rely on int-width TAS operations on most non-Intel architectures.
Hence, pick int not char when both are nominally available to us in
generic-gcc.h (note that that code is not used for x86[_64]).

Back-patch to fix regression test failures on FreeBSD/PPC.  Ordinarily
back-patching a change like this would be verboten because of ABI breakage.
But since pg_atomic_flag is not yet used in any Postgres data structure,
there's no ABI to break.  It seems safer to back-patch to avoid possible
gotchas, if someday we do back-patch something that uses pg_atomic_flag.

Discussion: https://postgr.es/m/25414.1483076673@sss.pgh.pa.us
2017-01-04 13:36:55 -05:00
Robert Haas 345b2dcf07 Move partition_tuple_slot out of EState.
Commit 2ac3ef7a01 added a TupleTapleSlot
for partition tuple slot to EState (es_partition_tuple_slot) but it's
more logical to have it as part of ModifyTableState
(mt_partition_tuple_slot) and CopyState (partition_tuple_slot).

Discussion: http://postgr.es/m/1bd459d9-4c0c-197a-346e-e5e59e217d97@lab.ntt.co.jp

Amit Langote, per a gripe from me
2017-01-04 13:16:59 -05:00
Tom Lane 6667d9a6d7 Re-allow SSL passphrase prompt at server start, but not thereafter.
Leave OpenSSL's default passphrase collection callback in place during
the first call of secure_initialize() in server startup.  Although that
doesn't work terribly well in daemon contexts, some people feel we should
not break it for anyone who was successfully using it before.  We still
block passphrase demands during SIGHUP, meaning that you can't adjust SSL
configuration on-the-fly if you used a passphrase, but this is no worse
than what it was before commit de41869b6.  And we block passphrase demands
during EXEC_BACKEND reloads; that behavior wasn't useful either, but at
least now it's documented.

Tweak some related log messages for more readability, and avoid issuing
essentially duplicate messages about reload failure caused by a passphrase.

Discussion: https://postgr.es/m/29982.1483412575@sss.pgh.pa.us
2017-01-04 12:44:03 -05:00
Robert Haas 0fad355bec Update obsolete comments in lwlock.h.
The typical size of an LWLock is now 16 bytes even on 64-bit platforms,
and the size of slock_t is now irrelevant.  But pg_atomic_uint32 can
(perhaps surprisingly) still be larger than 4 bytes, so there's still
some marginal point to allowing LWLOCK_MINIMAL_SIZE == 64.

Commit 008608b9d5 made the changes
that led to the need for these updates.
2017-01-04 12:03:40 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane de41869b64 Allow SSL configuration to be updated at SIGHUP.
It is no longer necessary to restart the server to enable, disable,
or reconfigure SSL.  Instead, we just create a new SSL_CTX struct
(by re-reading all relevant files) whenever we get SIGHUP.  Testing
shows that this is fast enough that it shouldn't be a problem.

In conjunction with that, downgrade the logic that complains about
pg_hba.conf "hostssl" lines when SSL isn't active: now that's just
a warning condition not an error.

An issue that still needs to be addressed is what shall we do with
passphrase-protected server keys?  As this stands, the server would
demand the passphrase again on every SIGHUP, which is certainly
impractical.  But the case was only barely supported before, so that
does not seem a sufficient reason to hold up committing this patch.

Andreas Karlsson, reviewed by Michael Banck and Michael Paquier

Discussion: https://postgr.es/m/556A6E8A.9030400@proxel.se
2017-01-02 21:37:12 -05:00
Tom Lane 1d63f7d2d1 Use clock_gettime(), if available, in instr_time measurements.
The advantage of clock_gettime() is that the API allows the result to
be precise to nanoseconds, not just microseconds as in gettimeofday().
Now that it's routinely possible to do tens of plan node executions
in 1us, we really need more precision than gettimeofday() can offer
for EXPLAIN ANALYZE to accumulate statistics with.

Some research shows that clock_gettime() is available on pretty nearly
every modern Unix-ish platform, and as far as I have been able to test,
it has about the same execution time as gettimeofday(), so there's no
loss in switching over.  (By the same token, this doesn't do anything
to fix the fact that we really wish clock readings were faster.  But
there's enough win here to justify changing anyway.)

A small side benefit is that on most platforms, we can use CLOCK_MONOTONIC
instead of CLOCK_REALTIME and thereby render EXPLAIN impervious to
concurrent resets of the system clock.  (This means that code must not
assume that the contents of struct instr_time have any well-defined
interpretation as timestamps, but really that was true before.)

Some platforms offer nonstandard clock IDs that might be of interest.
This patch knows we should use CLOCK_MONOTONIC_RAW on macOS, because it
provides more precision and is faster to read than their CLOCK_MONOTONIC.
If there turn out to be many more cases where we need special rules, it
might be appropriate to handle the selection of clock ID in configure,
but for the moment that doesn't seem worth the trouble.

Discussion: https://postgr.es/m/31856.1400021891@sss.pgh.pa.us
2017-01-02 13:41:51 -05:00
Tom Lane 80a7298b9e Remove manual breaks in NodeTag assignments to fix duplicate tag numbers.
Commit f0e44751d added new node tags at a place in the tag numbering
where there was no daylight left before the next hard-coded number,
resulting in some duplicate tag assignments.  This doesn't seem to have
caused any big problem so far, but it's surely trouble waiting to happen.

We could adjust the manually assigned breakpoints to make more room,
but that just leaves the same hazard waiting to strike again in future.
What seems like a better idea is to get rid of the manual assignments
and leave NodeTags to be automatically assigned, consecutively from one
on up.  This means that any change in the tag list forces a backend-wide
recompile, but realistically that's usually needed anyway.

Discussion: https://postgr.es/m/29670.1482942811@sss.pgh.pa.us
2016-12-29 16:57:41 -05:00
Peter Eisentraut 27866bd1e8 Expand ad-hoc unit abbreviations in function descriptions
There is no need to use abbreviations here, so just write it out for
consistency.
2016-12-29 11:15:01 -05:00
Peter Eisentraut 2e254130d1 Make more use of RoleSpec struct
Most code was casting this through a generic Node.  By declaring
everything as RoleSpec appropriately, we can remove a bunch of casts and
ad-hoc node type checking.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
2016-12-29 10:49:39 -05:00
Tom Lane fe591f8bf6 Replace enum InhOption with simple boolean.
Now that it has only INH_NO and INH_YES values, it's just weird that
it's not a plain bool, so make it that way.

Also rename RangeVar.inhOpt to "inh", to be like RangeTblEntry.inh.
My recollection is that we gave it a different name specifically because
it had a different representation than the derived bool value, but it
no longer does.  And this is a good forcing function to be sure we
catch any places that are affected by the change.

Bump catversion because of possible effect on stored RangeVar nodes.
I'm not exactly convinced that we ever store RangeVar on disk, but
we have a readfuncs function for it, so be cautious.  (If we do do so,
then commit e13486eba was in error not to bump catversion.)

Follow-on to commit e13486eba.

Discussion: http://postgr.es/m/CA+TgmoYe+EG7LdYX6pkcNxr4ygkP4+A=jm9o-CPXyOvRiCNwaQ@mail.gmail.com
2016-12-23 13:35:18 -05:00
Robert Haas e13486eba0 Remove sql_inheritance GUC.
This backward-compatibility GUC is long overdue for removal.

Discussion: http://postgr.es/m/CA+TgmoYe+EG7LdYX6pkcNxr4ygkP4+A=jm9o-CPXyOvRiCNwaQ@mail.gmail.com
2016-12-23 07:35:01 -05:00
Robert Haas 7819ba1ef6 Remove _hash_chgbufaccess().
This is basically for the same reasons I got rid of _hash_wrtbuf()
in commit 25216c98938495fd741bf585dcbef45b3a9ffd40: it's not
convenient to have a function which encapsulates MarkBufferDirty(),
especially as we move towards having hash indexes be WAL-logged.

Patch by me, reviewed (but not entirely endorsed) by Amit Kapila.
2016-12-23 07:14:37 -05:00
Robert Haas 2ac3ef7a01 Fix tuple routing in cases where tuple descriptors don't match.
The previous coding failed to work correctly when we have a
multi-level partitioned hierarchy where tables at successive levels
have different attribute numbers for the partition key attributes.  To
fix, have each PartitionDispatch object store a standalone
TupleTableSlot initialized with the TupleDesc of the corresponding
partitioned table, along with a TupleConversionMap to map tuples from
the its parent's rowtype to own rowtype.  After tuple routing chooses
a leaf partition, we must use the leaf partition's tuple descriptor,
not the root table's.  To that end, a dedicated TupleTableSlot for
tuple routing is now allocated in EState.

Amit Langote
2016-12-22 17:36:37 -05:00
Tom Lane cd1b215692 Fix handling of expanded objects in CoerceToDomain and CASE execution.
When the input value to a CoerceToDomain expression node is a read-write
expanded datum, we should pass a read-only pointer to any domain CHECK
expressions and then return the original read-write pointer as the
expression result.  Previously we were blindly passing the same pointer to
all the consumers of the value, making it possible for a function in CHECK
to modify or even delete the expanded value.  (Since a plpgsql function
will absorb a passed-in read-write expanded array as a local variable
value, it will in fact delete the value on exit.)

A similar hazard of passing the same read-write pointer to multiple
consumers exists in domain_check() and in ExecEvalCase, so fix those too.

The fix requires adding MakeExpandedObjectReadOnly calls at the appropriate
places, which is simple enough except that we need to get the data type's
typlen from somewhere.  For the domain cases, solve this by redefining
DomainConstraintRef.tcache as okay for callers to access; there wasn't any
reason for the original convention against that, other than not wanting the
API of typcache.c to be any wider than it had to be.  For CASE, there's
no good solution except to add a syscache lookup during executor start.

Per bug #14472 from Marcos Castedo.  Back-patch to 9.5 where expanded
values were introduced.

Discussion: https://postgr.es/m/15225.1482431619@sss.pgh.pa.us
2016-12-22 15:01:37 -05:00
Andres Freund 6ef2eba3f5 Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
    https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
    https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 11:31:50 -08:00
Heikki Linnakangas 01ec25631f Simplify tape block format.
No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.

Reviewed by Peter Geoghegan and Robert Haas.

Discussion: https://www.postgresql.org/message-id/34678beb-938e-646e-db9f-a7def5c44ada%40iki.fi
2016-12-22 18:45:00 +02:00
Tom Lane 89fcea1ace Fix strange behavior (and possible crashes) in full text phrase search.
In an attempt to simplify the tsquery matching engine, the original
phrase search patch invented rewrite rules that would rearrange a
tsquery so that no AND/OR/NOT operator appeared below a PHRASE operator.
But this approach had numerous problems.  The rearrangement step was
missed by ts_rewrite (and perhaps other places), allowing tsqueries
to be created that would cause Assert failures or perhaps crashes at
execution, as reported by Andreas Seltenreich.  The rewrite rules
effectively defined semantics for operators underneath PHRASE that were
buggy, or at least unintuitive.  And because rewriting was done in
tsqueryin() rather than at execution, the rearrangement was user-visible,
which is not very desirable --- for example, it might cause unexpected
matches or failures to match in ts_rewrite.

As a somewhat independent problem, the behavior of nested PHRASE operators
was only sane for left-deep trees; queries like "x <-> (y <-> z)" did not
behave intuitively at all.

To fix, get rid of the rewrite logic altogether, and instead teach the
tsquery execution engine to manage AND/OR/NOT below a PHRASE operator
by explicitly computing the match location(s) and match widths for these
operators.

This requires introducing some additional fields into the publicly visible
ExecPhraseData struct; but since there's no way for third-party code to
pass such a struct to TS_phrase_execute, it shouldn't create an ABI problem
as long as we don't move the offsets of the existing fields.

Another related problem was that index searches supposed that "!x <-> y"
could be lossily approximated as "!x & y", which isn't correct because
the latter will reject, say, "x q y" which the query itself accepts.
This required some tweaking in TS_execute_ternary along with the main
tsquery engine.

Back-patch to 9.6 where phrase operators were introduced.  While this
could be argued to change behavior more than we'd like in a stable branch,
we have to do something about the crash hazards and index-vs-seqscan
inconsistency, and it doesn't seem desirable to let the unintuitive
behaviors induced by the rewriting implementation stand as precedent.

Discussion: https://postgr.es/m/28215.1481999808@sss.pgh.pa.us
Discussion: https://postgr.es/m/26706.1482087250@sss.pgh.pa.us
2016-12-21 15:18:39 -05:00
Robert Haas 1fc5c49450 Refactor partition tuple routing code to reduce duplication.
Amit Langote
2016-12-21 11:36:10 -05:00
Robert Haas 3b790d256f Fix corner-case bug in WaitEventSetWaitBlock on Windows.
If we do not reset the FD_READ event, WaitForMultipleObjects won't
return it again again unless we've meanwhile read from the socket,
which is generally true but not guaranteed.  WaitEventSetWaitBlock
itself may fail to return the event to the caller if the latch is
also set, and even if we changed that, the caller isn't obliged to
handle all returned events at once.  On non-Windows systems, the
socket-read event is purely level-triggered, so this issue does
not exist.  To fix, make Windows reset the event when needed.

This bug was introduced by 98a64d0bd7,
and causes hangs when trying to use the pldebugger extension.

Patch by Amit Kapial.  Reported and tested by Ashutosh Sharma, who
also provided some analysis.  Further analysis by Michael Paquier.
2016-12-21 11:01:48 -05:00
Peter Eisentraut f3b421da5f Reorder pg_sequence columns to avoid alignment issue
On AIX, doubles are aligned at 4 bytes, but int64 is aligned at 8 bytes.
Our code assumes that doubles have alignment that can also be applied to
int64, but that fails in this case.  One effect is that
heap_form_tuple() writes tuples in a different layout than
Form_pg_sequence expects.

Rather than rewrite the whole alignment code, work around the issue by
reordering the columns in pg_sequence so that the first int64 column
naturally comes out at an 8-byte boundary.
2016-12-21 09:06:49 -05:00
Peter Eisentraut 1753b1b027 Add pg_sequence system catalog
Move sequence metadata (start, increment, etc.) into a proper system
catalog instead of storing it in the sequence heap object.  This
separates the metadata from the sequence data.  Sequence metadata is now
operated on transactionally by DDL commands, whereas previously
rollbacks of sequence-related DDL commands would be ignored.

Reviewed-by: Andreas Karlsson <andreas@proxel.se>
2016-12-20 08:28:18 -05:00
Robert Haas 7cd0fd655d Invalid parent's relcache after CREATE TABLE .. PARTITION OF.
Otherwise, subsequent commands in the same transaction see the wrong
partition descriptor.

Amit Langote.  Reported by Tomas Vondra and David Fetter.  Reviewed
by me.

Discussion: http://postgr.es/m/22dd313b-d7fd-22b5-0787-654845c8f849%402ndquadrant.com
Discussion: http://postgr.es/m/20161215090916.GB20659%40fetter.org
2016-12-19 22:53:30 -05:00
Robert Haas e13029a5ce Provide a DSA area for all parallel queries.
This will allow future parallel query code to dynamically allocate
storage shared by all participants.

Thomas Munro, with assorted changes by me.
2016-12-19 17:11:46 -05:00
Robert Haas dd728826c5 Fix locking problem in _hash_squeezebucket() / _hash_freeovflpage().
A bucket squeeze operation needs to lock each page of the bucket
before releasing the prior page, but the previous coding fumbled the
locking when freeing an overflow page during a bucket squeeze
operation.  Commit 6d46f4783e
introduced this bug.

Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after
an initial trouble report by Jeff Janes.  Reviewed by me.  I also
fixed a problem with a comment.
2016-12-19 12:31:50 -05:00
Robert Haas 668dbbec27 Remove unused file.
This was added in 1054097464, but has
never been used for anything as far as I can tell.  There seems to
be no reason to keep it.
2016-12-19 11:29:31 -05:00
Fujii Masao 3901fd70cc Support quorum-based synchronous replication.
This feature is also known as "quorum commit" especially in discussion
on pgsql-hackers.

This commit adds the following new syntaxes into synchronous_standby_names
GUC. By using FIRST and ANY keywords, users can specify the method to
choose synchronous standbys from the listed servers.

  FIRST num_sync (standby_name [, ...])
  ANY num_sync (standby_name [, ...])

The keyword FIRST specifies a priority-based synchronous replication
which was available also in 9.6 or before. This method makes transaction
commits wait until their WAL records are replicated to num_sync
synchronous standbys chosen based on their priorities.

The keyword ANY specifies a quorum-based synchronous replication
and makes transaction commits wait until their WAL records are
replicated to *at least* num_sync listed standbys. In this method,
the values of sync_state.pg_stat_replication for the listed standbys
are reported as "quorum". The priority is still assigned to each standby,
but not used in this method.

The existing syntaxes having neither FIRST nor ANY keyword are still
supported. They are the same as new syntax with FIRST keyword, i.e.,
a priorirty-based synchronous replication.

Author: Masahiko Sawada
Reviewed-By: Michael Paquier, Amit Kapila and me
Discussion: <CAD21AoAACi9NeC_ecm+Vahm+MMA6nYh=Kqs3KB3np+MBOS_gZg@mail.gmail.com>

Many thanks to the various individuals who were involved in
discussing and developing this feature.
2016-12-19 21:15:30 +09:00
Tom Lane 23c75b55aa Improve documentation around TS_execute().
I got frustrated by the lack of commentary in this area, so here is some
reverse-engineered documentation, along with minor stylistic cleanup.
No code changes more significant than removal of unused variables.

Back-patch to 9.6, not because that's useful in itself, but because
we have some bugs to fix in phrase search and this would cause merge
failures if it's only in HEAD.
2016-12-16 11:50:32 -05:00
Robert Haas 3761fe3c20 Simplify LWLock tranche machinery by removing array_base/array_stride.
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche.  So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached.  For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset.  This is arguably more useful,
and certainly a lot cheaper.  Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.

The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".

The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query.  There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
use simultaneously.  So if we didn't get rid of this mechanism we'd
have to make it even more complicated.  By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.

Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
2016-12-16 11:29:23 -05:00
Robert Haas b81b5a96f4 Unbreak Finalize HashAggregate over Partial HashAggregate.
Commit 5dfc198146 introduced the use
of a new type of hash table with linear reprobing for hash aggregates.
Such a hash table behaves very poorly if keys are inserted in hash
order, which does in fact happen in the case where a query use a
Finalize HashAggregate node fed (via Gather) by a Partial
HashAggregate node.  In fact, queries with this type of plan tend
to run effectively forever.

Fix that by seeding the hash value differently in each worker
(and in the leader, if it participates).

Andres Freund and Robert Haas
2016-12-16 10:03:08 -05:00
Robert Haas 25216c9893 Remove _hash_wrtbuf() in favor of calling MarkBufferDirty().
The whole concept of _hash_wrtbuf() is that we need to know at the
time we're releasing the buffer lock (and pin) whether we dirtied the
buffer, but this is easy to get wrong.  This patch actually fixes one
non-obvious bug of that form: hashbucketcleanup forgot to signal
_hash_squeezebucket, which gets the primary bucket page already
locked, as to whether it had already dirtied the page.  Calling
MarkBufferDirty() at the places where we dirty the buffer is more
intuitive and lets us simplify the code in various places as well.

On top of all that, the ultimate goal here is to make hash indexes
WAL-logged, and as the comments to _hash_wrtbuf() note, it should
go away when that happens.  Making it go away a little earlier than
that seems like a good preparatory step.

Report by Jeff Janes.  Diagnosis by Amit Kapila, Kuntal Ghosh,
and Dilip Kumar.  Patch by me, after studying an alternative patch
submitted by Amit Kapila.

Discussion: http://postgr.es/m/CAA4eK1Kf6tOY0oVz_SEdngiNFkeXrA3xUSDPPORQvsWVPdKqnA@mail.gmail.com
2016-12-16 09:37:28 -05:00
Robert Haas 4b9a98e154 Clean up code, comments, and formatting for table partitioning.
Amit Langote, plus pgindent-ing by me.  Inspired in part by review
comments from Tomas Vondra.
2016-12-13 10:59:14 -05:00
Robert Haas 3856cf9607 Remove should_free arguments to tuplesort routines.
Since commit e94568ecc1, the answer is
always "false", and we do not need to complicate the API by arranging
to return a constant value.

Peter Geoghegan

Discussion: http://postgr.es/m/CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
2016-12-12 15:57:35 -05:00
Tom Lane 9b3d02c2a9 Catversion bump for temporary replication slots.
Missed in commit a924c327e2.
Per Fujii Masao.
2016-12-12 14:41:49 -05:00
Tom Lane be7b2848c6 Make the different Unix-y semaphore implementations ABI-compatible.
Previously, the "sem" field of PGPROC varied in size depending on which
kernel semaphore API we were using.  That was okay as long as there was
only one likely choice per platform, but in the wake of commit ecb0d20a9,
that assumption seems rather shaky.  It doesn't seem out of the question
anymore that an extension compiled against one API choice might be loaded
into a postmaster built with another choice.  Moreover, this prevents any
possibility of selecting the semaphore API at postmaster startup, which
might be something we want to do in future.

Hence, change PGPROC.sem to be PGSemaphore (i.e. a pointer) for all Unix
semaphore APIs, and turn the pointed-to data into an opaque struct whose
contents are only known within the responsible modules.

For the SysV and unnamed-POSIX APIs, the pointed-to data has to be
allocated elsewhere in shared memory, which takes a little bit of
rejiggering of the InitShmemAllocation code sequence.  (I invented a
ShmemAllocUnlocked() function to make that a little cleaner than it used
to be.  That function is not meant for any uses other than the ones it
has now, but it beats having InitShmemAllocation() know explicitly about
allocation of space for semaphores and spinlocks.)  This change means an
extra indirection to access the semaphore data, but since we only touch
that when blocking or awakening a process, there shouldn't be any
meaningful performance penalty.  Moreover, at least for the unnamed-POSIX
case on Linux, the sem_t type is quite a bit wider than a pointer, so this
reduces sizeof(PGPROC) which seems like a good thing.

For the named-POSIX API, there's effectively no change: the PGPROC.sem
field was and still is a pointer to something returned by sem_open() in
the postmaster's memory space.  Document and check the pre-existing
limitation that this case can't work in EXEC_BACKEND mode.

It did not seem worth unifying the Windows semaphore ABI with the Unix
cases, since there's no likelihood of needing ABI compatibility much less
runtime switching across those cases.  However, we can simplify the Windows
code a bit if we define PGSemaphore as being directly a HANDLE, rather than
pointer to HANDLE, so let's do that while we're here.  (This also ends up
being no change in what's physically stored in PGPROC.sem.  We're just
moving the HANDLE fetch from callees to callers.)

It would take a bunch of additional code shuffling to get to the point of
actually choosing a semaphore API at postmaster start, but the effects
of that would now be localized in the port/XXX_sema.c files, so it seems
like fit material for a separate patch.  The need for it is unproven as
yet, anyhow, whereas the ABI risk to extensions seems real enough.

Discussion: https://postgr.es/m/4029.1481413370@sss.pgh.pa.us
2016-12-12 13:32:10 -05:00
Peter Eisentraut a924c327e2 Add support for temporary replication slots
This allows creating temporary replication slots that are removed
automatically at the end of the session or on error.

From: Petr Jelinek <petr.jelinek@2ndquadrant.com>
2016-12-12 08:38:17 -05:00
Heikki Linnakangas e7f051b8f9 Refactor the code for verifying user's password.
Split md5_crypt_verify() into three functions:
* get_role_password() to fetch user's password from pg_authid, and check
  its expiration.
* md5_crypt_verify() to check an MD5 authentication challenge
* plain_crypt_verify() to check a plaintext password.

get_role_password() will be needed as a separate function by the upcoming
SCRAM authentication patch set. Most of the remaining functionality in
md5_crypt_verify() was different for MD5 and plaintext authentication, so
split that for readability.

While we're at it, simplify the *_crypt_verify functions by using
stack-allocated buffers to hold the temporary MD5 hashes, instead of
pallocing.

Reviewed by Michael Paquier.

Discussion: https://www.postgresql.org/message-id/3029e460-d47c-710e-507e-d8ba759d7cbb@iki.fi
2016-12-12 12:48:13 +02:00
Tom Lane 0b78106cd4 Fix reporting of column typmods for multi-row VALUES constructs.
expandRTE() and get_rte_attribute_type() reported the exprType() and
exprTypmod() values of the expressions in the first row of the VALUES as
being the column type/typmod returned by the VALUES RTE.  That's fine for
the data type, since we coerce all expressions in a column to have the same
common type.  But we don't coerce them to have a common typmod, so it was
possible for rows after the first one to return values that violate the
claimed column typmod.  This leads to the incorrect result seen in bug
#14448 from Hassan Mahmood, as well as some other corner-case misbehaviors.

The desired behavior is the same as we use in other type-unification
cases: report the common typmod if there is one, but otherwise return -1
indicating no particular constraint.  It's cheap for transformValuesClause
to determine the common typmod while transforming a multi-row VALUES, but
it'd be less cheap for expandRTE() and get_rte_attribute_type() to
re-determine that info every time they're asked --- possibly a lot less
cheap, if the VALUES has many rows.  Therefore, the best fix is to record
the common typmods explicitly in a list in the VALUES RTE, as we were
already doing for column collations.  This looks quite a bit like what
we're doing for CTE RTEs, so we can save a little bit of space and code by
unifying the representation for those two RTE types.  They both now share
coltypes/coltypmods/colcollations fields.  (At some point it might seem
desirable to populate those fields for all RTE types; but right now it
looks like constructing them for other RTE types would add more code and
cycles than it would save.)

The RTE change requires a catversion bump, so this fix is only usable
in HEAD.  If we fix this at all in the back branches, the patch will
need to look quite different.

Report: https://postgr.es/m/20161205143037.4377.60754@wrigleys.postgresql.org
Discussion: https://postgr.es/m/27429.1480968538@sss.pgh.pa.us
2016-12-08 11:40:02 -05:00
Heikki Linnakangas fe7bdf0bf6 Clean up password authentication code a bit.
Commit fe0a0b59, which moved code to do MD5 authentication to a separate
CheckMD5Auth() function, left behind a comment that really belongs inside
the function, too. Also move the check for db_user_namespace inside the
function, seems clearer that way.

Now that the md5 salt is passed as argument to md5_crypt_verify, it's a bit
silly that it peeks into the Port struct to see if MD5 authentication was
used. Seems more straightforward to treat it as an MD5 authentication, if
the md5 salt argument is given. And after that, md5_crypt_verify only used
the Port argument to look at port->user_name, but that is redundant,
because it is also passed as a separate 'role' argument. So remove the Port
argument altogether.
2016-12-08 13:44:47 +02:00
Robert Haas f0e44751d7 Implement table partitioning.
Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own.  The children are called
partitions and contain all of the actual data.  Each partition has an
implicit partitioning constraint.  Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed.  Partitions
can't have extra columns and may not allow nulls unless the parent
does.  Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.

Currently, tables can be range-partitioned or list-partitioned.  List
partitioning is limited to a single column, but range partitioning can
involve multiple columns.  A partitioning "column" can be an
expression.

Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations.  The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.

Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others.  Minor revisions by me.
2016-12-07 13:17:55 -05:00
Stephen Frost cb9dcbc1ee Bump catversion for restrictive RLS changes
Mea culpa.

Pointed out by Andres.
2016-12-06 10:12:31 -05:00
Tom Lane 3ebf2b4545 Remove extraneous semicolon from uses of relptr_declare().
If we're going to write a semicolon after calls of relptr_declare(),
then we don't need one inside the macro, and removing it suppresses
"empty declaration" warnings from pickier compilers (eg pademelon).

While at it, we might as well use relptr() inside relptr_declare(),
because otherwise that macro would likely go unused altogether.

Also improve the comment, which I for one found unclear,
and provide a specific example of intended usage.
2016-12-05 20:27:55 -05:00
Stephen Frost 093129c9d9 Add support for restrictive RLS policies
We have had support for restrictive RLS policies since 9.5, but they
were only available through extensions which use the appropriate hooks.
This adds support into the grammer, catalog, psql and pg_dump for
restrictive RLS policies, thus reducing the cases where an extension is
necessary.

In passing, also move away from using "AND"d and "OR"d in comments.
As pointed out by Alvaro, it's not really appropriate to attempt
to make verbs out of "AND" and "OR", so reword those comments which
attempted to.

Reviewed By: Jeevan Chalke, Dean Rasheed
Discussion: https://postgr.es/m/20160901063404.GY4028@tamriel.snowman.net
2016-12-05 15:50:55 -05:00
Robert Haas 88f626f868 Fix more DSA problems uncovered by the buildfarm.
On 32-bit systems, don't try to use 64-bit DSA pointers, because the
computation of DSA_MAX_SEGMENT_SIZE overflows Size.

Cast 1 to Size before shifting it, so that the compiler doesn't
produce a result of the wrong width.

In passing, change one use of size_t to Size.
2016-12-05 10:38:08 -05:00
Robert Haas 670b3bc8f5 Try to fix some DSA-related compiler warnings.
Commit 13df76a537 was overconfident
about how portable %016lx is.  Some compilers complain because they
need %016llx, while platforms where DSA pointers are only 32 bits
get unhappy about using a 64-bit format for a 32-bit quantity.

Thomas Munro, per an off-list suggestion from me.
2016-12-05 10:01:08 -05:00
Heikki Linnakangas fe0a0b5993 Replace PostmasterRandom() with a stronger source, second attempt.
This adds a new routine, pg_strong_random() for generating random bytes,
for use in both frontend and backend. At the moment, it's only used in
the backend, but the upcoming SCRAM authentication patches need strong
random numbers in libpq as well.

pg_strong_random() is based on, and replaces, the existing implementation
in pgcrypto. It can acquire strong random numbers from a number of sources,
depending on what's available:

- OpenSSL RAND_bytes(), if built with OpenSSL
- On Windows, the native cryptographic functions are used
- /dev/urandom

Unlike the current pgcrypto function, the source is chosen by configure.
That makes it easier to test different implementations, and ensures that
we don't accidentally fall back to a less secure implementation, if the
primary source fails. All of those methods are quite reliable, it would be
pretty surprising for them to fail, so we'd rather find out by failing
hard.

If no strong random source is available, we fall back to using erand48(),
seeded from current timestamp, like PostmasterRandom() was. That isn't
cryptographically secure, but allows us to still work on platforms that
don't have any of the above stronger sources. Because it's not very secure,
the built-in implementation is only used if explicitly requested with
--disable-strong-random.

This replaces the more complicated Fortuna algorithm we used to have in
pgcrypto, which is unfortunate, but all modern platforms have /dev/urandom,
so it doesn't seem worth the maintenance effort to keep that. pgcrypto
functions that require strong random numbers will be disabled with
--disable-strong-random.

Original patch by Magnus Hagander, tons of further work by Michael Paquier
and me.

Discussion: https://www.postgresql.org/message-id/CAB7nPqRy3krN8quR9XujMVVHYtXJ0_60nqgVc6oUk8ygyVkZsA@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAB7nPqRWkNYRRPJA7-cF+LfroYV10pvjdz6GNvxk-Eee9FypKA@mail.gmail.com
2016-12-05 13:42:59 +02:00
Robert Haas 767a9039d7 Fix thinko in b3427dade1. 2016-12-02 15:06:41 -05:00
Tom Lane b3427dade1 Delete deleteWhatDependsOn() in favor of more performDeletion() flag bits.
deleteWhatDependsOn() had grown an uncomfortably large number of
assumptions about what it's used for.  There are actually only two minor
differences between what it does and what a regular performDeletion() call
can do, so let's invent additional bits in performDeletion's existing flags
argument that specify those behaviors, and get rid of deleteWhatDependsOn()
as such.  (We'd probably have done it this way from the start, except that
performDeletion didn't originally have a flags argument, IIRC.)

Also, add a SKIP_EXTENSIONS flag bit that prevents ever recursing to an
extension, and use that when dropping temporary objects at session end.
This provides a more general solution to the problem addressed in a hacky
way in commit 08dd23cec: if an extension script creates temp objects and
forgets to remove them again, the whole extension went away when its
contained temp objects were deleted.  The previous solution only covered
temp relations, but this solves it for all object types.

These changes require minor additions in dependency.c to pass the flags
to subroutines that previously didn't get them, but it's still a net
savings of code, and it seems cleaner than before.

Having done this, revert the special-case code added in 08dd23cec that
prevented addition of pg_depend records for temp table extension
membership, because that caused its own oddities: dropping an extension
that had created such a table didn't automatically remove the table,
leading to a failure if the table had another dependency on the extension
(such as use of an extension data type), or to a duplicate-name failure if
you then tried to recreate the extension.  But we keep the part that
prevents the pg_temp_nnn schema from becoming an extension member; we never
want that to happen.  Add a regression test case covering these behaviors.

Although this fixes some arguable bugs, we've heard few field complaints,
and any such problems are easily worked around by explicitly dropping temp
objects at the end of extension scripts (which seems like good practice
anyway).  So I won't risk a back-patch.

Discussion: https://postgr.es/m/e51f4311-f483-4dd0-1ccc-abec3c405110@BlueTreble.com
2016-12-02 14:57:55 -05:00
Robert Haas 13df76a537 Introduce dynamic shared memory areas.
Programmers discovered decades ago that it was useful to have a simple
interface for allocating and freeing memory, which is why malloc() and
free() were invented.  Unfortunately, those handy tools don't work
with dynamic shared memory segments because those are specific to
PostgreSQL and are not necessarily mapped at the same address in every
cooperating process.  So invent our own allocator instead.  This makes
it possible for processes cooperating as part of parallel query
execution to allocate and free chunks of memory without having to
reserve them prior to the start of execution.  It could also be used
for longer lived objects; for example, we could consider storing data
for pg_stat_statements or the stats collector in shared memory using
these interfaces, rather than writing them to files.  Basically,
anything that needs shared memory but can't predict in advance how
much it's going to need might find this useful.

Thomas Munro and Robert Haas.  The original code (of mine) on which
Thomas based his work was actually designed to be a new backend-local
memory allocator for PostgreSQL, but that hasn't gone anywhere - or
not yet, anyway.  Thomas took that work and performed major
refactoring and extensive modifications to make it work with dynamic
shared memory, including the addition of appropriate locking.

Discussion: CA+TgmobkeWptGwiNa+SGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A@mail.gmail.com
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
2016-12-02 12:34:36 -05:00
Robert Haas 13e14a78ea Management of free memory pages.
This is intended as infrastructure for a full-fledged allocator for
dynamic shared memory.  The interface looks a bit like a real
allocator, but only supports allocating and freeing memory in
multiples of the 4kB page size.  Further, to free memory, you must
know the size of the span you wish to free, in pages.  While these are
make it unsuitable as an allocator in and of itself, it still serves
as very useful scaffolding for a full-fledged allocator.

Robert Haas and Thomas Munro.  This code is mostly the same as my 2014
submission, but Thomas fixed quite a few bugs and made some changes to
the interface.

Discussion: CA+TgmobkeWptGwiNa+SGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A@mail.gmail.com
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
2016-12-02 12:03:30 -05:00
Robert Haas fbc1c12a94 Add a crude facility for dealing with relative pointers.
C doesn't have any sort of built-in understanding of a pointer
relative to some arbitrary base address, but dynamic shared memory
segments can be mapped at different addresses in different processes,
so any sort of shared data structure stored within a dynamic shared
memory segment can't use absolute pointers.  We could use something
like Size to represent a relative pointer, but then the compiler
provides no type-checking.  Use stupid macro tricks to get some
type-checking.

Patch originally by me.  Concept suggested by Andres Freund.  Recently
resubmitted as part of Thomas Munro's work on dynamic shared memory
allocation.

Discussion: 20131205144434.GG12398@alap2.anarazel.de
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
2016-12-02 11:29:01 -05:00
Robert Haas b460f5d669 Add max_parallel_workers GUC.
Increase the default value of the existing max_worker_processes GUC
from 8 to 16, and add a new max_parallel_workers GUC with a maximum
of 8.  This way, even if the maximum amount of parallel query is
happening, there is still room for background workers that do other
things, as originally envisioned when max_worker_processes was added.

Julien Rouhaud, reviewed by Amit Kapila and by revised by me.
2016-12-02 07:42:58 -05:00
Alvaro Herrera 5714931b07 Fix Windows build for 78c8c81439
Author: Petr Jelínek
2016-12-02 09:40:36 -03:00
Alvaro Herrera fa2fa99552 Permit dump/reload of not-too-large >1GB tuples
Our documentation states that our maximum field size is 1 GB, and that
our maximum row size of 1.6 TB.  However, while this might be attainable
in theory with enough contortions, it is not workable in practice; for
starters, pg_dump fails to dump tables containing rows larger than 1 GB,
even if individual columns are well below the limit; and even if one
does manage to manufacture a dump file containing a row that large, the
server refuses to load it anyway.

This commit enables dumping and reloading of such tuples, provided two
conditions are met:

1. no single column is larger than 1 GB (in output size -- for bytea
   this includes the formatting overhead)
2. the whole row is not larger than 2 GB

There are three related changes to enable this:

a. StringInfo's API now has two additional functions that allow creating
a string that grows beyond the typical 1GB limit (and "long" string).
ABI compatibility is maintained.  We still limit these strings to 2 GB,
though, for reasons explained below.

b. COPY now uses long StringInfos, so that pg_dump doesn't choke
trying to emit rows longer than 1GB.

c. heap_form_tuple now uses the MCXT_ALLOW_HUGE flag in its allocation
for the input tuple, which means that large tuples are accepted on
input.  Note that at this point we do not apply any further limit to the
input tuple size.

The main reason to limit to 2 GB is that the FE/BE protocol uses 32 bit
length words to describe each row; and because the documentation is
ambiguous on its signedness and libpq does consider it signed, we cannot
use the highest-order bit.  Additionally, the StringInfo API uses "int"
(which is 4 bytes wide in most platforms) in many places, so we'd need
to change that API too in order to improve, which has lots of fallout.

Backpatch to 9.5, which is the oldest that has
MemoryContextAllocExtended, a necessary piece of infrastructure.  We
could apply to 9.4 with very minimal additional effort, but any further
than that would require backpatching "huge" allocations too.

This is the largest set of changes we could find that can be
back-patched without breaking compatibility with existing systems.
Fixing a bigger set of problems (for example, dumping tuples bigger than
2GB, or dumping fields bigger than 1GB) would require changing the FE/BE
protocol and/or changing the StringInfo API in an ABI-incompatible way,
neither of which would be back-patchable.

Authors: Daniel Vérité, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql
2016-12-02 00:34:01 -03:00
Peter Eisentraut 78c8c81439 Refactor libpqwalreceiver
The whole walreceiver API is now wrapped into a struct, like most of our
other loadable module APIs.  The libpq connection is no longer a global
variable in libpqwalreceiver.  Instead, it is encapsulated into a struct
that is passed around the functions.  This allows multiple walreceivers
to run at the same time.

Add some rudimentary support for logical replication connections to
libpqwalreceiver.

These changes are mostly cosmetic and are going to be useful for the
future logical replication patches.

From: Petr Jelinek <petr@2ndquadrant.com>
2016-12-01 20:23:28 -05:00
Peter Eisentraut 597a87ccc9 Use latch instead of select() in walreceiver
Replace use of poll()/select() by WaitLatchOrSocket(), which is more
portable and flexible.

Also change walreceiver to use its procLatch instead of a custom latch.

From: Petr Jelinek <petr@2ndquadrant.com>
2016-12-01 20:23:28 -05:00
Andres Freund fc4b3dea29 User narrower representative tuples in the hash-agg hashtable.
So far the hashtable stored representative tuples in the form of its
input slot, with all columns in the hashtable that are not
needed (i.e. not grouped upon or functionally dependent) set to NULL.

Thats good for saving memory, but it turns out that having tuples full
of NULL isn't free. slot_deform_tuple is faster if there's no NULL
bitmap even if no NULLs are encountered, and skipping over leading NULLs
isn't free.

So compute a separate tuple descriptor that only contains the needed
columns. As columns have already been moved in/out the slot for the
hashtable that does not imply additional per-row overhead.

Author: Andres Freund
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/20161103110721.h5i5t5saxfk5eeik@alap3.anarazel.de
2016-11-30 17:30:09 -08:00
Andres Freund 8ed3f11bb0 Perform one only projection to compute agg arguments.
Previously we did a ExecProject() for each individual aggregate
argument. That turned out to be a performance bottleneck in queries with
multiple aggregates.

Doing all the argument computations in one ExecProject() is quite a bit
cheaper because ExecProject's fastpath can do the work at once in a
relatively tight loop, and because it can get all the required columns
with a single slot_getsomeattr and save some other redundant setup
costs.

Author: Andres Freund
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/20161103110721.h5i5t5saxfk5eeik@alap3.anarazel.de
2016-11-30 16:20:24 -08:00
Robert Haas 6d46f4783e Improve hash index bucket split behavior.
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency.  Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.

In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel.  There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.

This patch also removes the unworldly assumption that a split will
never be interrupted.  With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion.  While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.

Amit Kapila.  I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's.  Also
reviewed by Jesper Pedersen, Jeff Janes, and others.

Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
2016-11-30 15:39:21 -05:00
Tom Lane 11da83a0e7 Add uuid to the set of types supported by contrib/btree_gist.
Paul Jungwirth, reviewed and hacked on by Teodor Sigaev, Ildus
Kurbangaliev, Adam Brusselback, Chris Bandy, and myself.

Discussion: https://postgr.es/m/CA+renyUEE29=X01JXdz8_TQvo6n9=2XoEBBRnQ8rkLyr+kjPxQ@mail.gmail.com
Discussion: https://postgr.es/m/55F6EE82.8080209@sigaev.ru
2016-11-29 14:08:34 -05:00
Robert Haas 273270593f Mark IsPostmasterEnvironment and IsBackgroundWorker as PGDLLIMPORT.
Per request from Craig Ringer.
2016-11-26 10:29:18 -05:00
Tom Lane dbdfd114f3 Bring some clarity to the defaults for the xxx_flush_after parameters.
Instead of confusingly stating platform-dependent defaults for these
parameters in the comments in postgresql.conf.sample (with the main
entry being a lie on Linux), teach initdb to install the correct
platform-dependent value in postgresql.conf, similarly to the way
we handle other platform-dependent defaults.  This won't do anything
for existing 9.6 installations, but since it's effectively only a
documentation improvement, that seems OK.

Since this requires initdb to have access to the default values,
move the #define's for those to pg_config_manual.h; the original
placement in bufmgr.h is unworkable because that file can't be
included by frontend programs.

Adjust the default value for wal_writer_flush_after so that it is 1MB
regardless of XLOG_BLCKSZ, conforming to what is stated in both the
SGML docs and postgresql.conf.  (We could alternatively make it scale
with XLOG_BLCKSZ, but I'm not sure I see the point.)

Copy-edit related SGML documentation.

Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra.

Discussion: <30ebc6e3-8358-09cf-44a8-578252938424@2ndquadrant.com>
2016-11-25 18:36:10 -05:00
Robert Haas e343dfa42b Remove barrier.h
A new thing also called a "barrier" is proposed, but whether we decide
to take that patch or not, this file seems to have outlived its
usefulness.

Thomas Munro
2016-11-22 20:28:24 -05:00
Tom Lane 906bfcad7b Improve handling of "UPDATE ... SET (column_list) = row_constructor".
Previously, the right-hand side of a multiple-column assignment, if it
wasn't a sub-SELECT, had to be a simple parenthesized expression list,
because gram.y was responsible for "bursting" the construct into
independent column assignments.  This had the minor defect that you
couldn't write ROW (though you should be able to, since the standard says
this is a row constructor), and the rather larger defect that unlike other
uses of row constructors, we would not expand a "foo.*" item into multiple
columns.

Fix that by changing the RHS to be just "a_expr" in the grammar, leaving
it to transformMultiAssignRef to separate the elements of a RowExpr;
which it will do only after performing standard transformation of the
RowExpr, so that "foo.*" behaves as expected.

The key reason we didn't do that before was the hard-wired handling of
DEFAULT tokens (SetToDefault nodes).  This patch deals with that issue by
allowing DEFAULT in any a_expr and having parse analysis throw an error
if SetToDefault is found in an unexpected place.  That's an improvement
anyway since the error can be more specific than just "syntax error".

The SQL standard suggests that the RHS could be any a_expr yielding a
suitable row value.  This patch doesn't really move the goal posts in that
respect --- you're still limited to RowExpr or a sub-SELECT --- but it does
fix the grammar restriction, so it provides some tangible progress towards
a full implementation.  And the limitation is now documented by an explicit
error message rather than an unhelpful "syntax error".

Discussion: <8542.1479742008@sss.pgh.pa.us>
2016-11-22 15:20:10 -05:00
Robert Haas e8ac886c24 Support condition variables.
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true.  In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time.  However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set.  Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.

Robert Haas and Thomas Munro
2016-11-22 14:27:11 -05:00
Peter Eisentraut 67dc4ccbb2 Add pg_sequences view
Like pg_tables, pg_views, and others, this view contains information
about sequences in a way that is independent of the system catalog
layout but more comprehensive than the information schema.

To help implement the view, add a new internal function
pg_sequence_last_value() to return the last value of a sequence.  This
is kept separate from pg_sequence_parameters() to separate querying
run-time state from catalog-like information.

Reviewed-by: Andreas Karlsson <andreas@proxel.se>
2016-11-18 14:59:03 -05:00
Robert Haas b40b4dd9e1 Reserve zero as an invalid DSM handle.
Previously, the handle for the control segment could not be zero, but
some other DSM segment could potentially have a handle value of zero.
However, that means that if someone wanted to store a dsm_handle that
might or might not be valid, they would need a separate boolean to
keep track of whether the associated value is legal.  That's annoying,
so change things so that no DSM segment can ever have a handle of 0 -
or as we call it here, DSM_HANDLE_INVALID.

Thomas Munro.  This was submitted as part of a much larger patch to
add an malloc-like allocator for dynamic shared memory, but this part
seems like a good idea independently of the rest of the patch.
2016-11-15 16:33:29 -05:00
Tom Lane ffaa44cb55 Account for catalog snapshot in PGXACT->xmin updates.
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced.  In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures.  This effect appears to be the explanation for a recent
failure on buildfarm member piculet.

To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.

In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before.  To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input.  (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)

In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c.  And improve some obsolete comments.

No regression test because I don't know a deterministic way to cause this
failure.  But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.

Back-patch to 9.4 where MVCC catalog scans were introduced.  (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4.  But MVCC catalog scans were supposed to fix that.)

Discussion: <16447.1478818294@sss.pgh.pa.us>
2016-11-15 15:55:35 -05:00
Tom Lane 24aef33804 Cleanup of rewriter and planner handling of Query.hasRowSecurity flag.
Be sure to pull up the subquery's hasRowSecurity flag when flattening a
subquery in pull_up_simple_subquery().  This isn't a bug today because
we don't look at the hasRowSecurity flag during planning, but it could
easily be a bug tomorrow.

Likewise, make rewriteRuleAction() pull up the hasRowSecurity flag when
absorbing RTEs from a rule action.  This isn't a bug either, for the
opposite reason: the flag should never be set yet.  But again, it seems
like good future proofing.

Add a comment explaining why rewriteTargetView() should *not* set
hasRowSecurity when adding stuff to securityQuals.

Improve some nearby comments about securityQuals processing, and document
that field more completely in parsenodes.h.

Patch by me, analysis by Dean Rasheed.

Discussion: <CAEZATCXZ8tb2DV6f=bkhsMV6u_gRcZ0CZBw2J-qU84RxSukZog@mail.gmail.com>
2016-11-10 16:16:33 -05:00
Tom Lane 530f806524 Re-allow user_catalog_table option for materialized views.
The reloptions stuff allows this option to be set on a matview.
While it's questionable whether that is useful or was really intended,
it does work, and we shouldn't change that in minor releases.  Commit
e3e66d8a9 disabled the option since I didn't realize that it was
possible for it to be set on a matview.  Tweak the test to re-allow it.

Discussion: <19749.1478711862@sss.pgh.pa.us>
2016-11-10 15:00:58 -05:00
Tom Lane 1833f1a1c3 Simplify code by getting rid of SPI_push, SPI_pop, SPI_restore_connection.
The idea behind SPI_push was to allow transitioning back into an
"unconnected" state when a SPI-using procedure calls unrelated code that
might or might not invoke SPI.  That sounds good, but in practice the only
thing it does for us is to catch cases where a called SPI-using function
forgets to call SPI_connect --- which is a highly improbable failure mode,
since it would be exposed immediately by direct testing of said function.
As against that, we've had multiple bugs induced by forgetting to call
SPI_push/SPI_pop around code that might invoke SPI-using functions; these
are much harder to catch and indeed have gone undetected for years in some
cases.  And we've had to band-aid around some problems of this ilk by
introducing conditional push/pop pairs in some places, which really kind
of defeats the purpose altogether; if we can't draw bright lines between
connected and unconnected code, what's the point?

Hence, get rid of SPI_push[_conditional], SPI_pop[_conditional], and the
underlying state variable _SPI_curid.  It turns out SPI_restore_connection
can go away too, which is a nice side benefit since it was never more than
a kluge.  Provide no-op macros for the deleted functions so as to avoid an
API break for external modules.

A side effect of this removal is that SPI_palloc and allied functions no
longer permit being called when unconnected; they'll throw an error
instead.  The apparent usefulness of the previous behavior was a mirage
as well, because it was depended on by only a few places (which I fixed in
preceding commits), and it posed a risk of allocations being unexpectedly
long-lived if someone forgot a SPI_push call.

Discussion: <20808.1478481403@sss.pgh.pa.us>
2016-11-08 17:39:57 -05:00
Tom Lane 9257f07872 Replace uses of SPI_modifytuple that intend to allocate in current context.
Invent a new function heap_modify_tuple_by_cols() that is functionally
equivalent to SPI_modifytuple except that it always allocates its result
by simple palloc.  I chose however to make the API details a bit more
like heap_modify_tuple: pass a tupdesc rather than a Relation, and use
bool convention for the isnull array.

Use this function in place of SPI_modifytuple at all call sites where the
intended behavior is to allocate in current context.  (There actually are
only two call sites left that depend on the old behavior, which makes me
wonder if we should just drop this function rather than keep it.)

This new function is easier to use than heap_modify_tuple() for purposes
of replacing a single column (or, really, any fixed number of columns).
There are a number of places where it would simplify the code to change
over, but I resisted that temptation for the moment ... everywhere except
in plpgsql's exec_assign_value(); changing that might offer some small
performance benefit, so I did it.

This is on the way to removing SPI_push/SPI_pop, but it seems like
good code cleanup in its own right.

Discussion: <9633.1478552022@sss.pgh.pa.us>
2016-11-08 15:36:44 -05:00
Tom Lane e3e66d8a98 Band-aid fix for incorrect use of view options as StdRdOptions.
We really ought to make StdRdOptions and the other decoded forms of
reloptions self-identifying, but for the moment, assume that only plain
relations could possibly be user_catalog_tables.  Fixes problem with bogus
"ON CONFLICT is not supported on table ... used as a catalog table" error
when target is a view with cascade option.

Discussion: <26681.1477940227@sss.pgh.pa.us>
2016-11-07 12:08:18 -05:00
Tom Lane 33cb96ba1a Revert "Provide DLLEXPORT markers for C functions via PG_FUNCTION_INFO_V1 macro."
This reverts commit c8ead2a397.
Seems there is no way to do this that doesn't cause MSVC to give
warnings, so let's just go back to the way we've been doing it.

Discussion: <11843.1478358206@sss.pgh.pa.us>
2016-11-07 10:19:22 -05:00
Tom Lane 86d19d27ce Remove duplicate macro definition.
Seems to be a copy-and-pasteo.  Odd that we heard no reports of
compiler warnings about it.

Thomas Munro
2016-11-05 11:51:46 -04:00
Tom Lane 06f5fd2f4f pgwin32_is_junction's argument should be "const char *" not "char *".
We're passing const strings to it in places, and that's not an
unreasonable thing to do.  Per buildfarm (noted on frogmouth
in particular).
2016-11-05 11:14:10 -04:00
Tom Lane c8ead2a397 Provide DLLEXPORT markers for C functions via PG_FUNCTION_INFO_V1 macro.
Second try at the change originally made in commit 8518583cd;
this time with contrib updates so that manual extern declarations
are also marked with PGDLLEXPORT.  The release notes should point
this out as a significant source-code change for extension authors,
since they'll have to make similar additions to avoid trouble on Windows.

Laurenz Albe, doc change by me

Patch: <A737B7A37273E048B164557ADEF4A58B53962ED8@ntex2010a.host.magwien.gv.at>
2016-11-04 19:04:56 -04:00
Kevin Grittner 8c48375e5f Implement syntax for transition tables in AFTER triggers.
This is infrastructure for the complete SQL standard feature.  No
support is included at this point for execution nodes or PLs.  The
intent is to add that soon.

As this patch leaves things, standard syntax can create tuplestores
to contain old and/or new versions of rows affected by a statement.
References to these tuplestores are in the TriggerData structure.
C triggers can access the tuplestores directly, so they are usable,
but they cannot yet be referenced within a SQL statement.
2016-11-04 10:49:50 -05:00
Robert Haas f2e6a2ccf1 Add API to check if an existing exclusive lock allows cleanup.
LockBufferForCleanup() acquires a cleanup lock unconditionally, and
ConditionalLockBufferForCleanup() acquires a cleanup lock if it is
possible to do so without waiting; this patch adds a new API,
IsBufferCleanupOK(), which tests whether an exclusive lock already
held happens to be a cleanup lock.  This is possible because a cleanup
lock simply means an exclusive lock plus the assurance any other pins
on the buffer are newer than our own pin.  Therefore, just as the
existing functions decide that the exclusive lock that they've just
taken is a cleanup lock if they observe the pin count to be 1, this
new function allows us to observe that the pin count is 1 on a buffer
we've already locked.

This is useful in situations where a backend definitely wishes to
modify the buffer and also wishes to perform cleanup operations if
possible.  The patch to eliminate heavyweight locking by hash indexes
uses this, and it may have other applications as well.

Amit Kapila, per a suggestion from me.  Some comment adjustments by me
as well.
2016-11-04 09:32:24 -04:00
Robert Haas 6bb9a6177d Remove declarations for pq_putmessage_hook and pq_flush_hook.
Commit 2bd9e412f9 added these in error.
They were part of an earlier design for that patch and survived in the
committed version only by inadvertency.

Julien Rouhaud
2016-10-31 09:14:46 -04:00
Peter Eisentraut c32fe432af Avoid using a C++ keyword in header file
per cpluspluscheck
2016-10-26 22:41:56 -04:00
Heikki Linnakangas 56f39009c5 Fix typos in comments.
Vinayak Pokale
2016-10-26 11:12:31 +03:00
Magnus Hagander 56c7d8d455 Allow pg_basebackup to stream transaction log in tar mode
This will write the received transaction log into a file called
pg_wal.tar(.gz) next to the other tarfiles instead of writing it to
base.tar. When using fetch mode, the transaction log is still written to
base.tar like before, and when used against a pre-10 server, the file
is named pg_xlog.tar.

To do this, implement a new concept of a "walmethod", which is
responsible for writing the WAL. Two implementations exist, one that
writes to a plain directory (which is also used by pg_receivexlog) and
one that writes to a tar file with optional compression.

Reviewed by Michael Paquier
2016-10-23 15:23:11 +02:00
Robert Haas f82ec32ac3 Rename "pg_xlog" directory to "pg_wal".
"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believe that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences.  So, rename the directory to "pg_wal".

This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.

We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal".  However, that's still under discussion, so let's do just this
much for now.

Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com

Michael Paquier
2016-10-20 11:32:18 -04:00
Andres Freund 90d3da11c9 Fix a few typos in simplehash.h.
Author: Erik Rijkers
Discussion: <274e4c8ac545d6622735f97c1f6c354b@xs4all.nl>
2016-10-18 10:55:56 -07:00
Robert Haas fca41acb86 Fix typo in comment.
Amit Langote
2016-10-18 13:43:27 -04:00
Heikki Linnakangas faae1c918e Revert "Replace PostmasterRandom() with a stronger way of generating randomness."
This reverts commit 9e083fd468. That was a
few bricks shy of a load:

* Query cancel stopped working
* Buildfarm member pademelon stopped working, because the box doesn't have
  /dev/urandom nor /dev/random.

This clearly needs some more discussion, and a quite different patch, so
revert for now.
2016-10-18 16:28:23 +03:00
Heikki Linnakangas 9e083fd468 Replace PostmasterRandom() with a stronger way of generating randomness.
This adds a new routine, pg_strong_random() for generating random bytes,
for use in both frontend and backend. At the moment, it's only used in
the backend, but the upcoming SCRAM authentication patches need strong
random numbers in libpq as well.

pg_strong_random() is based on, and replaces, the existing implementation
in pgcrypto. It can acquire strong random numbers from a number of sources,
depending on what's available:
- OpenSSL RAND_bytes(), if built with OpenSSL
- On Windows, the native cryptographic functions are used
- /dev/urandom
- /dev/random

Original patch by Magnus Hagander, with further work by Michael Paquier
and me.

Discussion: <CAB7nPqRy3krN8quR9XujMVVHYtXJ0_60nqgVc6oUk8ygyVkZsA@mail.gmail.com>
2016-10-17 11:52:50 +03:00
Andres Freund 5dfc198146 Use more efficient hashtable for execGrouping.c to speed up hash aggregation.
The more efficient hashtable speeds up hash-aggregations with more than
a few hundred groups significantly. Improvements of over 120% have been
measured.

Due to the the different hash table queries that not fully
determined (e.g. GROUP BY without ORDER BY) may change their result
order.

The conversion is largely straight-forward, except that, due to the
static element types of simplehash.h type hashes, the additional data
some users store in elements (e.g. the per-group working data for hash
aggregaters) is now stored in TupleHashEntryData->additional.  The
meaning of BuildTupleHashTable's entrysize (renamed to additionalsize)
has been changed to only be about the additionally stored size.  That
size is only used for the initial sizing of the hash-table.

Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-14 17:22:51 -07:00
Andres Freund b30d3ea824 Add a macro templatized hashtable.
dynahash.c hash tables aren't quite fast enough for some
use-cases. There are several reasons for lacking performance:
- the use of chaining for collision handling makes them cache
  inefficient, that's especially an issue when the tables get bigger.
- as the element sizes for dynahash are only determined at runtime,
  offset computations are somewhat expensive
- hash and element comparisons are indirect function calls, causing
  unnecessary pipeline stalls
- it's two level structure has some benefits (somewhat natural
  partitioning), but increases the number of indirections
to fix several of these the hash tables have to be adjusted to the
individual use-case at compile-time. C unfortunately doesn't provide a
good way to do compile code generation (like e.g. c++'s templates for
all their weaknesses do).  Thus the somewhat ugly approach taken here is
to allow for code generation using a macro-templatized header file,
which generates functions and types based on a prefix and other
parameters.

Later patches use this infrastructure to use such hash tables for
tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation,
...). In queries where these use up a large fraction of the time, this
has been measured to lead to performance improvements of over 100%.

There are other cases where this could be useful (e.g. catcache.c).

The hash table design chosen is a variant of linear open-addressing. The
biggest disadvantage of simple linear addressing schemes are highly
variable lookup times due to clustering, and deletions leaving a lot of
tombstones around.  To address these issues a variant of "robin hood"
hashing is employed.  Robin hood hashing optimizes chaining lengths by
moving elements close to their optimal bucket ("rich" elements), out of
the way if a to-be-inserted element is further away from its optimal
position (i.e. it's "poor").  While that can make insertions slower, the
average lookup performance is a lot better, and higher fill factors can
be used in a still performant manner.  To avoid tombstones - which
normally solve the issue that a deleted node's presence is relevant to
determine whether a lookup needs to continue looking or is done -
buckets following a deleted element are shifted backwards, unless
they're empty or already at their optimal position.

There's further possible improvements that can be made to this
implementation. Amongst others:
- Use distance as a termination criteria during searches. This is
  generally a good idea, but I've been able to see the overhead of
  distance calculations in some cases.
- Consider combining the 'empty' status into the hashvalue, and enforce
  storing the hashvalue. That could, in some cases, increase memory
  density and remove a few instructions.
- Experiment further with the, very conservatively choosen, fillfactor.
- Make maximum size of hashtable configurable, to allow storing very
  very large tables. That'd require 64bit hash values to be more common
  than now, though.
- some smaller memcpy calls could be optimized to copy larger chunks
But since the new implementation is already considerably faster than
dynahash it seem sensible to start using it.

Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-14 16:07:38 -07:00
Andres Freund aa3ca5e3dd Add likely/unlikely() branch hint macros.
These are useful for very hot code paths. Because it's easy to guess
wrongly about likelihood, and because such likelihoods change over time,
they should be used sparingly.

Past tests have shown it'd be a good idea to use them in some places,
e.g. in error checks around ereports that ERROR out, but that's work for
later.

Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
2016-10-14 16:05:30 -07:00
Tom Lane 4f52fd3c6d Revert addition of PGDLLEXPORT in PG_FUNCTION_INFO_V1 macro.
This turns out not to be as harmless as I thought: MSVC will complain
if it sees an "extern" declaration without PGDLLEXPORT and then one with.
(Seems fairly silly, given that this can be changed after the fact by the
linker, but there you have it.)  Therefore, contrib modules that have
extern's for V1 functions in header files are falling over in the
buildfarm, since none of those externs are marked PGDLLEXPORT.

We might or might not conclude that we're willing to plaster those
declarations with PGDLLEXPORT in HEAD, but in any case there's no way we're
going to ship this change in the back branches.  Third-party authors would
not thank us for breaking their code in a minor release.  Hence, revert
the addition of PGDLLEXPORT (but let's keep the extra info in the comment).
If we do the other changes we can revert this commit in HEAD.

Per buildfarm.
2016-10-12 18:01:43 -04:00
Tom Lane 5c80642aa8 Remove unnecessary int2vector-specific hash function and equality operator.
These functions were originally added in commit d8cedf67a to support
use of int2vector columns as catcache lookup keys.  However, there are
no catcaches that use such columns.  (Indeed I now think it must always
have been dead code: a catcache with such a key column would need an
underlying unique index on the column, but we've never had an int2vector
btree opclass.)

Getting rid of the int2vector-specific operator and function does not
lose any functionality, because operations on int2vectors will now fall
back to the generic anyarray support.  This avoids a wart that a btree
index on an int2vector column (made using anyarray_ops) would fail to
match equality searches, because int2vectoreq wasn't a member of the
opclass.  We don't really care much about that, since int2vector is not
meant as a type for users to use, but it's silly to have extra code and
less functionality.

If we ever do want a catcache to be indexed by an int2vector column,
we'd need to put back full btree and hash opclasses for int2vector,
comparable to the support for oidvector.  (The anyarray code can't be
used at such a low level, because it needs to do catcache lookups.)
But we'll deal with that if/when the need arises.

Also worth noting is that removal of the hash int2vector_ops opclass will
break any user-created hash indexes on int2vector columns.  While hash
anyarray_ops would serve the same purpose, it would probably not compute
the same hash values and thus wouldn't be on-disk-compatible.  Given that
int2vector isn't a user-facing type and we're planning other incompatible
changes in hash indexes for v10 anyway, this doesn't seem like something
to worry about, but it's probably worth mentioning here.

Amit Langote

Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
2016-10-12 14:54:08 -04:00
Tom Lane 8518583cdb Provide DLLEXPORT markers for C functions via PG_FUNCTION_INFO_V1 macro.
This isn't really necessary for our own code, because we use a .DEF file
in MSVC builds (see gendef.pl), or --export-all-symbols in MinGW and
Cygwin builds, to ensure that all global symbols in loadable modules
will be exported on Windows.  However, third-party authors might use
different build processes that need this marker, and it's harmless
enough for our own builds.

To some extent, this is an oversight in commit e7128e8db, so back-patch
to 9.4 where that was added.

Laurenz Albe

Discussion: <A737B7A37273E048B164557ADEF4A58B539300BD@ntex2010a.host.magwien.gv.at>
2016-10-12 12:45:50 -04:00
Heikki Linnakangas bb55dd6059 Fix copy-pasto in comment.
Amit Langote
2016-10-12 12:07:54 +03:00
Heikki Linnakangas b75f467b6e Simplify the code for logical tape read buffers.
Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
setting it earlier with the separate LogicTapeAssignReadBufferSize call.
This way, the buffer size is set closer to where it's actually used, which
makes the code easier to understand.

This makes the calculation for how much memory to use for the buffers less
precise. We now use the same amount of memory for every tape, rounded down
to the nearest BLCKSZ boundary, instead of using one more block for some
tapes, to get the total up to exact amount of memory available. That should
be OK, merging isn't too sensitive to the exact amount of memory used.

Reviewed by Peter Geoghegan

Discussion: <0f607c4b-df23-353e-bf56-c0389d28495f@iki.fi>
2016-10-12 12:05:45 +03:00
Tom Lane 2f1eaf87e8 Drop server support for FE/BE protocol version 1.0.
While this isn't a lot of code, it's been essentially untestable for
a very long time, because libpq doesn't support anything older than
protocol 2.0, and has not since release 6.3.  There's no reason to
believe any other client-side code still uses that protocol, either.

Discussion: <2661.1475849167@sss.pgh.pa.us>
2016-10-11 12:19:18 -04:00
Tom Lane 2b860f52ed Remove "sco" and "unixware" ports.
SCO OpenServer and SCO UnixWare are more or less dead platforms.
We have never had a buildfarm member testing the "sco" port, and
the last "unixware" member was last heard from in 2012, so it's
fair to doubt that the code even compiles anymore on either one.
Remove both ports.  We can always undo this if someone shows up
with an interest in maintaining and testing these platforms.

Discussion: <17177.1476136994@sss.pgh.pa.us>
2016-10-11 11:26:04 -04:00
Andres Freund b0779abb3a Fix fallback implementation of pg_atomic_write_u32().
I somehow had assumed that in the spinlock (in turn possibly using
semaphores) based fallback atomics implementation 32 bit writes could be
done without a lock. As far as the write goes that's correct, since
postgres supports only platforms with single-copy atomicity for aligned
32bit writes.  But writing without holding the spinlock breaks
read-modify-write operations like pg_atomic_compare_exchange_u32(),
since they'll potentially "miss" a concurrent write, which can't happen
in actual hardware implementations.

In 9.6+ when using the fallback atomics implementation this could lead
to buffer header locks not being properly marked as released, and
potentially some related state corruption.  I don't see a related danger
in 9.5 (earliest release with the API), because pg_atomic_write_u32()
wasn't used in a concurrent manner there.

The state variable of local buffers, before this change, were
manipulated using pg_atomic_write_u32(), to avoid unnecessary
synchronization overhead. As that'd not be the case anymore, introduce
and use pg_atomic_unlocked_write_u32(), which does not correctly
interact with RMW operations.

This bug only caused issues when postgres is compiled on platforms
without atomics support (i.e. no common new platform), or when compiled
with --disable-atomics, which explains why this wasn't noticed in
testing.

Reported-By: Tom Lane
Discussion: <14947.1475690465@sss.pgh.pa.us>
Backpatch: 9.5-, where the atomic operations API was introduced.
2016-10-07 16:55:15 -07:00
Robert Haas d2ce38e204 Rename WAIT_* constants to PG_WAIT_*.
Windows apparently has a constant named WAIT_TIMEOUT, and some of these
other names are pretty generic, too.  Insert "PG_" at the front of each
name in order to disambiguate.

Michael Paquier
2016-10-05 08:04:52 -04:00
Robert Haas 23843dcb60 Remove trailing commas from enums.
Buildfarm member mylodon doesn't like them.  Actually, I don't like
them either, but I failed to notice these before pushing commit
6f3bd98ebf.
2016-10-04 11:50:34 -04:00
Robert Haas 6f3bd98ebf Extend framework from commit 53be0b1ad to report latch waits.
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an
additional wait_event_info parameter; legal values are defined in
pgstat.h.  This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.

Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.

Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity.  We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.

Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.
2016-10-04 11:01:42 -04:00
Tom Lane 6bc811c992 Show a sensible value in pg_settings.unit for GUC_UNIT_XSEGS variables.
Commit 88e982302 invented GUC_UNIT_XSEGS for min_wal_size and max_wal_size,
but neglected to make it display sensibly in pg_settings.unit (by adding a
case to the switch in GetConfigOptionByNum).  Fix that, and adjust said
switch to throw a run-time error the next time somebody forgets.

In passing, avoid using a static buffer for the output string --- the rest
of this function pstrdup's from a local buffer, and I see no very good
reason why the units code should do it differently and less safely.

Per report from Otar Shavadze.  Back-patch to 9.5 where the new unit type
was added.

Report: <CAG-jOyA=iNFhN+yB4vfvqh688B7Tr5SArbYcFUAjZi=0Exp-Lg@mail.gmail.com>
2016-10-03 16:40:25 -04:00
Heikki Linnakangas e94568ecc1 Change the way pre-reading in external sort's merge phase works.
Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized slots, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we
do an external sort.

Reviewed by Peter Geoghegan and Claudio Freire.

Discussion: <CAM3SWZTpaORV=yQGVCG8Q4axcZ3MvF-05xe39ZvORdU9JcD6hQ@mail.gmail.com>
2016-10-03 13:37:49 +03:00
Peter Eisentraut cd03890d0b Fix breakage in previous change 2016-09-30 15:27:51 -04:00
Peter Eisentraut 330b48b94b Separate enum from struct
Otherwise the enum symbols are not visible outside the struct in C++.

Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>
2016-09-30 15:11:47 -04:00
Peter Eisentraut bc34223bc1 pg_basebackup pg_receivexlog: Issue fsync more carefully
Several places weren't careful about fsyncing in the way.  See 1d4a0ab1
and 606e0f98 for details about required fsyncs.

This adds a couple of functions in src/common/ that have an equivalent
in the backend: durable_rename(), fsync_parent_path()

From: Michael Paquier <michael.paquier@gmail.com>
2016-09-29 12:00:00 -04:00
Peter Eisentraut bf5bb2e85b Move fsync routines of initdb into src/common/
The intention is to used those in other utilities such as pg_basebackup
and pg_receivexlog.

From: Michael Paquier <michael.paquier@gmail.com>
2016-09-29 12:00:00 -04:00
Peter Eisentraut e79e6c4da1 Fix CRC check handling in get_controlfile
The previous patch broke this by returning NULL for a failed CRC check,
which pg_controldata would then try to read.  Fix by returning the
result of the CRC check in a separate argument.

Michael Paquier and myself
2016-09-28 12:00:00 -04:00
Heikki Linnakangas babe05bc2b Turn password_encryption GUC into an enum.
This makes the parameter easier to extend, to support other password-based
authentication protocols than MD5. (SCRAM is being worked on.)

The GUC still accepts on/off as aliases for "md5" and "plain", although
we may want to remove those once we actually add support for another
password hash type.

Michael Paquier, reviewed by David Steele, with some further edits by me.

Discussion: <CAB7nPqSMXU35g=W9X74HVeQp0uvgJxvYOuA4A-A3M+0wfEBv-w@mail.gmail.com>
2016-09-28 12:22:44 +03:00
Peter Eisentraut 440c8d1bbc Fix some typos in comment 2016-09-26 12:00:00 -04:00
Tom Lane fdc9186f7e Replace the built-in GIN array opclasses with a single polymorphic opclass.
We had thirty different GIN array opclasses sharing the same operators and
support functions.  That still didn't cover all the built-in types, nor
did it cover arrays of extension-added types.  What we want is a single
polymorphic opclass for "anyarray".  There were two missing features needed
to make this possible:

1. We have to be able to declare the index storage type as ANYELEMENT
when the opclass is declared to index ANYARRAY.  This just takes a few
more lines in index_create().  Although this currently seems of use only
for GIN, there's no reason to make index_create() restrict it to that.

2. We have to be able to identify the proper GIN compare function for
the index storage type.  This patch proceeds by making the compare function
optional in GIN opclass definitions, and specifying that the default btree
comparison function for the index storage type will be looked up when the
opclass omits it.  Again, that seems pretty generically useful.

Since the comparison function lookup is done in initGinState(), making
use of the second feature adds an additional cache lookup to GIN index
access setup.  It seems unlikely that that would be very noticeable given
the other costs involved, but maybe at some point we should consider
making GinState data persist longer than it now does --- we could keep it
in the index relcache entry, perhaps.

Rather fortuitously, we don't seem to need to do anything to get this
change to play nice with dump/reload or pg_upgrade scenarios: the new
opclass definition is automatically selected to replace existing index
definitions, and the on-disk data remains compatible.  Also, if a user has
created a custom opclass definition for a non-builtin type, this doesn't
break that, since CREATE INDEX will prefer an exact match to opcintype
over a match to ANYARRAY.  However, if there's anyone out there with
handwritten DDL that explicitly specifies _bool_ops or one of the other
replaced opclass names, they'll need to adjust that.

Tom Lane, reviewed by Enrique Meneses

Discussion: <14436.1470940379@sss.pgh.pa.us>
2016-09-26 14:52:44 -04:00
Tom Lane da6c4f6ca8 Refer to OS X as "macOS", except for the port name which is still "darwin".
We weren't terribly consistent about whether to call Apple's OS "OS X"
or "Mac OS X", and the former is probably confusing to people who aren't
Apple users.  Now that Apple has rebranded it "macOS", follow their lead
to establish a consistent naming pattern.  Also, avoid the use of the
ancient project name "Darwin", except as the port code name which does not
seem desirable to change.  (In short, this patch touches documentation and
comments, but no actual code.)

I didn't touch contrib/start-scripts/osx/, either.  I suspect those are
obsolete and due for a rewrite, anyway.

I dithered about whether to apply this edit to old release notes, but
those were responsible for quite a lot of the inconsistencies, so I ended
up changing them too.  Anyway, Apple's being ahistorical about this,
so why shouldn't we be?
2016-09-25 15:40:57 -04:00
Tom Lane 49a91b88e6 Avoid using PostmasterRandom() for DSM control segment ID.
Commits 470d886c3 et al intended to fix the problem that the postmaster
selected the same "random" DSM control segment ID on every start.  But
using PostmasterRandom() for that destroys the intended property that the
delay between random_start_time and random_stop_time will be unpredictable.
(Said delay is probably already more predictable than we could wish, but
that doesn't mean that reducing it by a couple orders of magnitude is OK.)
Revert the previous patch and add a comment warning against misuse of
PostmasterRandom.  Fix the original problem by calling srandom() early in
PostmasterMain, using a low-security seed that will later be overwritten
by PostmasterRandom.

Discussion: <20789.1474390434@sss.pgh.pa.us>
2016-09-23 09:54:11 -04:00
Tom Lane 8023b5827f Remove nearly-unused SizeOfIptrData macro.
Past refactorings have removed all but one reference to SizeOfIptrData
(and that one place was in a pretty noncritical spot).  Since nobody's
complained, it seems probable that there are no supported compilers
that don't think sizeof(ItemPointerData) is 6.  If there are, we're
wasting MAXALIGN per heap tuple anyway, so it's rather silly to worry
about whether we can shave space in places like WAL records.

Pavan Deolasee

Discussion: <CABOikdOOawDda4hwLOT6zdA6MFfPLu3Z2YBZkX0JdayNS6JOeQ@mail.gmail.com>
2016-09-22 14:30:33 -04:00
Peter Eisentraut c1dc51d484 pg_ctl: Detect current standby state from pg_control
pg_ctl used to determine whether a server was in standby mode by looking
for a recovery.conf file.  With this change, it instead looks into
pg_control, which is potentially more accurate.  There are also
occasional discussions about removing recovery.conf, so this removes one
dependency.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
2016-09-21 12:00:00 -04:00
Peter Eisentraut c91b34bab1 Fix typo
From: Michael Paquier <michael.paquier@gmail.com>
2016-09-21 12:00:00 -04:00
Robert Haas 470d886c32 Use PostmasterRandom(), not random(), for DSM control segment ID.
Otherwise, every startup gets the same "random" value, which is
definitely not what was intended.
2016-09-20 12:26:29 -04:00
Heikki Linnakangas 45310221a9 Fix outdated comments, GIST search queue is not an RBTree anymore.
The GiST search queue is implemented as a pairing heap rather than as
Red-Black Tree, since 9.5 (commit e7032610). I neglected these comments
in that commit.
2016-09-20 11:38:25 +03:00
Tom Lane d8c61c9765 Add debugging aid "bmsToString(Bitmapset *bms)".
This function has no direct callers at present, but it's convenient for
manual use in a debugger, rather than having to inspect memory and do
bit-counting in your head.

In passing, get rid of useless outBitmapset() wrapper around
_outBitmapset(); let's just export the function that does the work.
Likewise for outToken().

Ashutosh Bapat, tweaked a bit by me

Discussion: <CAFjFpRdiht8e1HTVirbubr4YzaON5iZTzFJjq909y4sU8M_6eA@mail.gmail.com>
2016-09-16 09:36:24 -04:00
Heikki Linnakangas 5c6df67e0c Fix building with LibreSSL.
LibreSSL defines OPENSSL_VERSION_NUMBER to claim that it is version 2.0.0,
but it doesn't have the functions added in OpenSSL 1.1.0. Add autoconf
checks for the individual functions we need, and stop relying on
OPENSSL_VERSION_NUMBER.

Backport to 9.5 and 9.6, like the patch that broke this. In the
back-branches, there are still a few OPENSSL_VERSION_NUMBER checks left,
to check for OpenSSL 0.9.8 or 0.9.7. I left them as they were - LibreSSL
has all those functions, so they work as intended.

Per buildfarm member curculio.

Discussion: <2442.1473957669@sss.pgh.pa.us>
2016-09-15 22:52:51 +03:00
Robert Haas 6415ba502b Improve code comment for GatherPath's single_copy flag.
Discussion: 5934.1472642782@sss.pgh.pa.us
2016-09-14 15:43:26 -04:00
Tom Lane 55c3391d1e Be pickier about converting between Name and Datum.
We were misapplying NameGetDatum() to plain C strings in some places.
This worked, because it was just a pointer cast anyway, but it's a type
cheat in some sense.  Use CStringGetDatum instead, and modify the
NameGetDatum macro so it won't compile if applied to something that's
not a pointer to NameData.  This should result in no changes to
generated code, but it is logically cleaner.

Mark Dilger, tweaked a bit by me

Discussion: <EFD8AC94-4C1F-40C1-A5EA-304080089C1B@gmail.com>
2016-09-13 17:17:48 -04:00
Tom Lane a4c35ea1c2 Improve parser's and planner's handling of set-returning functions.
Teach the parser to reject misplaced set-returning functions during parse
analysis using p_expr_kind, in much the same way as we do for aggregates
and window functions (cf commit eaccfded9).  While this isn't complete
(it misses nesting-based restrictions), it's much better than the previous
error reporting for such cases, and it allows elimination of assorted
ad-hoc expression_returns_set() error checks.  We could add nesting checks
later if it seems important to catch all cases at parse time.

There is one case the parser will now throw error for although previous
versions allowed it, which is SRFs in the tlist of an UPDATE.  That never
behaved sensibly (since it's ill-defined which generated row should be
used to perform the update) and it's hard to see why it should not be
treated as an error.  It's a release-note-worthy change though.

Also, add a new Query field hasTargetSRFs reporting whether there are
any SRFs in the targetlist (including GROUP BY/ORDER BY expressions).
The parser can now set that basically for free during parse analysis,
and we can use it in a number of places to avoid expression_returns_set
searches.  (There will be more such checks soon.)  In some places, this
allows decontorting the logic since it's no longer expensive to check for
SRFs in the tlist --- so I made the checks parallel to the handling of
hasAggs/hasWindowFuncs wherever it seemed appropriate.

catversion bump because adding a Query field changes stored rules.

Andres Freund and Tom Lane

Discussion: <24639.1473782855@sss.pgh.pa.us>
2016-09-13 13:54:24 -04:00
Robert Haas 445a38aba2 Have heapam.h include lockdefs.h rather than lock.h.
lockdefs.h was only split from lock.h relatively recently, and
represents a minimal subset of the old lock.h.  heapam.h only needs
that smaller subset, so adjust it to include only that.  This requires
some corresponding adjustments elsewhere.

Peter Geoghegan
2016-09-13 09:21:35 -04:00
Tom Lane f2717c79ee Improve unreachability recognition in elog() macro.
Some experimentation with an older version of gcc showed that it is able
to determine whether "if (elevel_ >= ERROR)" is compile-time constant
if elevel_ is declared "const", but otherwise not so much.  We had
accounted for that in ereport() but were too miserly with braces to
make it so in elog().  I don't know how many currently-interesting
compilers have the same quirk, but in case it will save some code
space, let's make sure that elog() is on the same footing as ereport()
for this purpose.

Back-patch to 9.3 where we introduced pg_unreachable() calls into
elog/ereport.
2016-09-10 17:54:23 -04:00
Tom Lane 24992c6db9 Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
The full generality of deleting an arbitrary number of tuples is no longer
needed, so let's save some code and cycles by replacing the original coding
with an implementation based on PageIndexTupleDelete.

We can always get back the old code from git if we need it again for new
callers (though I don't care for its willingness to mess with line pointers
it wasn't told to mess with).

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 19:00:59 -04:00
Tom Lane 1a4be103a5 Convert PageAddItem into a macro to save a few cycles.
Nowadays this is just a backwards-compatibility wrapper around
PageAddItemExtended, so let's avoid the extra level of function call.
In addition, because pretty much all callers are passing constants
for the two bool arguments, compilers will be able to constant-fold
the conversion to a flags bitmask.

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 18:17:07 -04:00
Tom Lane b1328d78f8 Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.
PageIndexTupleOverwrite performs approximately the same function as
PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem
targeting the same item pointer offset.  But in the case where the new
tuple is the same size as the old, it avoids shuffling other data around on
the page, because the new tuple is placed where the old one was rather than
being appended to the end of the page.  This has been shown to provide a
substantial speedup for some GiST use-cases.

Also, this change allows some API simplifications: we can get rid of
the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for
PageAddItemExtended, since that was used only to cover a corner case
for BRIN that's better expressed by using PageIndexTupleOverwrite.

Note that this patch causes a rather subtle WAL incompatibility: the
physical page content change represented by certain WAL records is now
different than it was before, because while the tuples have the same
itempointer line numbers, the tuples themselves are in different places.
I have not bumped the WAL version number because I think it doesn't matter
unless you are trying to do bitwise comparisons of original and replayed
pages, and in any case we're early in a devel cycle and there will probably
be more WAL changes before v10 gets out the door.

There is probably room to make use of PageIndexTupleOverwrite in SP-GiST
and GIN too, but that is left for a future patch.

Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit
by me

Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>
2016-09-09 18:02:36 -04:00
Andres Freund 45e191e3aa Improve scalability of md.c for large relations.
So far md.c used a linked list of segments. That proved to be a problem
when processing large relations, because every smgr.c/md.c level access
to a page incurred walking through a linked list of all preceding
segments. Thus making accessing pages O(#segments).

Replace the linked list of segments hanging off SMgrRelationData with an
array of opened segments. That allows O(1) access to individual
segments, if they've previously been opened.

Discussion: <20140331101001.GE13135@alap3.anarazel.de>
Reviewed-By: Peter Geoghegan, Tom Lane (in an older version)
2016-09-08 17:18:46 -07:00
Tom Lane 0ab9c56d0f Support renaming an existing value of an enum type.
Not much to be said about this patch: it does what it says on the tin.

In passing, rename AlterEnumStmt.skipIfExists to skipIfNewValExists
to clarify what it actually does.  In the discussion of this patch
we considered supporting other similar options, such as IF EXISTS
on the type as a whole or IF NOT EXISTS on the target name.  This
patch doesn't actually add any such feature, but it might happen later.

Dagfinn Ilmari Mannsåker, reviewed by Emre Hasegeli

Discussion: <CAO=2mx6uvgPaPDf-rHqG8=1MZnGyVDMQeh8zS4euRyyg4D35OQ@mail.gmail.com>
2016-09-07 16:11:56 -04:00
Peter Eisentraut 49eb0fd097 Add location field to DefElem
Add a location field to the DefElem struct, used to parse many utility
commands.  Update various error messages to supply error position
information.

To propogate the error position information in a more systematic way,
create a ParseState in standard_ProcessUtility() and pass that to
interested functions implementing the utility commands.  This seems
better than passing the query string and then reassembling a parse state
ad hoc, which violates the encapsulation of the ParseState type.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2016-09-06 12:00:00 -04:00
Bruce Momjian 67e1e2aaff C comment: fix file name mention on line 1
Author: Amit Langote
2016-09-06 00:03:55 -04:00
Tom Lane c54159d44c Make locale-dependent regex character classes work for large char codes.
Previously, we failed to recognize Unicode characters above U+7FF as
being members of locale-dependent character classes such as [[:alpha:]].
(Actually, the same problem occurs for large pg_wchar values in any
multibyte encoding, but UTF8 is the only case people have actually
complained about.)  It's impractical to get Spencer's original code to
handle character classes or ranges containing many thousands of characters,
because it insists on considering each member character individually at
regex compile time, whether or not the character will ever be of interest
at run time.  To fix, choose a cutoff point MAX_SIMPLE_CHR below which
we process characters individually as before, and deal with entire ranges
or classes as single entities above that.  We can actually make things
cheaper than before for chars below the cutoff, because the color map can
now be a simple linear array for those chars, rather than the multilevel
tree structure Spencer designed.  It's more expensive than before for
chars above the cutoff, because we must do a binary search in a list of
high chars and char ranges used in the regex pattern, plus call iswalpha()
and friends for each locale-dependent character class used in the pattern.
However, multibyte encodings are normally designed to give smaller codes
to popular characters, so that we can expect that the slow path will be
taken relatively infrequently.  In any case, the speed penalty appears
minor except when we have to apply iswalpha() etc. to high character codes
at runtime --- and the previous coding gave wrong answers for those cases,
so whether it was faster is moot.

Tom Lane, reviewed by Heikki Linnakangas

Discussion: <15563.1471913698@sss.pgh.pa.us>
2016-09-05 17:06:29 -04:00
Bruce Momjian f80049f76a C comment: align dashes in GroupState node header
Author: Jim Nasby
2016-09-05 13:09:54 -04:00
Tom Lane 15bc038f9b Relax transactional restrictions on ALTER TYPE ... ADD VALUE.
To prevent possibly breaking indexes on enum columns, we must keep
uncommitted enum values from getting stored in tables, unless we
can be sure that any such column is new in the current transaction.

Formerly, we enforced this by disallowing ALTER TYPE ... ADD VALUE
from being executed at all in a transaction block, unless the target
enum type had been created in the current transaction.  This patch
removes that restriction, and instead insists that an uncommitted enum
value can't be referenced unless it belongs to an enum type created
in the same transaction as the value.  Per discussion, this should be
a bit less onerous.  It does require each function that could possibly
return a new enum value to SQL operations to check this restriction,
but there aren't so many of those that this seems unmaintainable.

Andrew Dunstan and Tom Lane

Discussion: <4075.1459088427@sss.pgh.pa.us>
2016-09-05 12:59:55 -04:00
Simon Riggs 016abf1fb8 Add debug check function LWLockHeldByMeInMode()
Tests whether my process holds a lock in given mode.
Add initial usage in MarkBufferDirty().

Thomas Munro
2016-09-05 10:38:08 +01:00