Commit Graph

1354 Commits

Author SHA1 Message Date
Heikki Linnakangas 8cc7a4c5fd Fix typo in comment.
Jim Nasby
2015-05-18 10:38:52 +03:00
Tom Lane 424661913c Fix failure to copy IndexScan.indexorderbyops in copyfuncs.c.
This oversight results in a crash at executor startup if the plan has
been copied.  outfuncs.c was missed as well.

While we could probably have taught both those files to cope with the
originally chosen representation of an Oid array, it would have been
painful, not least because there'd be no easy way to verify the array
length.  An Oid List is far easier to work with.  And AFAICS, there is
no particular notational benefit to using an array rather than a list
in the existing parts of the patch either.  So just change it to a list.

Error in commit 35fcb1b3d0, which is new,
so no need for back-patch.
2015-05-17 21:22:12 -04:00
Andres Freund f3d3118532 Support GROUPING SETS, CUBE and ROLLUP.
This SQL standard functionality allows to aggregate data by different
GROUP BY clauses at once. Each grouping set returns rows with columns
grouped by in other sets set to NULL.

This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.

The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.

The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.

Instead of, as in an earlier versions of the patch, representing the
chain of sort and aggregation steps as full blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage.  The optimizer still builds Sort/Agg node to describe each
phase, but they're not part of the plan tree, but instead additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting.  The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains) and easily allows to
avoid the sorting step if the underlying data is sorted by other means.

A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, neither has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.

Needs a catversion bump because stored rules may change.

Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
    Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
2015-05-16 03:46:31 +02:00
Simon Riggs f6d208d6e5 TABLESAMPLE, SQL Standard and extensible
Add a TABLESAMPLE clause to SELECT statements that allows
user to specify random BERNOULLI sampling or block level
SYSTEM sampling. Implementation allows for extensible
sampling functions to be written, using a standard API.
Basic version follows SQLStandard exactly. Usable
concrete use cases for the sampling API follow in later
commits.

Petr Jelinek

Reviewed by Michael Paquier and Simon Riggs
2015-05-15 14:37:10 -04:00
Heikki Linnakangas 98edd617f3 Fix datatype confusion with the new lossy GiST distance functions.
We can only support a lossy distance function when the distance function's
datatype is comparable with the original ordering operator's datatype.
The distance function always returns a float8, so we are limited to float8,
and float4 (by a hard-coded cast of the float8 to float4).

In light of this limitation, it seems like a good idea to have a separate
'recheck' flag for the ORDER BY expressions, so that if you have a non-lossy
distance function, it still works with lossy quals. There are cases like
that with the build-in or contrib opclasses, but it's plausible.

There was a hidden assumption that the ORDER BY values returned by GiST
match the original ordering operator's return type, but there are plenty
of examples where that's not true, e.g. in btree_gist and pg_trgm. As long
as the distance function is not lossy, we can tolerate that and just not
return the distance to the executor (or rather, always return NULL). The
executor doesn't need the distances if there are no lossy results.

There was another little bug: the recheck variable was not initialized
before calling the distance function. That revealed the bigger issue,
as the executor tried to reorder tuples that didn't need reordering, and
that failed because of the datatype mismatch.
2015-05-15 18:09:31 +03:00
Heikki Linnakangas 35fcb1b3d0 Allow GiST distance function to return merely a lower-bound.
The distance function can now set *recheck = false, like index quals. The
executor will then re-check the ORDER BY expressions, and use a queue to
reorder the results on the fly.

This makes it possible to do kNN-searches on polygons and circles, which
don't store the exact value in the index, but just a bounding box.

Alexander Korotkov and me
2015-05-15 14:26:51 +03:00
Tom Lane 1dc5ebc907 Support "expanded" objects, particularly arrays, for better performance.
This patch introduces the ability for complex datatypes to have an
in-memory representation that is different from their on-disk format.
On-disk formats are typically optimized for minimal size, and in any case
they can't contain pointers, so they are often not well-suited for
computation.  Now a datatype can invent an "expanded" in-memory format
that is better suited for its operations, and then pass that around among
the C functions that operate on the datatype.  There are also provisions
(rudimentary as yet) to allow an expanded object to be modified in-place
under suitable conditions, so that operations like assignment to an element
of an array need not involve copying the entire array.

The initial application for this feature is arrays, but it is not hard
to foresee using it for other container types like JSON, XML and hstore.
I have hopes that it will be useful to PostGIS as well.

In this initial implementation, a few heuristics have been hard-wired
into plpgsql to improve performance for arrays that are stored in
plpgsql variables.  We would like to generalize those hacks so that
other datatypes can obtain similar improvements, but figuring out some
appropriate APIs is left as a task for future work.  (The heuristics
themselves are probably not optimal yet, either, as they sometimes
force expansion of arrays that would be better left alone.)

Preliminary performance testing shows impressive speed gains for plpgsql
functions that do element-by-element access or update of large arrays.
There are other cases that get a little slower, as a result of added array
format conversions; but we can hope to improve anything that's annoyingly
bad.  In any case most applications should see a net win.

Tom Lane, reviewed by Andres Freund
2015-05-14 12:08:49 -04:00
Robert Haas 78efd5c1ed Extend abbreviated key infrastructure to datum tuplesorts.
Andrew Gierth, reviewed by Peter Geoghegan and by me.
2015-05-13 14:36:26 -04:00
Tom Lane 0bb8528b5c Fix postgres_fdw to return the right ctid value in EvalPlanQual cases.
If a postgres_fdw foreign table is a non-locked source relation in an
UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, and the query selects its
ctid column, the wrong value would be returned if an EvalPlanQual
recheck occurred.  This happened because the foreign table's result row
was copied via the ROW_MARK_COPY code path, and EvalPlanQualFetchRowMarks
just unconditionally set the reconstructed tuple's t_self to "invalid".

To fix that, we can have EvalPlanQualFetchRowMarks copy the composite
datum's t_ctid field, and be sure to initialize that along with t_self
when postgres_fdw constructs a tuple to return.

If we just did that much then EvalPlanQualFetchRowMarks would start
returning "(0,0)" as ctid for all other ROW_MARK_COPY cases, which perhaps
does not matter much, but then again maybe it might.  The cause of that is
that heap_form_tuple, which is the ultimate source of all composite datums,
simply leaves t_ctid as zeroes in newly constructed tuples.  That seems
like a bad idea on general principles: a field that's really not been
initialized shouldn't appear to have a valid value.  So let's eat the
trivial additional overhead of doing "ItemPointerSetInvalid(&(td->t_ctid))"
in heap_form_tuple.

This closes out our handling of Etsuro Fujita's report that tableoid and
ctid weren't correctly set in postgres_fdw EvalPlanQual cases.  Along the
way we did a great deal of work to improve FDWs' ability to control row
locking behavior; which was not wasted effort by any means, but it didn't
end up being a fix for this problem because that feature would be too
expensive for postgres_fdw to use all the time.

Although the fix for the tableoid misbehavior was back-patched, I'm
hesitant to do so here; it seems far less likely that people would care
about remote ctid than tableoid, and even such a minor behavioral change
as this in heap_form_tuple is perhaps best not back-patched.  So commit
to HEAD only, at least for the moment.

Etsuro Fujita, with some adjustments by me
2015-05-13 14:05:29 -04:00
Andres Freund 4af6e61a36 Fix ON CONFLICT bugs that manifest when used in rules.
Specifically the tlist and rti of the pseudo "excluded" relation weren't
properly treated by expression_tree_walker, which lead to errors when
excluded was referenced inside a rule because the varnos where not
properly adjusted.  Similar omissions in OffsetVarNodes and
expression_tree_mutator had less impact, but should obviously be fixed
nonetheless.

A couple tests of for ON CONFLICT UPDATE into INSERT rule bearing
relations have been added.

In passing I updated a couple comments.
2015-05-13 00:13:22 +02:00
Tom Lane afb9249d06 Add support for doing late row locking in FDWs.
Previously, FDWs could only do "early row locking", that is lock a row as
soon as it's fetched, even though local restriction/join conditions might
discard the row later.  This patch adds callbacks that allow FDWs to do
late locking in the same way that it's done for regular tables.

To make use of this feature, an FDW must support the "ctid" column as a
unique row identifier.  Currently, since ctid has to be of type TID,
the feature is of limited use, though in principle it could be used by
postgres_fdw.  We may eventually allow FDWs to specify another data type
for ctid, which would make it possible for more FDWs to use this feature.

This commit does not modify postgres_fdw to use late locking.  We've
tested some prototype code for that, but it's not in committable shape,
and besides it's quite unclear whether it actually makes sense to do late
locking against a remote server.  The extra round trips required are likely
to outweigh any benefit from improved concurrency.

Etsuro Fujita, reviewed by Ashutosh Bapat, and hacked up a lot by me
2015-05-12 14:10:17 -04:00
Tom Lane 1a8a4e5cde Code review for foreign/custom join pushdown patch.
Commit e7cb7ee145 included some design
decisions that seem pretty questionable to me, and there was quite a lot
of stuff not to like about the documentation and comments.  Clean up
as follows:

* Consider foreign joins only between foreign tables on the same server,
rather than between any two foreign tables with the same underlying FDW
handler function.  In most if not all cases, the FDW would simply have had
to apply the same-server restriction itself (far more expensively, both for
lack of caching and because it would be repeated for each combination of
input sub-joins), or else risk nasty bugs.  Anyone who's really intent on
doing something outside this restriction can always use the
set_join_pathlist_hook.

* Rename fdw_ps_tlist/custom_ps_tlist to fdw_scan_tlist/custom_scan_tlist
to better reflect what they're for, and allow these custom scan tlists
to be used even for base relations.

* Change make_foreignscan() API to include passing the fdw_scan_tlist
value, since the FDW is required to set that.  Backwards compatibility
doesn't seem like an adequate reason to expect FDWs to set it in some
ad-hoc extra step, and anyway existing FDWs can just pass NIL.

* Change the API of path-generating subroutines of add_paths_to_joinrel,
and in particular that of GetForeignJoinPaths and set_join_pathlist_hook,
so that various less-used parameters are passed in a struct rather than
as separate parameter-list entries.  The objective here is to reduce the
probability that future additions to those parameter lists will result in
source-level API breaks for users of these hooks.  It's possible that this
is even a small win for the core code, since most CPU architectures can't
pass more than half a dozen parameters efficiently anyway.  I kept root,
joinrel, outerrel, innerrel, and jointype as separate parameters to reduce
code churn in joinpath.c --- in particular, putting jointype into the
struct would have been problematic because of the subroutines' habit of
changing their local copies of that variable.

* Avoid ad-hocery in ExecAssignScanProjectionInfo.  It was probably all
right for it to know about IndexOnlyScan, but if the list is to grow
we should refactor the knowledge out to the callers.

* Restore nodeForeignscan.c's previous use of the relcache to avoid
extra GetFdwRoutine lookups for base-relation scans.

* Lots of cleanup of documentation and missed comments.  Re-order some
code additions into more logical places.
2015-05-10 14:36:36 -04:00
Andres Freund 168d5805e4 Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint.  DO NOTHING avoids the
constraint violation, without touching the pre-existing row.  DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed.  The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.

This feature is often referred to as upsert.

This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert.  If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made.  If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.

To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.

Bumps catversion as stored rules change.

Author: Peter Geoghegan, with significant contributions from Heikki
    Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
    Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:43:10 +02:00
Andres Freund 2c8f4836db Represent columns requiring insert and update privileges indentently.
Previously, relation range table entries used a single Bitmapset field
representing which columns required either UPDATE or INSERT privileges,
despite the fact that INSERT and UPDATE privileges are separately
cataloged, and may be independently held.  As statements so far required
either insert or update privileges but never both, that was
sufficient. The required permission could be inferred from the top level
statement run.

The upcoming INSERT ... ON CONFLICT UPDATE feature needs to
independently check for both privileges in one statement though, so that
is not sufficient anymore.

Bumps catversion as stored rules change.

Author: Peter Geoghegan
Reviewed-By: Andres Freund
2015-05-08 00:20:46 +02:00
Robert Haas 40f42d2a34 Use outerPlanState macro instead of referring to leffttree.
This makes the executor code more consistent.  It also removes
an apparently superfluous NULL test in nodeGroup.c.

Qingqing Zhou, reviewed by Tom Lane, and further revised by me.
2015-05-04 16:17:36 -04:00
Robert Haas e7cb7ee145 Allow FDWs and custom scan providers to replace joins with scans.
Foreign data wrappers can use this capability for so-called "join
pushdown"; that is, instead of executing two separate foreign scans
and then joining the results locally, they can generate a path which
performs the join on the remote server and then is scanned locally.
This commit does not extend postgres_fdw to take advantage of this
capability; it just provides the infrastructure.

Custom scan providers can use this in a similar way.  Previously,
it was only possible for a custom scan provider to scan a single
relation.  Now, it can scan an entire join tree, provided of course
that it knows how to produce the same results that the join would
have produced if executed normally.

KaiGai Kohei, reviewed by Shigeru Hanada, Ashutosh Bapat, and me.
2015-05-01 08:50:35 -04:00
Robert Haas 924bcf4f16 Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things.  First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers.  Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes.  Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.

Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 15:02:14 -04:00
Andres Freund 6aab1f45ac Fix various typos and grammar errors in comments.
Author: Dmitriy Olshevskiy
Discussion: 553D00A6.4090205@bk.ru
2015-04-26 18:42:31 +02:00
Stephen Frost e89bd02f58 Perform RLS WITH CHECK before constraints, etc
The RLS capability is built on top of the WITH CHECK OPTION
system which was added for auto-updatable views, however, unlike
WCOs on views (which are mandated by the SQL spec to not fire until
after all other constraints and checks are done), it makes much more
sense for RLS checks to happen earlier than constraint and uniqueness
checks.

This patch reworks the structure which holds the WCOs a bit to be
explicitly either VIEW or RLS checks and the RLS-related checks are
done prior to the constraint and uniqueness checks.  This also allows
better error reporting as we are now reporting when a violation is due
to a WITH CHECK OPTION and when it's due to an RLS policy violation,
which was independently noted by Craig Ringer as being confusing.

The documentation is also updated to include a paragraph about when RLS
WITH CHECK handling is performed, as there have been a number of
questions regarding that and the documentation was previously silent on
the matter.

Author: Dean Rasheed, with some kabitzing and comment changes by me.
2015-04-24 20:34:26 -04:00
Heikki Linnakangas 61a553a091 Add comments explaining how unique and exclusion constraints are enforced. 2015-04-24 21:13:28 +03:00
Heikki Linnakangas 62420ae7d6 Move functions related to index maintenance to separate source file.
There is enough code here to deserve a file of their own, not be buried
in the middle of execUtils.c.
2015-04-24 09:33:23 +03:00
Tom Lane feeb526cfe Fix ExecOpenScanRelation to take a lock on a ROW_MARK_COPY relation.
ExecOpenScanRelation assumed that any relation listed in the ExecRowMark
list has been locked by InitPlan; but this is not true if the rel's
markType is ROW_MARK_COPY, which is possible if it's a foreign table.

In most (possibly all) cases, failure to acquire a lock here isn't really
problematic because the parser, planner, or plancache would have taken the
appropriate lock already.  In principle though it might leave us vulnerable
to working with a relation that we hold no lock on, and in any case if the
executor isn't depending on previously-taken locks otherwise then it should
not do so for ROW_MARK_COPY relations.

Noted by Etsuro Fujita.  Back-patch to all active versions, since the
inconsistency has been there a long time.  (It's almost certainly
irrelevant in 9.0, since that predates foreign tables, but the code's
still wrong on its own terms.)
2015-03-24 15:53:06 -04:00
Tom Lane cb1ca4d800 Allow foreign tables to participate in inheritance.
Foreign tables can now be inheritance children, or parents.  Much of the
system was already ready for this, but we had to fix a few things of
course, mostly in the area of planner and executor handling of row locks.

As side effects of this, allow foreign tables to have NOT VALID CHECK
constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to
accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS.  Continuing to
disallow these things would've required bizarre and inconsistent special
cases in inheritance behavior.  Since foreign tables don't enforce CHECK
constraints anyway, a NOT VALID one is a complete no-op, but that doesn't
mean we shouldn't allow it.  And it's possible that some FDWs might have
use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops
for most.

An additional change in support of this is that when a ModifyTable node
has multiple target tables, they will all now be explicitly identified
in EXPLAIN output, for example:

 Update on pt1  (cost=0.00..321.05 rows=3541 width=46)
   Update on pt1
   Foreign Update on ft1
   Foreign Update on ft2
   Update on child3
   ->  Seq Scan on pt1  (cost=0.00..0.00 rows=1 width=46)
   ->  Foreign Scan on ft1  (cost=100.00..148.03 rows=1170 width=46)
   ->  Foreign Scan on ft2  (cost=100.00..148.03 rows=1170 width=46)
   ->  Seq Scan on child3  (cost=0.00..25.00 rows=1200 width=46)

This was done mainly to provide an unambiguous place to attach "Remote SQL"
fields, but it is useful for inherited updates even when no foreign tables
are involved.

Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro
Horiguchi, some additional hacking by me
2015-03-22 13:53:21 -04:00
Tom Lane 443fd0540e Ensure tableoid reads correctly in EvalPlanQual-manufactured tuples.
The ROW_MARK_COPY path in EvalPlanQualFetchRowMarks() was just setting
tableoid to InvalidOid, I think on the assumption that the referenced
RTE must be a subquery or other case without a meaningful OID.  However,
foreign tables also use this code path, and they do have meaningful
table OIDs; so failure to set the tuple field can lead to user-visible
misbehavior.  Fix that by fetching the appropriate OID from the range
table.

There's still an issue about whether CTID can ever have a meaningful
value in this case; at least with postgres_fdw foreign tables, it does.
But that is a different problem that seems to require a significantly
different patch --- it's debatable whether postgres_fdw really wants to
use this code path at all.

Simplified version of a patch by Etsuro Fujita, who also noted the
problem to begin with.  The issue can be demonstrated in all versions
having FDWs, so back-patch to 9.1.
2015-03-12 13:39:09 -04:00
Tom Lane 8abb3cda0d Use the typcache to cache constraints for domain types.
Previously, we cached domain constraints for the life of a query, or
really for the life of the FmgrInfo struct that was used to invoke
domain_in() or domain_check().  But plpgsql (and probably other places)
are set up to cache such FmgrInfos for the whole lifespan of a session,
which meant they could be enforcing really stale sets of constraints.
On the other hand, searching pg_constraint once per query gets kind of
expensive too: testing says that as much as half the runtime of a
trivial query such as "SELECT 0::domaintype" went into that.

To fix this, delegate the responsibility for tracking a domain's
constraints to the typcache, which has the infrastructure needed to
detect syscache invalidation events that signal possible changes.
This not only removes unnecessary repeat reads of pg_constraint,
but ensures that we never apply stale constraint data: whatever we
use is the current data according to syscache rules.

Unfortunately, the current configuration of the system catalogs means
we have to flush cached domain-constraint data whenever either pg_type
or pg_constraint changes, which happens rather a lot (eg, creation or
deletion of a temp table will do it).  It might be worth rearranging
things to split pg_constraint into two catalogs, of which the domain
constraint one would probably be very low-traffic.  That's a job for
another patch though, and in any case this patch should improve matters
materially even with that handicap.

This patch makes use of the recently-added memory context reset callback
feature to manage the lifespan of domain constraint caches, so that we
don't risk deleting a cache that might be in the midst of evaluation.

Although this is a bug fix as well as a performance improvement, no
back-patch.  There haven't been many if any field complaints about
stale domain constraint checks, so it doesn't seem worth taking the
risk of modifying data structures as basic as MemoryContexts in back
branches.
2015-03-01 14:06:55 -05:00
Jeff Davis b419865a81 In array_agg(), don't create a new context for every group.
Previously, each new array created a new memory context that started
out at 8kB. This is incredibly wasteful when there are lots of small
groups of just a few elements each.

Change initArrayResult() and friends to accept a "subcontext" argument
to indicate whether the caller wants the ArrayBuildState allocated in
a new subcontext or not. If not, it can no longer be released
separately from the rest of the memory context.

Fixes bug report by Frank van Vugt on 2013-10-19.

Tomas Vondra. Reviewed by Ali Akbar, Tom Lane, and me.
2015-02-21 17:24:48 -08:00
Tom Lane e1a11d9311 Use FLEXIBLE_ARRAY_MEMBER for HeapTupleHeaderData.t_bits[].
This requires changing quite a few places that were depending on
sizeof(HeapTupleHeaderData), but it seems for the best.

Michael Paquier, some adjustments by me
2015-02-21 15:13:06 -05:00
Tom Lane 33a3b03d63 Use FLEXIBLE_ARRAY_MEMBER in some more places.
Fix a batch of structs that are only visible within individual .c files.

Michael Paquier
2015-02-20 17:32:01 -05:00
Tom Lane 09d8d110a6 Use FLEXIBLE_ARRAY_MEMBER in a bunch more places.
Replace some bogus "x[1]" declarations with "x[FLEXIBLE_ARRAY_MEMBER]".
Aside from being more self-documenting, this should help prevent bogus
warnings from static code analyzers and perhaps compiler misoptimizations.

This patch is just a down payment on eliminating the whole problem, but
it gets rid of a lot of easy-to-fix cases.

Note that the main problem with doing this is that one must no longer rely
on computing sizeof(the containing struct), since the result would be
compiler-dependent.  Instead use offsetof(struct, lastfield).  Autoconf
also warns against spelling that offsetof(struct, lastfield[0]).

Michael Paquier, review and additional fixes by me.
2015-02-20 00:11:42 -05:00
Tom Lane e983c4d1aa Rationalize the APIs of array element/slice access functions.
The four functions array_ref, array_set, array_get_slice, array_set_slice
have traditionally declared their array inputs and results as being of type
"ArrayType *".  This is a lie, and has been since Berkeley days, because
they actually also support "fixed-length array" types such as "name" and
"point"; not to mention that the inputs could be toasted.  These values
should be declared Datum instead to avoid confusion.  The current coding
already risks possible misoptimization by compilers, and it'll get worse
when "expanded" array representations become a valid alternative.

However, there's a fair amount of code using array_ref and array_set with
arrays that *are* known to be ArrayType structures, and there might be more
such places in third-party code.  Rather than cluttering those call sites
with PointerGetDatum/DatumGetArrayTypeP cruft, what I did was to rename the
existing functions to array_get_element/array_set_element, fix their
signatures, then reincarnate array_ref/array_set as backwards compatibility
wrappers.

array_get_slice/array_set_slice have no such constituency in the core code,
and probably not in third-party code either, so I just changed their APIs.
2015-02-16 12:23:58 -05:00
Tom Lane 08361cea2b Fix null-pointer-deref crash while doing COPY IN with check constraints.
In commit bf7ca15875 I introduced an
assumption that an RTE referenced by a whole-row Var must have a valid eref
field.  This is false for RTEs constructed by DoCopy, and there are other
places taking similar shortcuts.  Perhaps we should make all those places
go through addRangeTableEntryForRelation or its siblings instead of having
ad-hoc logic, but the most reliable fix seems to be to make the new code in
ExecEvalWholeRowVar cope if there's no eref.  We can reasonably assume that
there's no need to insert column aliases if no aliases were provided.

Add a regression test case covering this, and also verifying that a sane
column name is in fact available in this situation.

Although the known case only crashes in 9.4 and HEAD, it seems prudent to
back-patch the code change to 9.2, since all the ingredients for a similar
failure exist in the variant patch applied to 9.3 and 9.2.

Per report from Jean-Pierre Pelletier.
2015-02-15 23:26:45 -05:00
Heikki Linnakangas 57fe246890 Fix reference-after-free when waiting for another xact due to constraint.
If an insertion or update had to wait for another transaction to finish,
because there was another insertion with conflicting key in progress,
we would pass a just-free'd item pointer to XactLockTableWait().

All calls to XactLockTableWait() and MultiXactIdWait() had similar issues.
Some passed a pointer to a buffer in the buffer cache, after already
releasing the lock. The call in EvalPlanQualFetch had already released the
pin too. All but the call in execUtils.c would merely lead to reporting a
bogus ctid, however (or an assertion failure, if enabled).

All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus
anyway: if the tuple was updated (again) in the same transaction, its ctid
field would point to the next tuple in the chain, not the tuple itself.

Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added
(in commit f88d4cfc)
2015-02-04 16:00:34 +02:00
Stephen Frost 804b6b6db4 Fix column-privilege leak in error-message paths
While building error messages to return to the user,
BuildIndexValueDescription, ExecBuildSlotValueDescription and
ri_ReportViolation would happily include the entire key or entire row in
the result returned to the user, even if the user didn't have access to
view all of the columns being included.

Instead, include only those columns which the user is providing or which
the user has select rights on.  If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription
and ExecBuildSlotValueDescription.  Note that, for key cases, the user
must have access to all of the columns for the key to be shown; a
partial key will not be returned.

Further, in master only, do not return any data for cases where row
security is enabled on the relation and row security should be applied
for the user.  This required a bit of refactoring and moving of things
around related to RLS- note the addition of utils/misc/rls.c.

Back-patch all the way, as column-level privileges are now in all
supported versions.

This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
2015-01-28 12:31:30 -05:00
Robert Haas 4ea51cdfe8 Use abbreviated keys for faster sorting of text datums.
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficent
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.

There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.

Peter Geoghegan, reviewed by me.
2015-01-19 15:28:27 -05:00
Robert Haas 1605291b6c Typo fix.
Etsuro Fujita
2015-01-19 11:36:48 -05:00
Tom Lane c480cb9d24 Fix use-of-already-freed-memory problem in EvalPlanQual processing.
Up to now, the "child" executor state trees generated for EvalPlanQual
rechecks have simply shared the ResultRelInfo arrays used for the original
execution tree.  However, this leads to dangling-pointer problems, because
ExecInitModifyTable() is all too willing to scribble on some fields of the
ResultRelInfo(s) even when it's being run in one of those child trees.
This trashes those fields from the perspective of the parent tree, because
even if the generated subtree is logically identical to what was in use in
the parent, it's in a memory context that will go away when we're done
with the child state tree.

We do however want to share information in the direction from the parent
down to the children; in particular, fields such as es_instrument *must*
be shared or we'll lose the stats arising from execution of the children.
So the simplest fix is to make a copy of the parent's ResultRelInfo array,
but not copy any fields back at end of child execution.

Per report from Manuel Kniep.  The added isolation test is based on his
example.  In an unpatched memory-clobber-enabled build it will reliably
fail with "ctid is NULL" errors in all branches back to 9.1, as a
consequence of junkfilter->jf_junkAttNo being overwritten with $7f7f.
This test cannot be run as-is before that for lack of WITH syntax; but
I have no doubt that some variant of this problem can arise in older
branches, so apply the code change all the way back.
2015-01-15 18:52:58 -05:00
Stephen Frost c4fda14845 Fix typo in execMain.c
Wee -> We.

Pointed out by Etsuro Fujita.
2015-01-09 11:07:35 -05:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Tom Lane 2db576ba8c Fix corner case where SELECT FOR UPDATE could return a row twice.
In READ COMMITTED mode, if a SELECT FOR UPDATE discovers it has to redo
WHERE-clause checking on rows that have been updated since the SELECT's
snapshot, it invokes EvalPlanQual processing to do that.  If this first
occurs within a non-first child table of an inheritance tree, the previous
coding could accidentally re-return a matching row from an earlier,
already-scanned child table.  (And, to add insult to injury, I think this
could make it miss returning a row that should have been returned, if the
updated row that this happens on should still have passed the WHERE qual.)
Per report from Kyotaro Horiguchi; the added isolation test is based on his
test case.

This has been broken for quite awhile, so back-patch to all supported
branches.
2014-12-11 19:37:36 -05:00
Tom Lane f4e031c662 Add bms_next_member(), and use it where appropriate.
This patch adds a way of iterating through the members of a bitmapset
nondestructively, unlike the old way with bms_first_member().  While
bms_next_member() is very slightly slower than bms_first_member()
(at least for typical-size bitmapsets), eliminating the need to palloc
and pfree a temporary copy of the target bitmapset is a significant win.
So this method should be preferred in all cases where a temporary copy
would be necessary.

Tom Lane, with suggestions from Dean Rasheed and David Rowley
2014-11-28 13:37:25 -05:00
Stephen Frost 143b39c185 Rename pg_rowsecurity -> pg_policy and other fixes
As pointed out by Robert, we should really have named pg_rowsecurity
pg_policy, as the objects stored in that catalog are policies.  This
patch fixes that and updates the column names to start with 'pol' to
match the new catalog name.

The security consideration for COPY with row level security, also
pointed out by Robert, has also been addressed by remembering and
re-checking the OID of the relation initially referenced during COPY
processing, to make sure it hasn't changed under us by the time we
finish planning out the query which has been built.

Robert and Alvaro also commented on missing OCLASS and OBJECT entries
for POLICY (formerly ROWSECURITY or POLICY, depending) in various
places.  This patch fixes that too, which also happens to add the
ability to COMMENT on policies.

In passing, attempt to improve the consistency of messages, comments,
and documentation as well.  This removes various incarnations of
'row-security', 'row-level security', 'Row-security', etc, in favor
of 'policy', 'row level security' or 'row_security' as appropriate.

Happy Thanksgiving!
2014-11-27 01:15:57 -05:00
Tom Lane bac27394a1 Support arrays as input to array_agg() and ARRAY(SELECT ...).
These cases formerly failed with errors about "could not find array type
for data type".  Now they yield arrays of the same element type and one
higher dimension.

The implementation involves creating functions with API similar to the
existing accumArrayResult() family.  I (tgl) also extended the base family
by adding an initArrayResult() function, which allows callers to avoid
special-casing the zero-inputs case if they just want an empty array as
result.  (Not all do, so the previous calling convention remains valid.)
This allowed simplifying some existing code in xml.c and plperl.c.

Ali Akbar, reviewed by Pavel Stehule, significantly modified by me
2014-11-25 12:21:28 -05:00
Tom Lane adbfab119b Remove dead code supporting mark/restore in SeqScan, TidScan, ValuesScan.
There seems no prospect that any of this will ever be useful, and indeed
it's questionable whether some of it would work if it ever got called;
it's certainly not been exercised in a very long time, if ever. So let's
get rid of it, and make the comments about mark/restore in execAmi.c less
wishy-washy.

The mark/restore support for Result nodes is also currently dead code,
but that's due to planner limitations not because it's impossible that
it could be useful.  So I left it in.
2014-11-20 20:20:54 -05:00
Tom Lane a34fa8ee7c Initial code review for CustomScan patch.
Get rid of the pernicious entanglement between planner and executor headers
introduced by commit 0b03e5951b.

Also, rearrange the CustomFoo struct/typedef definitions so that all the
typedef names are seen as used by the compiler.  Without this pgindent
will mess things up a bit, which is not so important perhaps, but it also
removes a bizarre discrepancy between the declaration arrangement used for
CustomExecMethods and that used for CustomScanMethods and
CustomPathMethods.

Clean up the commentary around ExecSupportsMarkRestore to reflect the
rather large change in its API.

Const-ify register_custom_path_provider's argument.  This necessitates
casting away const in the function, but that seems better than forcing
callers of the function to do so (or else not const-ify their method
pointer structs, which was sort of the whole point).

De-export fix_expr_common.  I don't like the exporting of fix_scan_expr
or replace_nestloop_params either, but this one surely has got little
excuse.
2014-11-20 18:36:07 -05:00
Tom Lane 081a6048cf Fix another oversight in CustomScan patch.
execCurrent.c's search_plan_tree() must recognize a CustomScan on the
target relation.  This would only be helpful for custom providers that
support CurrentOfExpr quals, which is probably a bit far-fetched, but
it's not impossible I think.  But even without assuming that, we need
to recognize a scanned-relation match so that we will properly throw
error if the desired relation is being scanned with both a CustomScan
and a regular scan (ie, self-join).

Also recognize ForeignScanState for similar reasons.  Supporting WHERE
CURRENT OF on a foreign table is probably even more far-fetched than
it is for custom scans, but I think in principle you could do it with
postgres_fdw (or another FDW that supports the ctid column).  This
would be a back-patchable bug fix if existing FDWs handled CurrentOfExpr,
but I doubt any do so I won't bother back-patching.
2014-11-20 15:56:39 -05:00
Tom Lane 677708032c Explicitly support the case that a plancache's raw_parse_tree is NULL.
This only happens if a client issues a Parse message with an empty query
string, which is a bit odd; but since it is explicitly called out as legal
by our FE/BE protocol spec, we'd probably better continue to allow it.

Fix by adding tests everywhere that the raw_parse_tree field is passed to
functions that don't or shouldn't accept NULL.  Also make it clear in the
relevant comments that NULL is an expected case.

This reverts commits a73c9dbab0 and
2e9650cbcf, which fixed specific crash
symptoms by hacking things at what now seems to be the wrong end, ie the
callee functions.  Making the callees allow NULL is superficially more
robust, but it's not always true that there is a defensible thing for the
callee to do in such cases.  The caller has more context and is better
able to decide what the empty-query case ought to do.

Per followup discussion of bug #11335.  Back-patch to 9.2.  The code
before that is sufficiently different that it would require development
of a separate patch, which doesn't seem worthwhile for what is believed
to be an essentially cosmetic change.
2014-11-12 15:59:01 -05:00
Tom Lane bf7ca15875 Ensure that RowExprs and whole-row Vars produce the expected column names.
At one time it wasn't terribly important what column names were associated
with the fields of a composite Datum, but since the introduction of
operations like row_to_json(), it's important that looking up the rowtype
ID embedded in the Datum returns the column names that users would expect.
That did not work terribly well before this patch: you could get the column
names of the underlying table, or column aliases from any level of the
query, depending on minor details of the plan tree.  You could even get
totally empty field names, which is disastrous for cases like row_to_json().

To fix this for whole-row Vars, look to the RTE referenced by the Var, and
make sure its column aliases are applied to the rowtype associated with
the result Datums.  This is a tad scary because we might have to return
a transient RECORD type even though the Var is declared as having some
named rowtype.  In principle it should be all right because the record
type will still be physically compatible with the named rowtype; but
I had to weaken one Assert in ExecEvalConvertRowtype, and there might be
third-party code containing similar assumptions.

Similarly, RowExprs have to be willing to override the column names coming
from a named composite result type and produce a RECORD when the column
aliases visible at the site of the RowExpr differ from the underlying
table's column names.

In passing, revert the decision made in commit 398f70ec07 to add
an alias-list argument to ExecTypeFromExprList: better to provide that
functionality in a separate function.  This also reverts most of the code
changes in d685814835, which we don't need because we're no longer
depending on the tupdesc found in the child plan node's result slot to be
blessed.

Back-patch to 9.4, but not earlier, since this solution changes the results
in some cases that users might not have realized were buggy.  We'll apply a
more restricted form of this patch in older branches.
2014-11-10 15:21:09 -05:00
Robert Haas 0b03e5951b Introduce custom path and scan providers.
This allows extension modules to define their own methods for
scanning a relation, and get the core code to use them.  It's
unclear as yet how much use this capability will find, but we
won't find out if we never commit it.

KaiGai Kohei, reviewed at various times and in various levels
of detail by Shigeru Hanada, Tom Lane, Andres Freund, Álvaro
Herrera, and myself.
2014-11-07 17:34:36 -05:00
Bruce Momjian fe65280fea C comments: adjust execTuples.c for new structure
Report by Peter Geoghegan
2014-10-13 16:54:38 -04:00
Kevin Grittner 30d7ae3c76 Increase number of hash join buckets for underestimate.
If we expect batching at the very beginning, we size nbuckets for
"full work_mem" (see how many tuples we can get into work_mem,
while not breaking NTUP_PER_BUCKET threshold).

If we expect to be fine without batching, we start with the 'right'
nbuckets and track the optimal nbuckets as we go (without actually
resizing the hash table). Once we hit work_mem (considering the
optimal nbuckets value), we keep the value.

At the end of the first batch, we check whether (nbuckets !=
nbuckets_optimal) and resize the hash table if needed. Also, we
keep this value for all batches (it's OK because it assumes full
work_mem, and it makes the batchno evaluation trivial). So the
resize happens only once.

There could be cases where it would improve performance to allow
the NTUP_PER_BUCKET threshold to be exceeded to keep everything in
one batch rather than spilling to a second batch, but attempts to
generate such a case have so far been unsuccessful; that issue may
be addressed with a follow-on patch after further investigation.

Tomas Vondra with minor format and comment cleanup by me
Reviewed by Robert Haas, Heikki Linnakangas, and Kevin Grittner
2014-10-13 10:16:36 -05:00