a physical tuple in do_tup_output(). A virtual tuple is easier to set up
and also easier for most tuple receivers to process. Per my comment on
Robert Haas' recent patch in this code.
of individually pfree'ing pass-by-reference transition values. This should
be at least as fast as the prior coding, and it has the major advantage of
clearing out any working data an aggregate function may have stored in or
underneath the aggcontext. This avoids memory leakage when an aggregate
such as array_agg() is used in GROUP BY mode. Per report from Chris Spotts.
Back-patch to 8.4. In principle the problem could arise in prior versions,
but since they didn't have array_agg the issue seems not critical.
so it doesn't go through BuildTupleFromCStrings. This is more or less a
wash for current uses, but will avoid inefficiency for planned changes to
EXPLAIN.
Robert Haas
memory leakage in error recovery. We were calling FreeExprContext, and
therefore invoking ExprContextCallback callbacks, in both normal and error
exits from subtransactions. However this isn't very safe, as shown in
recent trouble report from Frank van Vugt, in which releasing a tupledesc
refcount failed. It's also unnecessary, since the resources that callbacks
might wish to release should be cleaned up by other error recovery mechanisms
(ie the resource owners). We only really want FreeExprContext to release
memory attached to the exprcontext in the error-exit case. So, add a bool
parameter to FreeExprContext to tell it not to call the callbacks.
A more general solution would be to pass the isCommit bool parameter on to
the callbacks, so they could do only safe things during error exit. But
that would make the patch significantly more invasive and possibly break
third-party code that registers ExprContextCallback callbacks. We might want
to do that later in HEAD, but for now I'll just do what seems reasonable to
back-patch.
ArrayBuildState, per trouble report from Merlin Moncure. By adopting
this fix, we are essentially deciding that aggregate final-functions
should not modify their inputs ever. Adjust documentation and comments
to match that conclusion.
aggregated tuple of a run. Per report from Laurenz Albe. This is a new
bug in 8.4, but only because prior versions rejected SRFs in an Agg plan
node altogether.
function returning setof record. This used to work, more or less
accidentally, but I had broken it while extending the code to allow
materialize-mode functions to be called in select lists. Add a regression
test case so it doesn't get broken again. Per gripe from Greg Davidson.
by extending the ereport() API to cater for pluralization directly. This
is better than the original method of calling ngettext outside the elog.c
code because (1) it avoids double translation, which wastes cycles and in
the worst case could give a wrong result; and (2) it avoids having to use
a different coding method in PL code than in the core backend. The
client-side uses of ngettext are not touched since neither of these concerns
is very pressing in the client environment. Per my proposal of yesterday.
a toast table to be built, even if the sum-of-column-widths calculation
indicates one isn't needed. This is needed by pg_migrator because if the
old table has a toast table, we have to migrate over the toast table since
it might contain some live data, even though subsequent column drops could
mean that no recently-added rows could require toasting.
of discovery, rather than reverse order. This doesn't matter functionally
(I suppose the previous coding dates from the time when lcons was markedly
cheaper than lappend). However now that EXPLAIN is labeling subplans with
IDs that are based on order of creation, this may help produce a slightly
less surprising printout.
for simple Var targetlist entries all the time, even when there are other
entries that are not simple Vars. Also, ensure that we prefetch attributes
(with slot_getsomeattrs) for all Vars in the targetlist, even those buried
within expressions. In combination these changes seem to significantly
reduce the runtime for cases where tlists are mostly but not exclusively
Vars. Per my proposal of yesterday.
TupleTableSlots. We have functions for retrieving a minimal tuple from a slot
after storing a regular tuple in it, or vice versa; but these were implemented
by converting the internal storage from one format to the other. The problem
with that is it invalidates any pass-by-reference Datums that were already
fetched from the slot, since they'll be pointing into the just-freed version
of the tuple. The known problem cases involve fetching both a whole-row
variable and a pass-by-reference value from a slot that is fed from a
tuplestore or tuplesort object. The added regression tests illustrate some
simple cases, but there may be other failure scenarios traceable to the same
bug. Note that the added tests probably only fail on unpatched code if it's
built with --enable-cassert; otherwise the bug leads to fetching from freed
memory, which will not have been overwritten without additional conditions.
Fix by allowing a slot to contain both formats simultaneously; which turns out
not to complicate the logic much at all, if anything it seems less contorted
than before.
Back-patch to 8.2, where minimal tuples were introduced.
mode while callers hold pointers to in-memory tuples. I reported this for
the case of nodeWindowAgg's primary scan tuple, but inspection of the code
shows that all of the calls in nodeWindowAgg and nodeCtescan are at risk.
For the moment, fix it with a rather brute-force approach of copying
whenever one of the at-risk callers requests a tuple. Later we might
think of some sort of reference-count approach to reduce tuple copying.
In the backend, I changed only a handful of exemplary or important-looking
instances to make use of the plural support; there is probably more work
there. For the rest of the source, this should cover all relevant cases.
distribution, by creating a special fast path for the (first few) most common
values of the outer relation. Tuples having hashvalues matching the MCVs
are effectively forced to be in the first batch, so that we never write
them out to the batch temp files.
Bryce Cutt and Ramon Lawrence, with some editorialization by me.
from the source table. This could never happen anyway before 8.4 because
the executor invariably applied a "junk filter" to rows due to be inserted;
but now that we skip doing that when it's not necessary, the case can occur.
Problem noted 2008-11-27 by KaiGai Kohei, though I misunderstood what he
was on about at the time (the opacity of the patch he proposed didn't help).
qualifier, and add support for this in pg_dump.
This allows TOAST tables to have user-defined fillfactor, and will also
enable us to move the autovacuum parameters to reloptions without taking
away the possibility of setting values for TOAST tables.
case that the command is rewritten into another type of command. The old
behavior to return the command tag of the last executed command was
pretty surprising. In PL/pgSQL, for example, it meant that if a command
was rewritten to a utility statement, FOUND wasn't set at all.
the same page we are nanoseconds away from reading for real. There should be
something left to do on the current page before we consider issuing a prefetch.
GUC variable effective_io_concurrency controls how many concurrent block
prefetch requests will be issued.
(The best way to handle this for plain index scans is still under debate,
so that part is not applied yet --- tgl)
Greg Stark
bitmap. This is extracted from Greg Stark's posix_fadvise patch; it seems
worth committing separately, since it's potentially useful independently of
posix_fadvise.
that are set up for execution with ExecPrepareExpr rather than going through
the full planner process. By introducing an explicit notion of "expression
planning", this patch also lays a bit of groundwork for maybe someday
allowing sub-selects in standalone expressions.
OutputFunctionCall, and friends. This allows SPI-using functions to invoke
datatype I/O without concern for the possibility that a SPI-using function
will be called (which could be either the I/O function itself, or a function
used in a domain check constraint). It's a tad ugly, but not nearly as ugly
as what'd be needed to make this work via retail insertion of push/pop
operations in all the PLs.
This reverts my patch of 2007-01-30 that inserted some retail SPI_push/pop
calls into plpgsql; that approach only fixed plpgsql, and not any other PLs.
But the other PLs have the issue too, as illustrated by a recent gripe from
Christian Schröder.
Back-patch to 8.2, which is as far back as this solution will work. It's
also as far back as we need to worry about the domain-constraint case, since
earlier versions did not attempt to check domain constraints within datatype
input. I'm not aware of any old I/O functions that use SPI themselves, so
this should be sufficient for a back-patch.
not include postgres.h nor anything else it doesn't directly need. Add
#includes to calling files as needed to compensate. Per my proposal of
yesterday.
This should be noted as a source code change in the 8.4 release notes,
since it's likely to require changes in add-on modules.
practically free given prior 8.4 changes in plancache and portal management,
and it makes it a lot easier for ExecutorStart/Run/End hooks to get at the
query text. Extracted from Itagaki Takahiro's pg_stat_statements patch,
with minor editorialization.
patch. This includes the ability to force the frame to cover the whole
partition, and the ability to make the frame end exactly on the current row
rather than its last ORDER BY peer. Supporting any more of the full SQL
frame-clause syntax will require nontrivial hacking on the window aggregate
code, so it'll have to wait for 8.5 or beyond.
upcoming window-functions patch. First, tuplestore_trim is now an
exported function that must be explicitly invoked by callers at
appropriate times, rather than something that tuplestore tries to do
behind the scenes. Second, a read pointer that is marked as allowing
backward scan no longer prevents truncation. This means that a read pointer
marked as having BACKWARD but not REWIND capability can only safely read
backwards as far as the oldest other read pointer. (The expected use pattern
for this involves having another read pointer that serves as the truncation
fencepost.)
materialize-mode set results. Since it now uses the ReturnSetInfo node
to hold internal state, we need to be sure to set up the node even when
the immediately called function doesn't return set (but does have a set-valued
argument). Per report from Anupama Aherrao.
toasted values, since those could get dropped once the cursor's transaction
is over. Per bug #4553 from Andrew Gierth.
Back-patch as far as 8.1. The bug actually exists back to 7.4 when holdable
cursors were introduced, but this patch won't work before 8.1 without
significant adjustments. Given the lack of field complaints, it doesn't seem
worth the work (and risk of introducing new bugs) to try to make a patch for
the older branches.
that a Portal is a useful and sufficient additional argument for
CreateDestReceiver --- it just isn't, in most cases. Instead formalize
the approach of passing any needed parameters to the receiver separately.
One unexpected benefit of this change is that we can declare typedef Portal
in a less surprising location.
This patch is just code rearrangement and doesn't change any functionality.
I'll tackle the HOLD-cursor-vs-toast problem in a follow-on patch.
DestReceiver created during postquel_start needs to be destroyed during
postquel_end. In a moment of brain fade I had assumed this would be taken
care of by FreeQueryDesc, but it's not (and shouldn't be).
* Refactor explain.c slightly to export a convenient-to-use subroutine
for printing EXPLAIN results.
* Provide hooks for plugins to get control at ExecutorStart and ExecutorEnd
as well as ExecutorRun.
* Add some minimal support for tracking the total runtime of ExecutorRun.
This code won't actually do anything unless a plugin prods it to.
* Change the API of the DefineCustomXXXVariable functions to allow nonzero
"flags" to be specified for a custom GUC variable. While at it, also make
the "bootstrap" default value for custom GUCs be explicitly specified as a
parameter to these functions. This is to eliminate confusion over where the
default comes from, as has been expressed in the past by some users of the
custom-variable facility.
* Refactor GUC code a bit to ensure that a custom variable gets initialized to
something valid (like its default value) even if the placeholder value was
invalid.
locate the target row, if the cursor was declared with FOR UPDATE or FOR
SHARE. This approach is more flexible and reliable than digging through the
plan tree; for instance it can cope with join cursors. But we still provide
the old code for use with non-FOR-UPDATE cursors. Per gripe from Robert Haas.
return the tableoid as well as the ctid for any FOR UPDATE targets that
have child tables. All child tables are listed in the ExecRowMark list,
but the executor just skips the ones that didn't produce the current row.
Curiously, this longstanding restriction doesn't seem to have been documented
anywhere; so no doc changes.
(but not locked, as that would risk deadlocks). Also, make it work in a small
ring of buffers to avoid having bulk inserts trash the whole buffer arena.
Robert Haas, after an idea of Simon Riggs'.
and heap_deformtuple in favor of the newer functions heap_form_tuple et al
(which do the same things but use bool control flags instead of arbitrary
char values). Eliminate the former duplicate coding of these functions,
reducing the deprecated functions to mere wrappers around the newer ones.
We can't get rid of them entirely because add-on modules probably still
contain many instances of the old coding style.
Kris Jurka
it just return void instead of sometimes returning a TupleTableSlot. SQL
functions don't need that anymore, and noplace else does either. Eliminating
the return value also means one less hassle for the ExecutorRun hook functions
that will be supported beginning in 8.4.
RETURNING clause, not just a SELECT as formerly.
A side effect of this patch is that when a set-returning SQL function is used
in a FROM clause, performance is improved because the output is collected into
a tuplestore within the function, rather than using the less efficient
value-per-call mechanism.
backwards scan could actually happen. In particular, pass a flag to
materialize-mode SRFs that tells them whether they need to require random
access. In passing, also suppress unneeded backward-scan overhead for a
Portal's holdStore tuplestore. Per my proposal about reducing I/O costs for
tuplestores.
via a tuplestore instead of value-per-call. Refactor a few things to reduce
ensuing code duplication with nodeFunctionscan.c. This represents the
reasonably noncontroversial part of my proposed patch to switch SQL functions
over to returning tuplestores. For the moment, SQL functions still do things
the old way. However, this change enables PL SRFs to be called in targetlists
(observe changes in plperl regression results).
didn't actually work, because nodeRecursiveunion.c creates the underlying
tuplestore with backward scan disabled; which is a decision that we shouldn't
reverse because of performance cost. We could imagine adding signaling from
WorkTableScan to RecursiveUnion about whether backward scan is needed ...
but in practice it'd be a waste of effort, because there simply isn't any
current or plausible future scenario where WorkTableScan would be called on
to scan backward. So just dike out the code that claims to support it.
in their targetlists had better reset ps_TupFromTlist during ReScan calls.
There's no need to back-patch here since nodeAgg and nodeGroup didn't
even pretend to support SRFs in prior releases.
scanning; GiST and GIN do not, and it seems like too much trouble to make
them do so. By teaching ExecSupportsBackwardScan() about this restriction,
we ensure that the planner will protect a scroll cursor from the problem
by adding a Materialize node.
In passing, fix another longstanding bug in the same area: backwards scan of
a plan with set-returning functions in the targetlist did not work either,
since the TupFromTlist expansion code pays no attention to direction (and
has no way to run a SRF backwards anyway). Again the fix is to make
ExecSupportsBackwardScan check this restriction.
Also adjust the index AM API specification to note that mark/restore support
is unnecessary if the AM can't produce ordered output.
In the previous coding, the list of columns that needed to be hashed on
was allocated in the per-query context, but we reallocated every time
the Agg node was rescanned. Since this information doesn't change over
a rescan, just construct the list of columns once during ExecInitAgg().
according to the TupleDesc's natts, not the number of physical columns in the
tuple. The previous coding would do the wrong thing in cases where natts is
different from the tuple's column count: either incorrectly report error when
it should just treat the column as null, or actually crash due to indexing off
the end of the TupleDesc's attribute array. (The second case is probably not
possible in modern PG versions, due to more careful handling of inheritance
cases than we once had. But it's still a clear lack of robustness here.)
The incorrect error indication is ignored by all callers within the core PG
distribution, so this bug has no symptoms visible within the core code, but
it might well be an issue for add-on packages. So patch all the way back.
RecursiveUnion to which it refers. It turns out that we can just postpone the
relevant initialization steps until the first exec call for the node, by which
time the ancestor node must surely be initialized. Per report from Greg Stark.
implementation uses an in-memory hash table, so it will poop out for very
large recursive results ... but the performance characteristics of a
sort-based implementation would be pretty unpleasant too.
There are some unimplemented aspects: recursive queries must use UNION ALL
(should allow UNION too), and we don't have SEARCH or CYCLE clauses.
These might or might not get done for 8.4, but even without them it's a
pretty useful feature.
There are also a couple of small loose ends and definitional quibbles,
which I'll send a memo about to pgsql-hackers shortly. But let's land
the patch now so we can get on with other development.
Yoshiyuki Asaba, with lots of help from Tatsuo Ishii and Tom Lane
This facility replaces the former mark/restore support but is otherwise
upward-compatible with previous uses. It's expected to be needed for
single evaluation of CTEs and also for window functions, so I'm committing
it separately instead of waiting for either one of those patches to be
finished. Per discussion with Greg Stark and Hitoshi Harada.
Note: I removed nodeFunctionscan's mark/restore support, instead of bothering
to update it for this change, because it was dead code anyway.
we regenerate the SQL query text not merely the plan derived from it. This
is needed to handle contingencies such as renaming of a table or column
used in an FK. Pre-8.3, such cases worked despite the lack of replanning
(because the cached plan needn't actually change), so this is a regression.
Per bug #4417 from Benjamin Bihler.
GetOldestXmin() instead of RecentGlobalXmin; this is safer because we do not
depend on the latter being correctly set elsewhere, and while it is more
expensive, this code path is not performance-critical. This is a real
risk for autovacuum, because it can execute whole cycles without doing
a single vacuum, which would mean that RecentGlobalXmin would stay at its
initialization value, FirstNormalTransactionId, causing a bogus value to be
inserted in pg_database. This bug could explain some recent reports of
failure to truncate pg_clog.
At the same time, change the initialization of RecentGlobalXmin to
InvalidTransactionId, and ensure that it's set to something else whenever
it's going to be used. Using it as FirstNormalTransactionId in HOT page
pruning could incur in data loss. InitPostgres takes care of setting it
to a valid value, but the extra checks are there to prevent "special"
backends from behaving in unusual ways.
Per Tom Lane's detailed problem dissection in 29544.1221061979@sss.pgh.pa.us
into nodes/nodeFuncs, so as to reduce wanton cross-subsystem #includes inside
the backend. There's probably more that should be done along this line,
but this is a start anyway.
checks in ExecIndexBuildScanKeys() that were inadequate anyway: it's better
to verify the correct varno on an expected index key, not just reject OUTER
and INNER.
This makes the entire current contents of nodeFuncs.c dead code. I'll be
replacing it with some other stuff later, as per recent proposal.
subqueries into the same thing you'd have gotten from IN (except always with
unknownEqFalse = true, so as to get the proper semantics for an EXISTS).
I believe this fixes the last case within CVS HEAD in which an EXISTS could
give worse performance than an equivalent IN subquery.
The tricky part of this is that if the upper query probes the EXISTS for only
a few rows, the hashing implementation can actually be worse than the default,
and therefore we need to make a cost-based decision about which way to use.
But at the time when the planner generates plans for subqueries, it doesn't
really know how many times the subquery will be executed. The least invasive
solution seems to be to generate both plans and postpone the choice until
execution. Therefore, in a query that has been optimized this way, EXPLAIN
will show two subplans for the EXISTS, of which only one will actually get
executed.
There is a lot more that could be done based on this infrastructure: in
particular it's interesting to consider switching to the hash plan if we start
out using the non-hashed plan but find a lot more upper rows going by than we
expected. I have therefore left some minor inefficiencies in place, such as
initializing both subplans even though we will currently only use one.
match in antijoin mode, we should advance to next outer tuple not next inner.
We know we don't want to return this outer tuple, and there is no point in
advancing over matching inner tuples now, because we'd just have to do it
again if the next outer tuple has the same merge key. This makes a noticeable
difference if there are lots of duplicate keys in both inputs.
Similarly, after finding a match in semijoin mode, arrange to advance to
the next outer tuple after returning the current match; or immediately,
if it fails the extra quals. The rationale is the same. (This is a
performance bug in existing releases; perhaps worth back-patching? The
planner tries to avoid using mergejoin with lots of duplicates, so it may
not be a big issue in practice.)
Nestloop and hash got this right to start with, but I made some cosmetic
adjustments there to make the corresponding bits of logic look more similar.
the old JOIN_IN code, but antijoins are new functionality.) Teach the planner
to convert appropriate EXISTS and NOT EXISTS subqueries into semi and anti
joins respectively. Also, LEFT JOINs with suitable upper-level IS NULL
filters are recognized as being anti joins. Unify the InClauseInfo and
OuterJoinInfo infrastructure into "SpecialJoinInfo". With that change,
it becomes possible to associate a SpecialJoinInfo with every join attempt,
which permits some cleanup of join selectivity estimation. That needs to be
taken much further than this patch does, but the next step is to change the
API for oprjoin selectivity functions, which seems like material for a
separate patch. So for the moment the output size estimates for semi and
especially anti joins are quite bogus.
INSERT or UPDATE will match the target table's current rowtype. In pre-8.3
releases inconsistency can arise with stale cached plans, as reported by
Merlin Moncure. (We patched the equivalent hazard on the SELECT side in Feb
2007; I'm not sure why we thought there was no risk on the insertion side.)
In 8.3 and HEAD this problem should be impossible due to plan cache
invalidation management, but it seems prudent to make the check anyway.
Back-patch as far as 8.0. 7.x versions lack ALTER COLUMN TYPE, so there
seems no way to abuse a stale plan comparably.
hashtable entries for tuples that are found only in the second input: they
can never contribute to the output. Furthermore, this implies that the
planner should endeavor to put first the smaller (in number of groups) input
relation for an INTERSECT. Implement that, and upgrade prepunion's estimation
of the number of rows returned by setops so that there's some amount of sanity
in the estimate of which one is smaller.
This completes my project of improving usage of hashing for duplicate
elimination (aggregate functions with DISTINCT remain undone, but that's
for some other day).
As with the previous patches, this means we can INTERSECT/EXCEPT on datatypes
that can hash but not sort, and it means that INTERSECT/EXCEPT without ORDER
BY are no longer certain to produce sorted output.
would work, but in fact it didn't return the same rows when moving backwards
as when moving forwards. This would have no visible effect in a DISTINCT
query (at least assuming the column datatypes use a strong definition of
equality), but it gave entirely wrong answers for DISTINCT ON queries.
as per my recent proposal:
1. Fold SortClause and GroupClause into a single node type SortGroupClause.
We were already relying on them to be struct-equivalent, so using two node
tags wasn't accomplishing much except to get in the way of comparing items
with equal().
2. Add an "eqop" field to SortGroupClause to carry the associated equality
operator. This is cheap for the parser to get at the same time it's looking
up the sort operator, and storing it eliminates the need for repeated
not-so-cheap lookups during planning. In future this will also let us
represent GROUP/DISTINCT operations on datatypes that have hash opclasses
but no btree opclasses (ie, they have equality but no natural sort order).
The previous representation simply didn't work for that, since its only
indicator of comparison semantics was a sort operator.
3. Add a hasDistinctOn boolean to struct Query to explicitly record whether
the distinctClause came from DISTINCT or DISTINCT ON. This allows removing
some complicated and not 100% bulletproof code that attempted to figure
that out from the distinctClause alone.
This patch doesn't in itself create any new capability, but it's necessary
infrastructure for future attempts to use hash-based grouping for DISTINCT
and UNION/INTERSECT/EXCEPT.
filter to be used when INSERT or SELECT INTO has a plan that returns raw
disk tuples. The virtual-tuple-slot optimizations that were put in place
awhile ago mean that ExecInsert has to do ExecMaterializeSlot, and that
already copies the tuple if it's raw (and does so more efficiently than
a junk filter, too). So get rid of that logic. This in turn means that
we can throw away ExecMayReturnRawTuples, which wasn't used for any other
purpose, and was always a kluge anyway.
In passing, move a couple of SELECT-INTO-specific fields out of EState
and into the private state of the SELECT INTO DestReceiver, as was foreseen
in an old comment there. Also make intorel_receive use ExecMaterializeSlot
not ExecCopySlotTuple, for consistency with ExecInsert and to possibly save
a tuple copy step in some cases.
a portal are never NULL, but reliably provide the source text of the query.
It turns out that there was only one place that was really taking a short-cut,
which was the 'EXECUTE' utility statement. That doesn't seem like a
sufficiently critical performance hotspot to justify not offering a guarantee
of validity of the portal source text. Fix it to copy the source text over
from the cached plan. Add Asserts in the places that set up cached plans and
portals to reject null source strings, and simplify a bunch of places that
formerly needed to guard against nulls.
There may be a few places that cons up statements for execution without
having any source text at all; I found one such in ConvertTriggerToFK().
It seems sufficient to inject a phony source string in such a case,
for instance
ProcessUtility((Node *) atstmt,
"(generated ALTER TABLE ADD FOREIGN KEY command)",
NULL, false, None_Receiver, NULL);
We should take a second look at the usage of debug_query_string,
particularly the recently added current_query() SQL function.
ITAGAKI Takahiro and Tom Lane
bug #4290. The fundamental bug is that masking extParam by outer_params,
as finalize_plan had been doing, caused us to lose the information that
an initPlan depended on the output of a sibling initPlan. On reflection
the best thing to do seemed to be not to try to adjust outer_params for
this case but get rid of it entirely. The only thing it was really doing
for us was to filter out param IDs associated with SubPlan nodes, and that
can be done (with greater accuracy) while processing individual SubPlan
nodes in finalize_primnode. This approach was vindicated by the discovery
that the masking method was hiding a second bug: SS_finalize_plan failed to
remove extParam bits for initPlan output params that were referenced in the
main plan tree (it only got rid of those referenced by other initPlans).
It's not clear that this caused any real problems, given the limited use
of extParam by the executor, but it's certainly not what was intended.
I originally thought that there was also a problem with needing to include
indirect dependencies on external params in initPlans' param sets, but it
turns out that the executor handles this correctly so long as the depended-on
initPlan is earlier in the initPlans list than the one using its output.
That seems a bit of a fragile assumption, but it is true at the moment,
so I just documented it in some code comments rather than making what would
be rather invasive changes to remove the assumption.
Back-patch to 8.1. Previous versions don't have the case of initPlans
referring to other initPlans' outputs, so while the existing logic is still
questionable for them, there are not any known bugs to be fixed. So I'll
refrain from changing them for now.
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
the PARAM_FLAG_CONST flag on the parameters that are passed into the portal,
while the former's behavior is unchanged. This should only affect the case
where the portal is executing an EXPLAIN; it will cause the generated plan to
look more like what would be generated if the portal were actually executing
the command being explained. Per gripe from Pavel.
functions.
Note that because this patch changes FmgrInfo, any external C functions
you might be testing with 8.4 will need to be recompiled.
Patch by Martin Pihlak, some editorialization by me (principally, removing
tracking of getrusage() numbers)
file portability/instr_time.h, and add a couple more macros to eliminate
some abstraction leakage we formerly had. Also update psql to use this
header instead of its own copy of nearly the same code.
This commit in itself is just code cleanup and shouldn't change anything.
It lays some groundwork for the upcoming function-stats patch, though.
There are two ways to track a snapshot: there's the "registered" list, which
is used for arbitrary long-lived snapshots; and there's the "active stack",
which is used for the snapshot that is considered "active" at any time.
This also allows users of snapshots to stop worrying about snapshot memory
allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
assignment. This is all done automatically now.
As a consequence, this allows us to reset MyProc->xmin when there are no
more snapshots registered in the current backend, reducing the impact that
long-running transactions have on VACUUM.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
as those for inherited columns; that is, it's no longer allowed for a child
table to not have a check constraint matching one that exists on a parent.
This satisfies the principle of least surprise (rows selected from the parent
will always appear to meet its check constraints) and eliminates some
longstanding bogosity in pg_dump, which formerly had to guess about whether
check constraints were really inherited or not.
The implementation involves adding conislocal and coninhcount columns to
pg_constraint (paralleling attislocal and attinhcount in pg_attribute)
and refactoring various ALTER TABLE actions to be more like those for
columns.
Alex Hunsaker, Nikhil Sontakke, Tom Lane
a user-supplied TID is out of range for the relation. This is needed to
preserve compatibility with our pre-8.3 behavior, and it is sensible anyway
since if the query were implemented by brute force rather than optimized
into a TidScan, the behavior for a non-existent TID would be zero rows out,
never an error. Per gripe from Gurjeet Singh.
UPDATE/SHARE couldn't occur as a subquery in a query with a non-SELECT
top-level operation. Symptoms included outright failure (as in report from
Mark Mielke) and silently neglecting to take the requested row locks.
Back-patch to 8.3, because the visible failure in the INSERT ... SELECT case
is a regression from 8.2. I'm a bit hesitant to back-patch further given the
lack of field complaints.
no particular need to do get_op_opfamily_properties() while building an
indexscan plan. Postpone that lookup until executor start. This simplifies
createplan.c a lot more than it complicates nodeIndexscan.c, and makes things
more uniform since we already had to do it that way for RowCompare
expressions. Should be a bit faster too, at least for plans that aren't
re-used many times, since we avoid palloc'ing and perhaps copying the
intermediate list data structure.
instead of plan time. Extend the amgettuple API so that the index AM returns
a boolean indicating whether the indexquals need to be rechecked, and make
that rechecking happen in nodeIndexscan.c (currently the only place where
it's expected to be needed; other callers of index_getnext are just erroring
out for now). For the moment, GIN and GIST have stub logic that just always
sets the recheck flag to TRUE --- I'm hoping to get Teodor to handle pushing
that control down to the opclass consistent() functions. The planner no
longer pays any attention to amopreqcheck, and that catalog column will go
away in due course.
indexscan always occurs in one call, and the results are returned in a
TIDBitmap instead of a limited-size array of TIDs. This should improve
speed a little by reducing AM entry/exit overhead, and it is necessary
infrastructure if we are ever to support bitmap indexes.
In an only slightly related change, add support for TIDBitmaps to preserve
(somewhat lossily) the knowledge that particular TIDs reported by an index
need to have their quals rechecked when the heap is visited. This facility
is not really used yet; we'll need to extend the forced-recheck feature to
plain indexscans before it's useful, and that hasn't been coded yet.
The intent is to use it to clean up 8.3's horrid @@@ kluge for text search
with weighted queries. There might be other uses in future, but that one
alone is sufficient reason.
Heikki Linnakangas, with some adjustments by me.
responsible for copying the query string into the new Portal. Such copying
is unnecessary in the common code path through exec_simple_query, and in
this case it can be enormously expensive because the string might contain
a large number of individual commands; we were copying the entire, long
string for each command, resulting in O(N^2) behavior for N commands.
(This is the cause of bug #4079.) A second problem with it is that
PortalDefineQuery really can't risk error, because if it elog's before
having set up the Portal, we will leak the plancache refcount that the
caller is trying to hand off to the portal. So go back to the design in
which the caller is responsible for making sure everything is copied into
the portal if necessary.
that is commands that have out-of-line parameters but the plan is prepared
assuming that the parameter values are constants. This is needed for the
plpgsql EXECUTE USING patch, but will probably have use elsewhere.
This commit includes the SPI functions and documentation, but no callers
nor regression tests. The upcoming EXECUTE USING patch will provide
regression-test coverage. I thought committing this separately made
sense since it's logically a distinct feature.
snapmgmt.c file for the former. The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.
This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.
Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
strings. This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString. A number of
existing macros that provided variants on these themes were removed.
Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed. There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).
This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.
Brendan Jurd and Tom Lane
identical to tuplestore_puttuple(), except it operates on arrays of
Datums + nulls rather than a fully-formed HeapTuple. In several places
that use the tuplestore API, this means we can avoid creating a
HeapTuple altogether, saving a copy.
are declared to return set, and consist of just a single SELECT. We
can replace the FROM-item with a sub-SELECT and then optimize much as
if we were dealing with a view. Patch from Richard Rowell, cleaned up
by me.
during a bitmap index scan. This cannot affect the query results
(since we're just dumping the TIDs into a bitmap) but it might offer
some advantage in locality of access to the index. Per Greg Stark.
"multi_call_ctx" to be a distinct sub-context of the EState's per-query
context, and delete the multi_call_ctx as soon as the SRF finishes
execution. This avoids leaking SRF memory until the end of the current
query, which is particularly egregious when the SRF is scanned
multiple times. This change also fixes a leak of the fields of the
AttInMetadata struct in shutdown_MultiFuncCall().
Also fix a leak of the SRF result TupleDesc when rescanning a
FunctionScan node. The TupleDesc is allocated in the per-query context
for every call to ExecMakeTableFunctionResult(), so we should free it
after calling that function. Since the SRF might choose to return
a non-expendable TupleDesc, we only free the TupleDesc if it is
not being reference-counted.
Backpatch to 8.3 and 8.2 stable branches.
doing anything interesting, such as calling RevalidateCachedPlan(). The
necessity of this is demonstrated by an example from Willem Buitendyk:
during a replan, the planner might try to evaluate SPI-using functions,
and so we'd better be in a clean SPI context.
A small downside of this fix is that these two functions will now fail
outright if called when not inside a SPI-using procedure (ie, a
SPI_connect/SPI_finish pair). The documentation never promised or suggested
that that would work, though; and they are normally used in concert with
other functions, mainly SPI_prepare, that always have failed in such a case.
So the odds of breaking something seem pretty low.
In passing, make SPI_is_cursor_plan's error handling convention clearer,
and fix documentation's erroneous claim that SPI_cursor_open would
return NULL on error.
Before 8.3 these functions could not invoke replanning, so there is probably
no need for back-patching.
tablespace permissions failures when copying an index that is in the
database's default tablespace. A side-effect of the change is that explicitly
specifying the default tablespace no longer triggers a permissions check;
this is not how it was done in pre-8.3 releases but is argued to be more
consistent. Per bug #3921 from Andrew Gilligan. (Note: I argued in the
subsequent discussion that maybe LIKE shouldn't copy index tablespaces
at all, but since no one indicated agreement with that idea, I've refrained
from doing it.)
checking of argument compatibility right; although the problem is only exposed
with multiple-input aggregates in which some arguments are polymorphic and
some are not. Per bug #3852 from Sokolov Yura.
but no database changes have been made since the last CommandCounterIncrement.
This should result in a significant improvement in the number of "commands"
that can typically be performed within a transaction before hitting the 2^32
CommandId size limit. In particular this buys back (and more) the possible
adverse consequences of my previous patch to fix plan caching behavior.
The implementation requires tracking whether the current CommandCounter
value has been "used" to mark any tuples. CommandCounter values stored into
snapshots are presumed not to be used for this purpose. This requires some
small executor changes, since the executor used to conflate the curcid of
the snapshot it was using with the command ID to mark output tuples with.
Separating these concepts allows some small simplifications in executor APIs.
Something for the TODO list: look into having CommandCounterIncrement not do
AcceptInvalidationMessages. It seems fairly bogus to be doing it there,
but exactly where to do it instead isn't clear, and I'm disinclined to mess
with asynchronous behavior during late beta.
plan before the effects of DDL executed in an immediately prior SPI operation
had been absorbed. Per report from Chris Wood.
This patch has an unpleasant side effect of causing the number of
CommandCounterIncrement()s done by a typical plpgsql function to
approximately double. Amelioration of the consequences of that
will be undertaken in a separate patch.
in corner cases such as re-fetching a just-deleted row. We may be able to
relax this someday, but let's find out how many people really care before
we invest a lot of work in it. Per report from Heikki and subsequent
discussion.
While in the neighborhood, make the combination of INSENSITIVE and FOR UPDATE
throw an error, since they are semantically incompatible. (Up to now we've
accepted but just ignored the INSENSITIVE option of DECLARE CURSOR.)
then-delete on the current cursor row. The basic fix is that nodeTidscan.c
has to apply heap_get_latest_tid() to the current-scan-TID obtained from the
cursor query; this ensures we get the latest row version to work with.
However, since that only works if the query plan is a TID scan, we also have
to hack the planner to make sure only that type of plan will be selected.
(Formerly, the planner might decide to apply a seqscan if the table is very
small. This change is probably a Good Thing anyway, since it's hard to see
how a seqscan could really win.) That means the execQual.c code to support
CurrentOfExpr as a regular expression type is dead code, so replace it with
just an elog(). Also, add regression tests covering these cases. Note
that the added tests expose the fact that re-fetching an updated row
misbehaves if the cursor used FOR UPDATE. That's an independent bug that
should be fixed later. Per report from Dharmendra Goyal.
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
than two independent bits (one of which was never used in heap pages anyway,
or at least hadn't been in a very long time). This gives us flexibility to
add the HOT notions of redirected and dead item pointers without requiring
anything so klugy as magic values of lp_off and lp_len. The state values
are chosen so that for the states currently in use (pre-HOT) there is no
change in the physical representation.
no need for serialization against snapshot-taking because the xact doesn't
affect anyone else's snapshot anyway. Per discussion. Also, move various
info about the interlocking of transactions and snapshots out of code comments
and into a hopefully-more-cohesive discussion in access/transam/README.
Also, remove a couple of now-obsolete comments about having to force some WAL
to be written to persuade RecordTransactionCommit to do its thing.
null::char(3) to a simple Const node. (It already worked for non-null values,
but not when we skipped evaluation of a strict coercion function.) This
prevents loss of typmod knowledge in situations such as exhibited in bug
#3598. Unfortunately there seems no good way to fix that bug in 8.1 and 8.2,
because they simply don't carry a typmod for a plain Const node.
In passing I made all the other callers of makeNullConst supply "real" typmod
values too, though I think it probably doesn't matter anywhere else.
generating the tuples has resjunk output columns. This is not possible for
simple table scans but can happen when evaluating a whole-row Var for a view.
Per example from Patryk Kordylewski. The problem exists back to 8.0 but
I'm not going to risk back-patching further than 8.2 because of the many
changes in this area.
are not one of the query's defined result relations, but nonetheless have
triggers fired against them while the query is active. This was formerly
impossible but can now occur because of my recent patch to fix the firing
order for RI triggers. Caching a ResultRelInfo avoids duplicating work by
repeatedly opening and closing the same relation, and also allows EXPLAIN
ANALYZE to "see" and report on these extra triggers. Use the same mechanism
to cache open relations when firing deferred triggers at transaction shutdown;
this replaces the former one-element-cache strategy used in that case, and
should improve performance a bit when there are deferred triggers on a number
of relations.
row within one query: we were firing check triggers before all the updates
were done, leading to bogus failures. Fix by making the triggers queued by
an RI update go at the end of the outer query's trigger event list, thereby
effectively making the processing "breadth-first". This was indeed how it
worked pre-8.0, so the bug does not occur in the 7.x branches.
Per report from Pavel Stehule.
hash table is allocated in a child context of the agg node's memory
context, MemoryContextReset() will reset but *not* delete the child
context. Since ExecReScanAgg() proceeds to build a new hash table
from scratch (in a new sub-context), this results in leaking the
header for the previous memory context. Therefore, use
MemoryContextResetAndDeleteChildren() instead.
Credit: My colleague Sailesh Krishnamurthy at Truviso for isolating
the cause of the leak.
few lines in sql_exec_error_callback() by using the function source string
field that the patch added to SQL function cache entries. This doesn't work
because the fn_extra field isn't filled in yet during init_sql_fcache().
Probably it could be made to work, but it doesn't seem appropriate to contort
the main code paths to make an error-reporting path a tad faster. Per report
from Pavel Stehule.
with a plpgsql-defined cursor. The underlying mechanism for this is that the
main SQL engine will now take "WHERE CURRENT OF $n" where $n is a refcursor
parameter. Not sure if we should document that fact or consider it an
implementation detail. Per discussion with Pavel Stehule.
Along the way, allow FOR UPDATE in non-WITH-HOLD cursors; there may once
have been a reason to disallow that, but it seems to work now, and it's
really rather necessary if you want to select a row via a cursor and then
update it in a concurrent-safe fashion.
Original patch by Arul Shaji, rather heavily editorialized by Tom Lane.
pseudo HeapScanDesc created for a bitmap heap scan. This avoids some useless
overhead during a bitmap scan startup, in particular invoking the syncscan
code. (We might someday want to do that, but right now it's merely useless
contention for shared memory, to say nothing of possibly pushing useful
entries out of syncscan's small LRU list.) This also allows elimination of
ugly pgstat_discount_heap_scan() kluge.
for each temp file, rather than once per sort or hashjoin; this allows
spreading the data of a large sort or join across multiple tablespaces.
(I remain dubious that this will make any difference in practice, but certain
people insisted.) Arrange to cache the results of parsing the GUC variable
instead of recomputing from scratch on every demand, and push usage of the
cache down to the bottommost fd.c level.
were accepted by prior Postgres releases. This takes care of the loose end
left by the preceding patch to downgrade implicit casts-to-text. To avoid
breaking desirable behavior for array concatenation, introduce a new
polymorphic pseudo-type "anynonarray" --- the added concatenation operators
are actually text || anynonarray and anynonarray || text.
from the other string-category types; this eliminates a lot of surprising
interpretations that the parser could formerly make when there was no directly
applicable operator.
Create a general mechanism that supports casts to and from the standard string
types (text,varchar,bpchar) for *every* datatype, by invoking the datatype's
I/O functions. These new casts are assignment-only in the to-string direction,
explicit-only in the other, and therefore should create no surprising behavior.
Remove a bunch of thereby-obsoleted datatype-specific casting functions.
The "general mechanism" is a new expression node type CoerceViaIO that can
actually convert between *any* two datatypes if their external text
representations are compatible. This is more general than needed for the
immediate feature, but might be useful in plpgsql or other places in future.
This commit does nothing about the issue that applying the concatenation
operator || to non-text types will now fail, often with strange error messages
due to misinterpreting the operator as array concatenation. Since it often
(not always) worked before, we should either make it succeed or at least give
a more user-friendly error; but details are still under debate.
Peter Eisentraut and Tom Lane
tablespace(s) in which to store temp tables and temporary files. This is a
list to allow spreading the load across multiple tablespaces (a random list
element is chosen each time a temp object is to be created). Temp files are
not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace
directories.
Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.
selecting power-of-2, rather than prime, numbers of buckets in hash joins.
If the hash functions are doing their jobs properly by making all hash bits
equally random, this is good enough, and it saves expensive integer division
and modulus operations.
EXPLAIN-only operation was a little too short; it skipped initializing the
node's result tuple type, which may be needed depending on what's above the
indexscan node. Call ExecAssignResultTypeFromTL before exiting. (For good
luck I moved up the ExecAssignScanProjectionInfo call as well, so that
everything except indexscan-specific initialization will still be done.)
Per example from Grant Finnemore.
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.
Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures. And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
and/or create plans for hypothetical situations; in particular, investigate
plans that would be generated using hypothetical indexes. This is a
heavily-rewritten version of the hooks proposed by Gurjeet Singh for his
Index Advisor project. In this formulation, the index advisor can be
entirely a loadable module instead of requiring a significant part to be
in the core backend, and plans can be generated for hypothetical indexes
without requiring the creation and rolling-back of system catalog entries.
The index advisor patch as-submitted is not compatible with these hooks,
but it needs significant work anyway due to other 8.2-to-8.3 planner
changes. With these hooks in the core backend, development of the advisor
can proceed as a pgfoundry project.
is using mark/restore but not rewind or backward-scan capability. Insert a
materialize plan node between a mergejoin and its inner child if the inner
child is a sort that is expected to spill to disk. The materialize shields
the sort from the need to do mark/restore and thereby allows it to perform
its final merge pass on-the-fly; while the materialize itself is normally
cheap since it won't spill to disk unless the number of tuples with equal
key values exceeds work_mem.
Greg Stark, with some kibitzing from Tom Lane.
recompute the limit/offset immediately, so that the updated values are
available when the child's ReScan function is invoked. Add a regression
test for this, too. Bug is new in HEAD (due to the bounded-sorting patch)
so no need for back-patch.
I did not do anything about merging this signaling with chgParam processing,
but if we were to do that we'd still need to compute the updated values
at this point rather than during the first ProcNode call.
Per observation and test case from Greg Stark, though I didn't use his patch.
need be returned. We keep a heap of the current best N tuples and sift-up
new tuples into it as we scan the input. For M input tuples this means
only about M*log(N) comparisons instead of M*log(M), not to mention a lot
less workspace when N is small --- avoiding spill-to-disk for large M
is actually the most attractive thing about it. Patch includes planner
and executor support for invoking this facility in ORDER BY ... LIMIT
queries. Greg Stark, with some editorialization by moi.
types of unspecified parameters when submitted via extended query protocol.
This worked in 8.2 but I had broken it during plancache changes. DECLARE
CURSOR is now treated almost exactly like a plain SELECT through parse
analysis, rewrite, and planning; only just before sending to the executor
do we divert it away to ProcessUtility. This requires a special-case check
in a number of places, but practically all of them were already special-casing
SELECT INTO, so it's not too ugly. (Maybe it would be a good idea to merge
the two by treating IntoClause as a form of utility statement? Not going to
worry about that now, though.) That approach doesn't work for EXPLAIN,
however, so for that I punted and used a klugy solution of running parse
analysis an extra time if under extended query protocol.
is in progress on the same hashtable. This seems the least invasive way to
fix the recently-recognized problem that a split could cause the scan to
visit entries twice or (with much lower probability) miss them entirely.
The only field-reported problem caused by this is the "failed to re-find
shared lock object" PANIC in COMMIT PREPARED reported by Michel Dorochevsky,
which was caused by multiply visited entries. However, it seems certain
that mdsync() is vulnerable to missing required fsync's due to missed
entries, and I am fearful that RelationCacheInitializePhase2() might be at
risk as well. Because of that and the generalized hazard presented by this
bug, back-patch all the supported branches.
Along the way, fix pg_prepared_statement() and pg_cursor() to not assume
that the hashtables they are examining will stay static between calls.
This is risky regardless of the newly noted dynahash problem, because
hash_seq_search() has never promised to cope with deletion of table entries
other than the just-returned one. There may be no bug here because the only
supported way to call these functions is via ExecMakeTableFunctionResult()
which will cycle them to completion before doing anything very interesting,
but it seems best to get rid of the assumption. This affects 8.2 and HEAD
only, since those functions weren't there earlier.
a replan. I had originally thought this was not necessary, but the new
SPI facilities create a path whereby queries planned with non-default
options can get into the cache, so it is necessary.
access to the planner's cursor-related planning options, and provide new
FETCH/MOVE routines that allow access to the full power of those commands.
Small refactoring of planner(), pg_plan_query(), and pg_plan_queries()
APIs to make it convenient to pass the planning options down from SPI.
This is the core-code portion of Pavel Stehule's patch for scrollable
cursor support in plpgsql; I'll review and apply the plpgsql changes
separately.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields. While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.
Greg Stark with some help from Tom Lane
return void ends with a SELECT, if that SELECT has a single result that is
also of type void. Without this, it's hard to write a void function that
calls another void function. Per gripe from Peter.
Back-patch as far as 8.0.
seen by code inspecting the expression. The best way to do this seems
to be to drop the original representation as a function invocation, and
instead make a special expression node type that represents applying
the element-type coercion function to each array element. In this way
the element function is exposed and will be checked for volatility.
Per report from Guillaume Smet.
if possible. I had left this undone in the first pass at the API change
for ProcessUtility, but forgot to revisit it after the plancache changes
made it possible to do it.
Vadim had included this restriction in the original design of the SPI code,
but I'm darned if I can see a reason for it.
I left the macro definition of SPI_ERROR_CURSOR in place, so as not to
needlessly break any SPI callers that are checking for it, but that code
will never actually be returned anymore.
pointer" in every Snapshot struct. This allows removal of the case-by-case
tests in HeapTupleSatisfiesVisibility, which should make it a bit faster
(I didn't try any performance tests though). More importantly, we are no
longer violating portable C practices by assuming that small integers are
distinct from all pointer values, and HeapTupleSatisfiesDirty no longer
has a non-reentrant API involving side-effects on a global variable.
There were a couple of places calling HeapTupleSatisfiesXXX routines
directly rather than through the HeapTupleSatisfiesVisibility macro.
Since these places had to be changed anyway, I chose to make them go
through the macro for uniformity.
Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC
to emphasize that it's only used with MVCC-type snapshots. I was sorely
tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot,
but forebore for the moment to avoid confusion and reduce the likelihood that
this patch breaks some of the pending patches. Might want to reconsider
doing that later.
executed in read_only mode. This could lead to various relatively-subtle
failures, such as an allegedly stable function returning non-stable results.
Bug goes all the way back to the introduction of read-only mode in 8.0.
Per report from Gaetano Mendola.
uses SPI plans, this finally fixes the ancient gotcha that you can't
drop and recreate a temp table used by a plpgsql function.
Along the way, clean up SPI's API a little bit by declaring SPI plan
pointers as "SPIPlanPtr" instead of "void *". This is cosmetic but
helps to forestall simple programming mistakes. (I have changed some
but not all of the callers to match; there are still some "void *"'s
in contrib and the PL's. This is intentional so that we can see if
anyone's compiler complains about it.)
module and teach PREPARE and protocol-level prepared statements to use it.
In service of this, rearrange utility-statement processing so that parse
analysis does not assume table schemas can't change before execution for
utility statements (necessary because we don't attempt to re-acquire locks
for utility statements when reusing a stored plan). This requires some
refactoring of the ProcessUtility API, but it ends up cleaner anyway,
for instance we can get rid of the QueryContext global.
Still to do: fix up SPI and related code to use the plan cache; I'm tempted to
try to make SQL functions use it too. Also, there are at least some aspects
of system state that we want to ensure remain the same during a replan as in
the original processing; search_path certainly ought to behave that way for
instance, and perhaps there are others.
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names. Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught. In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
parent query's EState. Now that there's a single flat rangetable for both
the main plan and subplans, there's no need anymore for a separate EState,
and removing it allows cleaning up some crufty code in nodeSubplan.c and
nodeSubqueryscan.c. Should be a tad faster too, although any difference
will probably be hard to measure. This is the last bit of subsidiary
mop-up work from changing to a flat rangetable.
and quals have varno OUTER, rather than zero, to indicate a reference to
an output of their lefttree subplan. This is consistent with the way
that every other upper-level node type does it, and allows some simplifications
in setrefs.c and EXPLAIN.
useless substructure for its RangeTblEntry nodes. (I chose to keep using the
same struct node type and just zero out the link fields for unneeded info,
rather than making a separate ExecRangeTblEntry type --- it seemed too
fragile to have two different rangetable representations.)
Along the way, put subplans into a list in the toplevel PlannedStmt node,
and have SubPlan nodes refer to them by list index instead of direct pointers.
Vadim wanted to do that years ago, but I never understood what he was on about
until now. It makes things a *whole* lot more robust, because we can stop
worrying about duplicate processing of subplans during expression tree
traversals. That's been a constant source of bugs, and it's finally gone.
There are some consequent simplifications yet to be made, like not using
a separate EState for subplans in the executor, but I'll tackle that later.
storing mostly-redundant Query trees in prepared statements, portals, etc.
To replace Query, a new node type called PlannedStmt is inserted by the
planner at the top of a completed plan tree; this carries just the fields of
Query that are still needed at runtime. The statement lists kept in portals
etc. now consist of intermixed PlannedStmt and bare utility-statement nodes
--- no Query. This incidentally allows us to remove some fields from Query
and Plan nodes that shouldn't have been there in the first place.
Still to do: simplify the execution-time range table; at the moment the
range table passed to the executor still contains Query trees for subqueries.
initdb forced due to change of stored rules.
plan nodes, so that the executor does not need to get these items from
the range table at runtime. This will avoid needing to include these
fields in the compact range table I'm expecting to make the executor use.
be checked at plan levels below the top; namely, we have to allow for Result
nodes inserted just above a nestloop inner indexscan. Should think about
using the general Param mechanism to pass down outer-relation variables, but
for the moment we need a back-patchable solution. Per report from Phil Frost.
WHERE clauses. createplan.c is now willing to stick a gating Result node
almost anywhere in the plan tree, and in particular one can wind up directly
underneath a MergeJoin node. This means it had better be willing to handle
Mark/Restore. Fortunately, that's trivial in such cases, since we can just
pass off the call to the input node (which the planner has previously ensured
can handle Mark/Restore). Per report from Phil Frost.
out that ExecEvalVar and friends don't necessarily have access to a tuple
descriptor with correct typmod: it definitely can contain -1, and possibly
might contain other values that are different from the Var's value.
Arguably this should be cleaned up someday, but it's not a simple change,
and in any case typmod discrepancies don't pose a security hazard.
Per reports from numerous people :-(
I'm not entirely sure whether the failure can occur in 8.0 --- the simple
test cases reported so far don't trigger it there. But back-patch the
change all the way anyway.
that aren't turned into true joins). Since this is the last missing bit of
infrastructure, go ahead and fill out the hash integer_ops and float_ops
opfamilies with cross-type operators. The operator family project is now
DONE ... er, except for documentation ...
observe the xmloption.
Reorganize the representation of the XML option in the parse tree and the
API to make it easier to manage and understand.
Add regression tests for parsing back XML expressions.
made query plan. Use of ALTER COLUMN TYPE creates a hazard for cached
query plans: they could contain Vars that claim a column has a different
type than it now has. Fix this by checking during plan startup that Vars
at relation scan level match the current relation tuple descriptor. Since
at that point we already have at least AccessShareLock, we can be sure the
column type will not change underneath us later in the query. However,
since a backend's locks do not conflict against itself, there is still a
hole for an attacker to exploit: he could try to execute ALTER COLUMN TYPE
while a query is in progress in the current backend. Seal that hole by
rejecting ALTER TABLE whenever the target relation is already open in
the current backend.
This is a significant security hole: not only can one trivially crash the
backend, but with appropriate misuse of pass-by-reference datatypes it is
possible to read out arbitrary locations in the server process's memory,
which could allow retrieving database content the user should not be able
to see. Our thanks to Jeff Trout for the initial report.
Security: CVE-2007-0556
we should check that the function code returns the claimed result datatype
every time we parse the function for execution. Formerly, for simple
scalar result types we assumed the creation-time check was sufficient, but
this fails if the function selects from a table that's been redefined since
then, and even more obviously fails if check_function_bodies had been OFF.
This is a significant security hole: not only can one trivially crash the
backend, but with appropriate misuse of pass-by-reference datatypes it is
possible to read out arbitrary locations in the server process's memory,
which could allow retrieving database content the user should not be able
to see. Our thanks to Jeff Trout for the initial report.
Security: CVE-2007-0555
Standard English uses "may", "can", and "might" in different ways:
may - permission, "You may borrow my rake."
can - ability, "I can lift that log."
might - possibility, "It might rain today."
Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice. Similarly, "It may crash" is better stated, "It might crash".
Hashing for aggregation purposes still needs work, so it's not time to
mark any cross-type operators as hashable for general use, but these cases
work if the operators are so marked by hand in the system catalogs.
match because they contain a null join key (and the join operator is
known strict). Improves performance significantly when the inner
relation contains a lot of nulls, as per bug #2930.
- Add new SQL command SET XML OPTION (also available via regular GUC) to
control the DOCUMENT vs. CONTENT option in implicit parsing and
serialization operations.
- Subtle corrections in the handling of the standalone property in
xmlroot().
- Allow xmlroot() to work on content fragments.
- Subtle corrections in the handling of the version property in
xmlconcat().
- Code refactoring for producing XML declarations.
involving unions of types having typmods. Variants of the failure are known
to occur in 8.1 and up; not sure if it's possible in 8.0 and 7.4, but since
the code exists that far back, I'll just patch 'em all. Per report from
Brian Hurt.
which comparison operators to use for plan nodes involving tuple comparison
(Agg, Group, Unique, SetOp). Formerly the executor looked up the default
equality operator for the datatype, which was really pretty shaky, since it's
possible that the data being fed to the node is sorted according to some
nondefault operator class that could have an incompatible idea of equality.
The planner knows what it has sorted by and therefore can provide the right
equality operator to use. Also, this change moves a couple of catalog lookups
out of the executor and into the planner, which should help startup time for
pre-planned queries by some small amount. Modify the planner to remove some
other cavalier assumptions about always being able to use the default
operators. Also add "nulls first/last" info to the Plan node for a mergejoin
--- neither the executor nor the planner can cope yet, but at least the API is
in place.
per-column options for btree indexes. The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options. The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.
Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass. This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
involving HashAggregate over SubqueryScan (this is the known case, there
may well be more). The bug is only latent in releases before 8.2 since they
didn't try to access tupletable slots' descriptors during ExecDropTupleTable.
The least bogus fix seems to be to make subqueries share the parent query's
memory context, so that tupdescs they create will have the same lifespan as
those of the parent query. There are comments in the code envisioning going
even further by not having a separate child EState at all, but that will
require rethinking executor access to range tables, which I don't want to
tackle right now. Per bug report from Jean-Pierre Pelletier.
ps_TupFromTlist in plan nodes that make use of it. This was being done
correctly in join nodes and Result nodes but not in any relation-scan nodes.
Bug would lead to bogus results if a set-returning function appeared in the
targetlist of a subquery that could be rescanned after partial execution,
for example a subquery within EXISTS(). Bug has been around forever :-(
... surprising it wasn't reported before.
were marked canSetTag. While it's certainly correct to return the result
of the last one that is marked canSetTag, it's less clear what to do when
none of them are. Since plpgsql will complain if zero is returned, the
8.2.0 behavior isn't good. I've fixed it to restore the prior behavior of
returning the physically last query's result code when there are no
canSetTag queries.
the XmlExpr code in various lists, use a representation that has some hope
of reverse-listing correctly (though it's still a de-escaping function
shy of correctness), generally try to make it look more like Postgres
coding conventions.
cases. Operator classes now exist within "operator families". While most
families are equivalent to a single class, related classes can be grouped
into one family to represent the fact that they are semantically compatible.
Cross-type operators are now naturally adjunct parts of a family, without
having to wedge them into a particular opclass as we had done originally.
This commit restructures the catalogs and cleans up enough of the fallout so
that everything still works at least as well as before, but most of the work
needed to actually improve the planner's behavior will come later. Also,
there are not yet CREATE/DROP/ALTER OPERATOR FAMILY commands; the only way
to create a new family right now is to allow CREATE OPERATOR CLASS to make
one by default. I owe some more documentation work, too. But that can all
be done in smaller pieces once this infrastructure is in place.
release it in a subtransaction abort, but this neglects possibility that
someone outside SPI already did. Fix is for spi.c to forget about a tuptable
as soon as it's handed it back to the caller.
Per bug #2817 from Michael Andreen.
by name on each and every row processed. Profiling suggests this may
buy a percent or two for simple UPDATE scenarios, which isn't huge,
but when it's so easy to get ...
by the change to make limit values int8 instead of int4. (Specifically, you
can do DatumGetInt32 safely on a null value, but not DatumGetInt64.) Per
bug #2803 from Greg Johnson.
in the middle of executing a SPI query. This doesn't entirely fix the
problem of memory leakage in plpgsql exception handling, but it should
get rid of the lion's share of leakage.
sub-arrays. Per discussion, if all inputs are empty arrays then result
must be an empty array too, whereas a mix of empty and nonempty arrays
should (and already did) draw an error. In the back branches, the
construct was strict: any NULL input immediately yielded a NULL output;
so I left that behavior alone. HEAD was simply ignoring NULL sub-arrays,
which doesn't seem very sensible. For lack of a better idea it now
treats NULL sub-arrays the same as empty ones.
rows --- if the surrounding query queued any trigger events between the rows,
the events would be fired at the wrong time, leading to bizarre behavior.
Per report from Merlin Moncure.
This is a simple patch that should solve the problem fully in the back
branches, but in HEAD we also need to consider the possibility of queries
with RETURNING clauses. Will look into a fix for that separately.
the SQL spec, viz IS NULL is true if all the row's fields are null, IS NOT
NULL is true if all the row's fields are not null. The former coding got
this right for a limited number of cases with IS NULL (ie, those where it
could disassemble a ROW constructor at parse time), but was entirely wrong
for IS NOT NULL. Per report from Teodor.
I desisted from changing the behavior for arrays, since on closer inspection
it's not clear that there's any support for that in the SQL spec. This
probably needs more consideration.
proposal. Parameter logging works even for binary-format parameters, and
logging overhead is avoided when disabled.
log_statement = all output for the src/test/examples/testlibpq3.c example
now looks like
LOG: statement: execute <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: statement: execute <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
and log_min_duration_statement = 0 results in
LOG: duration: 2.431 ms parse <unnamed>: SELECT * FROM test1 WHERE t = $1
LOG: duration: 2.335 ms bind <unnamed> to <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: duration: 0.394 ms execute <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: duration: 1.251 ms parse <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
LOG: duration: 0.566 ms bind <unnamed> to <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
LOG: duration: 0.173 ms execute <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
(This example demonstrates the folly of ignoring parse/bind steps for duration
logging purposes, BTW.)
Along the way, create a less ad-hoc mechanism for determining which commands
are logged by log_statement = mod and log_statement = ddl. The former coding
was actually missing quite a few things that look like ddl to me, and it
did not handle EXECUTE or extended query protocol correctly at all.
This commit does not do anything about the question of whether log_duration
should be removed or made less redundant with log_min_duration_statement.
that has parameters is always planned afresh for each Bind command,
treating the parameter values as constants in the planner. This removes
the performance penalty formerly often paid for using out-of-line
parameters --- with this definition, the planner can do constant folding,
LIKE optimization, etc. After a suggestion by Andrew@supernews.
optionally bind. I re-added the "statement:" label so people will
understand why the line is being printed (it is log_*statement
behavior).
Use single quotes for bind values, instead of double quotes, and double
literal single quotes in bind values (and document that). I also made
use of the DETAIL line to have much cleaner output.
Fix all the standard PLs to be able to return tuples from FOO_RETURNING
statements as well as utility statements that return tuples. Also,
fix oversight that SPI_processed wasn't set for a utility statement
returning tuples. Per recent discussion.
cannot assume that there's exactly one Query in the Portal, as we can for
ONE_SELECT mode, because non-SELECT queries might have extra queries added
during rule rewrites. Fix things up so that we'll use ONE_RETURNING mode
when a Portal contains one primary (canSetTag) query and that query has
a RETURNING list. This appears to be a second showstopper reason for running
the Portal to completion before we start to hand anything back --- we want
to be sure that the rule-added queries get run too.
_SPI_execute_plan's return code should reflect the type of the query
that is marked canSetTag, not necessarily the last one in the list.
This is arguably a bug fix, but I'm hesitant to back-patch it because
it's the sort of subtle change that might break someone's code, and it's
best not to do that kind of thing in point releases.
merely a matter of fixing the error check, since the underlying Portal
infrastructure already handles it. This in turn allows these statements
to be used in some existing plpgsql and plperl contexts, such as a
plpgsql FOR loop. Also, do some marginal code cleanup in places that
were being sloppy about distinguishing SELECT from SELECT INTO.
plpgsql support to come later. Along the way, convert execMain's
SELECT INTO support into a DestReceiver, in order to eliminate some ugly
special cases.
Jonah Harris and Tom Lane
o print user name for all
o print portal name if defined for all
o print query for all
o reduce log_statement header to single keyword
o print bind parameters as DETAIL if text mode
that's shorter-lived than the expression state being evaluated in it really
doesn't work :-( --- we end up with fn_extra caches getting deleted while
still in use. Rather than abandon the notion of caching expression state
across domain_in calls altogether, I chose to make domain_in a bit cozier
with ExprContext. All we really need for evaluating variable-free
expressions is an ExprContext, not an EState, so I invented the notion of a
"standalone" ExprContext. domain_in can prevent resource leakages by doing
a ReScanExprContext on this rather than having to free it entirely; so we
can make the ExprContext have the same lifespan (and particularly the same
per_query memory context) as the expression state structs.
temporary context that can be reset when advancing to the next sublist.
This is faster and more thorough at recovering space than the previous
method; moreover it will do the right thing if something in the sublist
tries to register an expression context callback.
(e.g. "INSERT ... VALUES (...), (...), ...") and elsewhere as allowed
by the spec. (e.g. similar to a FROM clause subselect). initdb required.
Joe Conway and Tom Lane.
(table or index) before trying to open its relcache entry. This fixes
race conditions in which someone else commits a change to the relation's
catalog entries while we are in process of doing relcache load. Problems
of that ilk have been reported sporadically for years, but it was not
really practical to fix until recently --- for instance, the recent
addition of WAL-log support for in-place updates helped.
Along the way, remove pg_am.amconcurrent: all AMs are now expected to support
concurrent update.
created in the bootstrap phase proper, rather than added after-the-fact
by initdb. This is cleaner than before because it allows us to retire the
undocumented ALTER TABLE ... CREATE TOAST TABLE command, but the real reason
I'm doing it is so that toast tables of shared catalogs will now have
predetermined OIDs. This will allow a reasonably clean solution to the
problem of locking tables before we load their relcache entries, to appear
in a forthcoming patch.
the opportunity to treat COUNT(*) as a zero-argument aggregate instead
of the old hack that equated it to COUNT(1); this is materially cleaner
(no more weird ANYOID cases) and ought to be at least a tiny bit faster.
Original patch by Sergey Koposov; review, documentation, simple regression
tests, pg_dump and psql support by moi.
eliminate unnecessary code, force initdb because stored rules change
(limit nodes are now supposed to be int8 not int4 expressions).
Update comments and error messages, which still all said 'integer'.
discussion (including making def_arg allow reserved words), add missed
opt_definition for UNIQUE case. Put the reloptions support code in a less
random place (I chose to make a new file access/common/reloptions.c).
Eliminate header inclusion creep. Make the index options functions safely
user-callable (seems like client apps might like to be able to test validity
of options before trying to make an index). Reduce overhead for normal case
with no options by allowing rd_options to be NULL. Fix some unmaintainably
klugy code, including getting rid of Natts_pg_class_fixed at long last.
Some stylistic cleanup too, and pay attention to keeping comments in sync
with code.
Documentation still needs work, though I did fix the omissions in
catalogs.sgml and indexam.sgml.
tuple hash table entries. This addresses the problem previously noted
that use of a 'physical tlist' in the input scan node could bloat the
hash table entries far beyond what the planner expects. It's a better
answer than my previous thought of undoing the physical tlist optimization,
because we can also remove columns that are needed to compute the aggregate
functions but aren't part of the grouping column set.
per-tuple space overhead for sorts in memory. I chose to replace the
previous patch that tried to write out the bare minimum amount of data
when sorting on disk; instead, just dump the MinimalTuples as-is. This
wastes 3 to 10 bytes per tuple depending on architecture and null-bitmap
length, but the simplification in the writetup/readtup routines seems
worth it.
tuples with less header overhead than a regular HeapTuple, per my
recent proposal. Teach TupleTableSlot code how to deal with these.
As proof of concept, change tuplestore.c to store MinimalTuples instead
of HeapTuples. Future patches will expand the concept to other places
where it is useful.
aggregates. We just disallowed that, and AFAICS there should be no other
cases where direct (non-aggregated) references to input columns are allowed
in a query with aggregation and no GROUP BY.
by creating a reference-count mechanism, similar to what we did a long time
ago for catcache entries. The back branches have an ugly solution involving
lots of extra copies, but this way is more efficient. Reference counting is
only applied to tupdescs that are actually in caches --- there seems no need
to use it for tupdescs that are generated in the executor, since they'll go
away during plan shutdown by virtue of being in the per-query memory context.
Neil Conway and Tom Lane
it is just the total time to do INSTR_TIME_SET_CURRENT(), and not any of
the other code involved in InstrStartNode/InstrStopNode. Even though I
fear we may end up reverting this patch altogether, we may as well have
the most correct version in our CVS archive.
any use in the past many years, we'd have made some effort to include
them in all executor node types; but in fact they were only in
nodeAppend.c and nodeIndexscan.c, up until I copied nodeIndexscan.c's
occurrence into the new bitmap node types. Remove some other unused
macros in execdebug.h, too. Some day the whole header probably ought to
go away in favor of better-designed facilities.
support both FOR UPDATE and FOR SHARE in one command, as well as both
NOWAIT and normal WAIT behavior. The more general code is actually
simpler and cleaner.
not named ones, and replace linear searches of the list with array indexing.
The named-parameter support has been dead code for many years anyway,
and recent profiling suggests that the searching was costing a noticeable
amount of performance for complex queries.
functions are not strict, they will be called (passing a NULL first parameter)
during any attempt to input a NULL value of their datatype. Currently, all
our input functions are strict and so this commit does not change any
behavior. However, this will make it possible to build domain input functions
that centralize checking of domain constraints, thereby closing numerous holes
in our domain support, as per previous discussion.
While at it, I took the opportunity to introduce convenience functions
InputFunctionCall, OutputFunctionCall, etc to use in code that calls I/O
functions. This eliminates a lot of grotty-looking casts, but the main
motivation is to make it easier to grep for these places if we ever need
to touch them again.
2005-05-13. When we find that a new inner tuple can't possibly match any
outer tuple (because it contains a NULL), we can't immediately skip the
tuple when we are in NEXTINNER state. Doing so can lead to emitting
multiple copies of the tuple in FillInner mode, because we may rescan the
tuple after returning to a previous marked tuple. Instead, proceed to
NEXTOUTER state the same as we used to do. After we've found that there's
no need to return to the marked position, we can go to SKIPINNER_ADVANCE
state instead of SKIP_TEST when the inner tuple is unmatchable; this
preserves the performance improvement. Per bug report from Bruce.
I also made a couple of cosmetic code rearrangements and added a regression
test for the problem.
The original coding stored the raw parser output (ColumnDef and TypeName
nodes) which was ugly, bulky, and wrong because it failed to create any
dependency on the referenced datatype --- and in fact would not track type
renamings and suchlike. Instead store a list of column type OIDs in the
RTE.
Also fix up general failure of recordDependencyOnExpr to do anything sane
about recording dependencies on datatypes. While there are many cases where
there will be an indirect dependency (eg if an operator returns a datatype,
the dependency on the operator is enough), we do have to record the datatype
as a separate dependency in examples like CoerceToDomain.
initdb forced because of change of stored rules.
during parse analysis, not only errors detected in the flex/bison stages.
This is per my earlier proposal. This commit includes all the basic
infrastructure, but locations are only tracked and reported for errors
involving column references, function calls, and operators. More could
be done later but this seems like a good set to start with. I've also
moved the ReportSyntaxErrorPosition logic out of psql and into libpq,
which should make it available to more people --- even within psql this
is an improvement because warnings weren't handled by ReportSyntaxErrorPosition.
bits indicating which optional capabilities can actually be exercised
at runtime. This will allow Sort and Material nodes, and perhaps later
other nodes, to avoid unnecessary overhead in common cases.
This commit just adds the infrastructure and arranges to pass the correct
flag values down to plan nodes; none of the actual optimizations are here
yet. I'm committing this separately in case anyone wants to measure the
added overhead. (It should be negligible.)
Simon Riggs and Tom Lane
each tuple, as per my proposal of several days ago. Also, clean up
sort memory management by keeping all working data in a separate memory
context, and refine the handling of low-memory conditions.
possible ScanDirection alternatives rather than magic numbers
(-1, 0, 1). Also, use the ScanDirection macros in a few places
rather than directly checking whether `dir == ForwardScanDirection'
and the like. Per patch from James William Pye. His patch also
changed ScanDirection to be a "char" rather than an enum, which
I haven't applied.
relations: fix the executor so that we can have an Append plan on the
inside of a nestloop and still pass down outer index keys to index scans
within the Append, then generate such plans as if they were regular
inner indexscans. This avoids the need to evaluate the outer relation
multiple times.
cursors. Patch from Joachim Wieland, review and ediorialization by Neil
Conway. The view lists cursors defined by DECLARE CURSOR, using SPI, or
via the Bind message of the frontend/backend protocol. This means the
view does not list the unnamed portal or the portal created to implement
EXECUTE. Because we do list SPI portals, there might be more rows in
this view than you might expect if you are using SPI implicitly (e.g.
via a procedural language).
Per recent discussion on -hackers, the query string included in the
view for cursors defined by DECLARE CURSOR is based on
debug_query_string. That means it is not accurate if multiple queries
separated by semicolons are submitted as one query string. However,
there doesn't seem a trivial fix for that: debug_query_string
is better than nothing. I also changed SPI_cursor_open() to include
the source text for the portal it creates: AFAICS there is no reason
not to do this.
Update the documentation and regression tests, bump the catversion.
isn't being used anywhere anymore, and there seems no point in a generic
index_keytest() routine when two out of three remaining access methods
aren't using it. Also, add a comment documenting a convention for
letting access methods define private flag bits in ScanKey sk_flags.
There are no such flags at the moment but I'm thinking about changing
btree's handling of "required keys" to use flag bits in the keys
rather than a count of required key positions. Also, if some AM did
still want SK_NEGATE then it would be reasonable to treat it as a private
flag bit.
our own command (or more generally, xmin = our xact and cmin >= current
command ID) should not be seen as good. Else we may try to update rows
we already updated. This error was inserted last August while fixing the
even bigger problem that the old coding wouldn't see *any* tuples inserted
by our own transaction as good. Per report from Euler Taveira de Oliveira.
rather than "return expr;" -- the latter style is used in most of the
tree. I kept the parentheses when they were necessary or useful because
the return expression was complex.
(previously we only did = and <> correctly). Also, allow row comparisons
with any operators that are in btree opclasses, not only those with these
specific names. This gets rid of a whole lot of indefensible assumptions
about the behavior of particular operators based on their names ... though
it's still true that IN and NOT IN expand to "= ANY". The patch adds a
RowCompareExpr expression node type, and makes some changes in the
representation of ANY/ALL/ROWCOMPARE SubLinks so that they can share code
with RowCompareExpr.
I have not yet done anything about making RowCompareExpr an indexable
operator, but will look at that soon.
initdb forced due to changes in stored rules.
if we already have a stronger lock due to the index's table being the
update target table of the query. Same optimization I applied earlier
at the table level. There doesn't seem to be much interest in the more
radical idea of not locking indexes at all, so do what we can ...
relation if it's already been locked by execMain.c as either a result
relation or a FOR UPDATE/SHARE relation. This avoids an extra trip to
the shared lock manager state. Per my suggestion yesterday.
child plan nodes until we have acquired lock on the relation to scan.
The relative order of initialization of plan nodes isn't real important in
other cases, but it's critical here because one is supposed to lock a
relation before its indexes, not vice versa. The original coding was at
least vulnerable to deadlock against DROP INDEX, and perhaps worse things.
it's worth probing the outer relation for emptiness before building the
hash table. To wit, if we're rescanning a join previously performed,
remember whether we found it nonempty the previous time, and don't bother
with the probe if it was nonempty. This buys back the performance lost
in examples like Mario Weilguni's.
one child or the other had a problem: they did not leave the node in a
state that ExecReScanHashJoin would understand. In particular it would
tend to fail to reset the child plans when needed. Per report from
Mario Weilguni.
"ctid IN (list)" will still work after we convert IN to ScalarArrayOpExpr.
Make some minor efficiency improvements while at it, such as ensuring that
multiple TIDs are fetched in physical heap order. And fix EXPLAIN so that
it shows what's really going on for a TID scan.
when we first read the page, rather than checking them one at a time.
This allows us to take and release the buffer content lock just once
per page, instead of once per tuple. Since it's a shared lock the
contention penalty for holding the lock longer shouldn't be too bad.
We can safely do this only when using an MVCC snapshot; else the
assumption that visibility won't change over time is uncool. Therefore
there are now two code paths depending on the snapshot type. I also
made the same change in nodeBitmapHeapscan.c, where it can be done always
because we only support MVCC snapshots for bitmap scans anyway.
Also make some incidental cleanups in the APIs of these functions.
Per a suggestion from Qingqing Zhou.
qualification when the underlying operator is indexable and useOr is true.
That is, indexkey op ANY (ARRAY[...]) is effectively translated into an
OR combination of one indexscan for each array element. This only works
for bitmap index scans, of course, since regular indexscans no longer
support OR'ing of scans. There are still some loose ends to clean up
before changing 'x IN (list)' to translate as a ScalarArrayOpExpr;
for instance predtest.c ought to be taught about it. But this gets the
basic functionality in place.
a TupleTableSlot: instead of calling ExecClearTuple, inline the needed
operations, so that we can avoid redundant steps. In particular, when
the old and new tuples are both on the same disk page, avoid releasing
and re-acquiring the buffer pin --- this saves work in both the bufmgr
and ResourceOwner modules. To make this improvement actually useful,
partially revert a change I made on 2004-04-21 that caused SeqNext
et al to call ExecClearTuple before ExecStoreTuple. The motivation
for that, to avoid grabbing the BufMgrLock separately for releasing
the old buffer and grabbing the new one, no longer applies. My
profiling says that this saves about 5% of the CPU time for an
all-in-memory seqscan.
generate their output tuple descriptors from their target lists (ie, using
ExecAssignResultTypeFromTL()). We long ago fixed things so that all node
types have minimally valid tlists, so there's no longer any good reason to
have two different ways of doing it. This change is needed to fix bug
reported by Hayden James: the fix of 2005-11-03 to emit the correct column
names after optimizing away a SubqueryScan node didn't work if the new
top-level plan node used ExecAssignResultTypeFromOuterPlan to generate its
tupdesc, since the next plan node down won't have the correct column labels.
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
tuple in-place, but instead passes back an all-new tuple structure if
any changes are needed. This is a much cleaner and more robust solution
for the bug discovered by Alexey Beschiokov; accordingly, revert the
quick hack I installed yesterday.
With this change, HeapTupleData.t_datamcxt is no longer needed; will
remove it in a separate commit in HEAD only.
doing heap_insert or heap_update, wipe out any extracted fields in
the TupleTableSlot containing the tuple, because they might not be valid
anymore if tuptoaster.c changed the tuple. Safe because slot must be
in the materialized state, but mighty ugly --- find a better answer!
functionality, but I still need to make another pass looking at places
that incidentally use arrays (such as ACL manipulation) to make sure they
are null-safe. Contrib needs work too.
I have not changed the behaviors that are still under discussion about
array comparison and what to do with lower bounds.
slot of the topmost plan node when a trigger returns a modified tuple.
These appear to be the only places where a plan node's caller did not
treat the result slot as read-only, which is an assumption that nodeUnique
makes as of 8.1. Fixes trigger-vs-DISTINCT bug reported by Frank van Vugt.
generated from subquery outputs: use the type info stored in the Var
itself. To avoid making ExecEvalVar and slot_getattr more complex
and slower, I split out the whole-row case into a separate ExecEval routine.
type ID information even when it's a record type. This is needed to
handle whole-row Vars referencing subquery outputs. Per example from
Richard Huxton.
generated by bitmap index scans. Along the way, simplify and speed up
the code for counting sequential and index scans; it was both confusing
and inefficient to be taking care of that in the per-tuple loops, IMHO.
initdb forced because of internal changes in pg_stat view definitions.
the ProcessUtility case, resulting in an intratransaction memory leak
if a utility command actually did return any tuples, as reported by
Dmitry Karasik. Fix this and also make the behavior more consistent
for cases involving nested SPI operations and multiple query trees,
by ensuring that we store the state locally until it is ready to be
returned to the caller.
outer relation is empty did not work, per test case from Patrick Welche.
It tried to use nodeHashjoin.c's high-level mechanisms for fetching an
outer-relation tuple, but that code expected the hash table to be filled
already. As patched, the code failed in corner cases such as having no
outer-relation tuples for the first hash batch. Revert and rewrite.
the parent table, even if the command that creates them is executed by
someone else (such as a superuser or a member of the owning role).
Per gripe from Michael Fuhr.
insufficient paranoia in code that follows t_ctid links. (We must do both
because even with VACUUM doing it properly, the intermediate state with
a dangling t_ctid link is visible concurrently during lazy VACUUM, and
could be seen afterwards if either type of VACUUM crashes partway through.)
Also try to improve documentation about what's going on. Patch is a bit
bulky because passing the XMAX information around required changing the
APIs of some low-level heapam.c routines, but it's not conceptually very
complicated. Per trouble report from Teodor and subsequent analysis.
This needs to be back-patched, but I'll do that after 8.1 beta is out.
and pg_auth_members. There are still many loose ends to finish in this
patch (no documentation, no regression tests, no pg_dump support for
instance). But I'm going to commit it now anyway so that Alvaro can
make some progress on shared dependencies. The catalog changes should
be pretty much done.
(a/k/a SELECT INTO). Instead, flush and fsync the whole relation before
committing. We do still need the WAL log when PITR is active, however.
Simon Riggs and Tom Lane.
work if either of the join relations are empty. The logic is:
(1) if the inner relation's startup cost is less than the outer
relation's startup cost and this is not an outer join, read
a single tuple from the inner relation via ExecHash()
- if NULL, we're done
(2) read a single tuple from the outer relation
- if NULL, we're done
(3) build the hash table on the inner relation
- if hash table is empty and this is not an outer join,
we're done
(4) otherwise, do hash join as usual
The implementation uses the new MultiExecProcNode API, per a
suggestion from Tom: invoking ExecHash() now produces the first
tuple from the Hash node's child node, whereas MultiExecHash()
builds the hash table.
I had to put in a bit of a kludge to get the row count returned
for EXPLAIN ANALYZE to be correct: since ExecHash() is invoked to
return a tuple, and then MultiExecHash() is invoked, we would
return one too many tuples to EXPLAIN ANALYZE. I hacked around
this by just manually detecting this situation and subtracting 1
from the EXPLAIN ANALYZE row count.
this in turn causes CREATE TABLE AS in plpgsql to set ROW_COUNT.
This is how it behaved before 7.4; I had unintentionally changed the
behavior in a bit of sloppy micro-optimization.
spotted by Qingqing Zhou. The HASH_ENTER action now automatically
fails with elog(ERROR) on out-of-memory --- which incidentally lets
us eliminate duplicate error checks in quite a bunch of places. If
you really need the old return-NULL-on-out-of-memory behavior, you
can ask for HASH_ENTER_NULL. But there is now an Assert in that path
checking that you aren't hoping to get that behavior in a palloc-based
hash table.
Along the way, remove the old HASH_FIND_SAVE/HASH_REMOVE_SAVED actions,
which were not being used anywhere anymore, and were surely too ugly
and unsafe to want to see revived again.
aren't doing anything useful (ie, neither selection nor projection).
Also, extend to SubqueryScan the hacks already in place to avoid
unnecessary ExecProject calls when the result would just be the same
tuple the subquery already delivered. This saves some overhead in
UNION and other set operations, as well as avoiding overhead for
unflatten-able subqueries. Per example from Sokolov Yura.
in an inconsistent state. (This is only latent because in reality
ExecSeqRestrPos is dead code at the moment ... but someday maybe it won't
be.) Add some comments about what the API for plan node mark/restore
actually is, because it's not immediately obvious.
When one side of the join has a NULL, we don't want to uselessly try
to match it against every remaining tuple of the other side. While
at it, rewrite the comparison machinery to avoid multiple evaluations
of the left and right input expressions and to use a btree comparator
where available, instead of double operator calls. Also revise the
state machine to eliminate redundant comparisons and hopefully make it
more readable too.
which is neither needed by nor related to that header. Remove the bogus
inclusion and instead include the header in those C files that actually
need it. Also fix unnecessary inclusions and bad inclusion order in
tsearch2 files.
startup to end, rather than re-opening it in each MultiExecBitmapIndexScan
call. I had foolishly thought that opening/closing wouldn't be much
more expensive than a rescan call, but that was sheer brain fade.
This seems to fix about half of the performance lossage reported by
Sergey Koposov. I'm still not sure where the other half went.
to produce when running the executor. This is consistent with the internal
executor APIs (such as ExecutorRun), which also use a long for this purpose.
It also allows FETCH_ALL to be passed -- since FETCH_ALL is defined as
LONG_MAX, this wouldn't have worked on platforms where int and long are of
different sizes. Per report from Tzahi Fadida.
only one argument. (Per recent discussion, the option to accept multiple
arguments is pretty useless for user-defined types, and would be a likely
source of security holes if it was used.) Simplify call sites of
output/send functions to not bother passing more than one argument.
to eliminate unnecessary deadlocks. This commit adds SELECT ... FOR SHARE
paralleling SELECT ... FOR UPDATE. The implementation uses a new SLRU
data structure (managed much like pg_subtrans) to represent multiple-
transaction-ID sets. When more than one transaction is holding a shared
lock on a particular row, we create a MultiXactId representing that set
of transactions and store its ID in the row's XMAX. This scheme allows
an effectively unlimited number of row locks, just as we did before,
while not costing any extra overhead except when a shared lock actually
has to be shared. Still TODO: use the regular lock manager to control
the grant order when multiple backends are waiting for a row lock.
Alvaro Herrera and Tom Lane.
node, as this behavior is now better done as a bitmap OR indexscan.
This allows considerable simplification in nodeIndexscan.c itself as
well as several planner modules concerned with indexscan plan generation.
Also we can improve the sharing of code between regular and bitmap
indexscans, since they are now working with nigh-identical Plan nodes.
but just to open and close it during MultiExecBitmapIndexScan. This
avoids acquiring duplicate resources (eg, multiple locks on the same
relation) in a tree with many bitmap scans. Also, don't bother to
lock the parent heap at all here, since we must be underneath a
BitmapHeapScan node that will be holding a suitable lock.
ExprContexts will be freed anyway when FreeExecutorState() is reached,
and letting that routine do the work is more efficient because it will
automatically free the ExprContexts in reverse creation order. The
existing coding was effectively freeing them in exactly the worst
possible order, resulting in O(N^2) behavior inside list_delete_ptr,
which becomes highly visible in cases with a few thousand plan nodes.
ExecFreeExprContext is now effectively a no-op and could be removed,
but I left it in place in case we ever want to put it back to use.
but the code is basically working. Along the way, rewrite the entire
approach to processing OR index conditions, and make it work in join
cases for the first time ever. orindxpath.c is now basically obsolete,
but I left it in for the time being to allow easy comparison testing
against the old implementation.
scans, using in-memory tuple ID bitmaps as the intermediary. The planner
frontend (path creation and cost estimation) is not there yet, so none
of this code can be executed. I have tested it using some hacked planner
code that is far too ugly to see the light of day, however. Committing
now so that the bulk of the infrastructure changes go in before the tree
drifts under me.
return just a single tuple at a time. Currently the only such node
type is Hash, but I expect we will soon have indexscans that can return
tuple bitmaps. A side benefit is that EXPLAIN ANALYZE now shows the
correct tuple count for a Hash node.
indexes. Replace all heap_openr and index_openr calls by heap_open
and index_open. Remove runtime lookups of catalog OID numbers in
various places. Remove relcache's support for looking up system
catalogs by name. Bulky but mostly very boring patch ...
indexes. Extend the macros in include/catalog/*.h to carry the info
about hand-assigned OIDs, and adjust the genbki script and bootstrap
code to make the relations actually get those OIDs. Remove the small
number of RelOid_pg_foo macros that we had in favor of a complete
set named like the catname.h and indexing.h macros. Next phase will
get rid of internal use of names for looking up catalogs and indexes;
but this completes the changes forcing an initdb, so it looks like a
good place to commit.
Along the way, I made the shared relations (pg_database etc) not be
'bootstrap' relations any more, so as to reduce the number of hardwired
entries and simplify changing those relations in future. I'm not
sure whether they ever really needed to be handled as bootstrap
relations, but it seems to work fine to not do so now.
ExecProcNode() with a NULL value, so the test couldn't do anything
for us except maybe mask bugs. Removing it probably doesn't save
anything much either, but then again this is a hot-spot routine.
few palloc's. I also chose to eliminate the restype and restypmod fields
entirely, since they are redundant with information stored in the node's
contained expression; re-examining the expression at need seems simpler
and more reliable than trying to keep restype/restypmod up to date.
initdb forced due to change in contents of stored rules.
old comment in the code claimed that this was necessary. Since it is not
actually necessary any more, it is clearer to remove the comment and
just return NULL instead -- the return value of ExecHash() is not used.
change saves a great deal of space in pg_proc and its primary index,
and it eliminates the former requirement that INDEX_MAX_KEYS and
FUNC_MAX_ARGS have the same value. INDEX_MAX_KEYS is still embedded
in the on-disk representation (because it affects index tuple header
size), but FUNC_MAX_ARGS is not. I believe it would now be possible
to increase FUNC_MAX_ARGS at little cost, but haven't experimented yet.
There are still a lot of vestigial references to FUNC_MAX_ARGS, which
I will clean up in a separate pass. However, getting rid of it
altogether would require changing the FunctionCallInfoData struct,
and I'm not sure I want to buy into that.
executing a statement that fires triggers. Formerly this time was
included in "Total runtime" but not otherwise accounted for.
As a side benefit, we avoid re-opening relations when firing non-deferred
AFTER triggers, because the trigger code can re-use the main executor's
ResultRelInfo data structure.
convention for isnull flags. Also, remove the useless InsertIndexResult
return struct from index AM aminsert calls --- there is no reason for
the caller to know where in the index the tuple was inserted, and we
were wasting a palloc cycle per insert to deliver this uninteresting
value (plus nontrivial complexity in some AMs).
I forced initdb because of the change in the signature of the aminsert
routines, even though nothing really looks at those pg_proc entries...
of tuples when passing data up through multiple plan nodes. A slot can now
hold either a normal "physical" HeapTuple, or a "virtual" tuple consisting
of Datum/isnull arrays. Upper plan levels can usually just copy the Datum
arrays, avoiding heap_formtuple() and possible subsequent nocachegetattr()
calls to extract the data again. This work extends Atsushi Ogawa's earlier
patch, which provided the key idea of adding Datum arrays to TupleTableSlots.
(I believe however that something like this was foreseen way back in Berkeley
days --- see the old comment on ExecProject.) A test case involving many
levels of join of fairly wide tables (about 80 columns altogether) showed
about 3x overall speedup, though simple queries will probably not be
helped very much.
I have also duplicated some code in heaptuple.c in order to provide versions
of heap_formtuple and friends that use "bool" arrays to indicate null
attributes, instead of the old convention of "char" arrays containing either
'n' or ' '. This provides a better match to the convention used by
ExecEvalExpr. While I have not made a concerted effort to get rid of uses
of the old routines, I think they should be deprecated and eventually removed.
a tuple are being accessed via ExecEvalVar and the attcacheoff shortcut
isn't usable (due to nulls and/or varlena columns). To do this, cache
Datums extracted from a tuple in the associated TupleTableSlot.
Also some code cleanup in and around the TupleTable handling.
Atsushi Ogawa with some kibitzing by Tom Lane.
can tell whether it is being used as an aggregate or not. This allows
such a function to avoid re-pallocing a pass-by-reference transition
value; normally it would be unsafe for a function to scribble on an input,
but in the aggregate case it's safe to reuse the old transition value.
Make int8inc() do this. This gets a useful improvement in the speed of
COUNT(*), at least on narrow tables (it seems to be swamped by I/O when
the table rows are wide). Per a discussion in early December with
Neil Conway. I also fixed int_aggregate.c to check this, thereby
turning it into something approaching a supportable technique instead
of being a crude hack.
Formerly, if such a clause contained no aggregate functions we mistakenly
treated it as equivalent to WHERE. Per spec it must cause the query to
be treated as a grouped query of a single group, the same as appearance
of aggregate functions would do. Also, the HAVING filter must execute
after aggregate function computation even if it itself contains no
aggregate functions.
on-the-fly, and thereby avoid blowing out memory when the planner has
underestimated the hash table size. Hash join will now obey the
work_mem limit with some faithfulness. Per my recent proposal
(hash aggregate part isn't done yet though).
that return tuples (such as EXPLAIN). Per gripe from Michael Fuhr.
Side effect: fix an old bug that unintentionally disabled backward scans
for all SPI-created cursors.
look at the actual aggregate transition datatypes and the actual overhead
needed by nodeAgg.c, instead of using pessimistic round numbers.
Per a discussion with Michael Tiemann.
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
of an inheritance child table is binary-compatible with the rowtype of
its parent, invent an expression node type that does the conversion
correctly. Fixes the new bug exhibited by Kris Shannon as well as a
lot of old bugs that would only show up when using multiple inheritance
or after altering the parent table.
We don't really want to start a new SPI connection, just keep using the old
one; otherwise we have memory management problems as illustrated by
John Kennedy's bug report of today. This requires a bit of a hack to
ensure the SPI stack state is properly restored, but then again what we
were doing before was a hack too, strictly speaking. Add a regression
test to cover this case.
returning a NULL pointer (some callers remembered to check the return
value, but some did not -- it is safer to just bail out).
Also, cleanup pgstat.c to use elog(ERROR) rather than elog(LOG) followed
by exit().
to make life cushy for the JDBC driver. Centralize the decision-making
that affects this by inventing a get_type_func_class() function, rather
than adding special cases in half a dozen places.
- remove another senseless "extern" keyword that was applied to a
function definition
- change a foo more function signatures from "some_type foo()" to
"some_type foo(void)"
- rewrite another K&R style function definition
- make the type of the "action" function pointer in the KeyWord struct
in src/backend/utils/adt/formatting.c more precise
columns. The returned tuple needs to have appropriate NULL columns
inserted so that it actually matches the declared rowtype. It seemed
convenient to use a JunkFilter for this, so I made some cleanups and
simplifications in the JunkFilter code to allow it to support this
additional functionality. (That in turn exposed a latent bug in
nodeAppend.c, which is that it was returning a tuple slot whose
descriptor didn't match its data.) Also, move check_sql_fn_retval
out of pg_proc.c and into functions.c, where it seems to more naturally
belong.
now are supposed to take some kind of lock on an index whenever you
are going to access the index contents, rather than relying only on a
lock on the parent table.
when a function that returns a single tuple (not a setof tuple) returns
NULL. This seems to be the most consistent behavior. It would have
taken a bit less code to make it return an empty table (zero rows) but
ISTM a non-SETOF function ought always return exactly one row. Per
bug report from Ivan-Sun1.
was large enough to be batched and the tuples fell into a batch where
there were no inner tuples at all. Thanks to Xiaoyu Wang for finding a
test case that exposed this long-standing bug.
as per recent discussions. Invent SubTransactionIds that are managed like
CommandIds (ie, counter is reset at start of each top transaction), and
use these instead of TransactionIds to keep track of subtransaction status
in those modules that need it. This means that a subtransaction does not
need an XID unless it actually inserts/modifies rows in the database.
Accordingly, don't assign it an XID nor take a lock on the XID until it
tries to do that. This saves a lot of overhead for subtransactions that
are only used for error recovery (eg plpgsql exceptions). Also, arrange
to release a subtransaction's XID lock as soon as the subtransaction
exits, in both the commit and abort cases. This avoids holding many
unique locks after a long series of subtransactions. The price is some
additional overhead in XactLockTableWait, but that seems acceptable.
Finally, restructure the state machine in xact.c to have a more orthogonal
set of states for subtransactions.
mode see a fresh snapshot for each command in the function, rather than
using the latest interactive command's snapshot. Also, suppress fresh
snapshots as well as CommandCounterIncrement inside STABLE and IMMUTABLE
functions, instead using the snapshot taken for the most closely nested
regular query. (This behavior is only sane for read-only functions, so
the patch also enforces that such functions contain only SELECT commands.)
As per my proposal of 6-Sep-2004; I note that I floated essentially the
same proposal on 19-Jun-2002, but that discussion tailed off without any
action. Since 8.0 seems like the right place to be taking possibly
nontrivial backwards compatibility hits, let's get it done now.
((Snapshot) NULL) can no longer be confused with a valid snapshot,
as per my recent suggestion. Define a macro InvalidSnapshot for 0.
Use InvalidSnapshot instead of SnapshotAny as the do-nothing special
case for heap_update and heap_delete crosschecks; this seems a little
cleaner even though the behavior is really the same.
rather than when returning to the idle loop. This makes no particular
difference for interactively-issued queries, but it makes a big difference
for queries issued within functions: trigger execution now occurs before
the calling function is allowed to proceed. This responds to numerous
complaints about nonintuitive behavior of foreign key checking, such as
http://archives.postgresql.org/pgsql-bugs/2004-09/msg00020.php, and
appears to be required by the SQL99 spec.
Also take the opportunity to simplify the data structures used for the
pending-trigger list, rename them for more clarity, and squeeze out a
bit of space.
to the physical layout of the rowtype, ie, there are dummy arguments
corresponding to any dropped columns in the rowtype. We formerly had a
couple of places that did it this way and several others that did not.
Fixes Gaetano Mendola's "cache lookup failed for type 0" bug of 5-Aug.
executed. Previously, the DECLARE would succeed but subsequent FETCHes
would fail since the parameter values supplied to DECLARE were not
propagated to the portal created for the cursor.
In support of this, add type Oids to ParamListInfo entries, which seems
like a good idea anyway since code that extracts a value can double-check
that it got the type of value it was expecting.
Oliver Jowett, with minor editorialization by Tom Lane.
Create a shared function to convert a SPI error code into a string
(replacing near-duplicate code in several PLs), and use it anywhere
that a SPI function call error is reported.
SAVEPOINT/RELEASE/ROLLBACK-TO syntax. (Alvaro)
Cause COMMIT of a failed transaction to report ROLLBACK instead of
COMMIT in its command tag. (Tom)
Fix a few loose ends in the nested-transactions stuff.
This is required by SQL spec to avoid failures in cases like
SELECT sum(win)/sum(lose) FROM ... GROUP BY ... HAVING sum(lose) > 0;
AFAICT we have gotten this wrong since day one. Kudos to Holger Jakobs
for being the first to notice.
for cleaning up. It seems possible that the memory contexts SPI_finish
would try to touch are already gone; and there's no need for SPI itself
to delete them, since the containing contexts will surely be going away
anyway at transaction end.
performance front, but with feature freeze upon us I think it's time to
drive a stake in the ground and say that this will be in 7.5.
Alvaro Herrera, with some help from Tom Lane.
There are various things left to do: contrib dbsize and oid2name modules
need work, and so does the documentation. Also someone should think about
COMMENT ON TABLESPACE and maybe RENAME TABLESPACE. Also initlocation is
dead, it just doesn't know it yet.
Gavin Sherry and Tom Lane.
until Bind is received, so that actual parameter values are visible to the
planner. Make use of the parameter values for estimation purposes (but
don't fold them into the actual plan). This buys back most of the
potential loss of plan quality that ensues from using out-of-line
parameters instead of putting literal values right into the query text.
This patch creates a notion of constant-folding expressions 'for
estimation purposes only', in which case we can be more aggressive than
the normal eval_const_expressions() logic can be. Right now the only
difference in behavior is inserting bound values for Params, but it will
be interesting to look at other possibilities. One that we've seen
come up repeatedly is reducing now() and related functions to current
values, so that queries like ... WHERE timestampcol > now() - '1 day'
have some chance of being planned effectively.
Oliver Jowett, with some kibitzing from Tom Lane.
As a side effect, cause subscripts in INSERT targetlists to do something
more or less sensible; previously we evaluated such subscripts and then
effectively ignored them. Another side effect is that UPDATE-ing an
element or slice of an array value that is NULL now produces a non-null
result, namely an array containing just the assigned-to positions.
of a composite type to get that type's OID as their second parameter,
in place of typelem which is useless. The actual changes are mostly
centralized in getTypeInputInfo and siblings, but I had to fix a few
places that were fetching pg_type.typelem for themselves instead of
using the lsyscache.c routines. Also, I renamed all the related variables
from 'typelem' to 'typioparam' to discourage people from assuming that
they necessarily contain array element types.
loop over the fields instead of a loop around heap_getattr. This is
considerably faster (O(N) instead of O(N^2)) when there are nulls or
varlena fields, since those prevent use of attcacheoff. Replace loops
over heap_getattr with heap_deformtuple in situations where all or most
of the fields have to be fetched, such as printtup and tuptoaster.
Profiling done more than a year ago shows that this should be a nice
win for situations involving many-column tables.
In the past, we used a 'Lispy' linked list implementation: a "list" was
merely a pointer to the head node of the list. The problem with that
design is that it makes lappend() and length() linear time. This patch
fixes that problem (and others) by maintaining a count of the list
length and a pointer to the tail node along with each head node pointer.
A "list" is now a pointer to a structure containing some meta-data
about the list; the head and tail pointers in that structure refer
to ListCell structures that maintain the actual linked list of nodes.
The function names of the list API have also been changed to, I hope,
be more logically consistent. By default, the old function names are
still available; they will be disabled-by-default once the rest of
the tree has been updated to use the new API names.
permissions tests in about the same amount of code as before. Exactly what
the GRANT/REVOKE code ought to be doing is still up for debate, but this
should be helpful in any case, and it already solves an efficiency problem
in executor startup.
rather than allowing them only in a few special cases as before. In
particular you can now pass a ROW() construct to a function that accepts
a rowtype parameter. Internal generation of RowExprs fixes a number of
corner cases that used to not work very well, such as referencing the
whole-row result of a JOIN or subquery. This represents a further step in
the work I started a month or so back to make rowtype values into
first-class citizens.
the next are handled by ReleaseAndReadBuffer rather than separate
ReleaseBuffer and ReadBuffer calls. This cuts the number of acquisitions
of the BufMgrLock by a factor of 2 (possibly more, if an indexscan happens
to pull successive rows from the same heap page). Unfortunately this
doesn't seem enough to get us out of the recently discussed context-switch
storm problem, but it's surely worth doing anyway.
'SELECT foo()' in a SQL function returning a rowtype, to simply pass
back the results of another function returning the same rowtype.
However, that hasn't actually worked in many years. Now it works again.
results with tuples as ordinary varlena Datums. This commit does not
in itself do much for us, except eliminate the horrid memory leak
associated with evaluation of whole-row variables. However, it lays the
groundwork for allowing composite types as table columns, and perhaps
some other useful features as well. Per my proposal of a few days ago.
is measured in kilobytes and checked against actual physical execution
stack depth, as per my proposal of 30-Dec. This gives us a fairly
bulletproof defense against crashing due to runaway recursive functions.
remove separate implementation of ALTER TABLE SET WITHOUT OIDS in favor
of doing a regular DROP. Also, cause CREATE TABLE to account completely
correctly for the inheritance status of the OID column. This fixes
problems with dropping OID columns that have dependencies, as noted by
Christopher Kings-Lynne, as well as making sure that you can't drop an
OID column that was inherited from a parent.
so that the 'val' is computed only once, per recent discussion. The
speedup is not much when 'val' is just a simple variable, but could be
significant for larger expressions. More importantly this avoids issues
with multiple evaluations of a volatile 'val', and it allows the CASE
expression to be reverse-listed in its original form by ruleutils.c.
directly to the appropriate per-node execution function, using a function
pointer stored by ExecInitExpr. This speeds things up by eliminating one
level of function call. The function-pointer technique also enables further
small improvements such as only making one-time tests once (and then
changing the function pointer). Overall this seems to gain about 10%
on evaluation of simple expressions, which isn't earthshaking but seems
a worthwhile gain for a relatively small hack. Per recent discussion
on pghackers.
7.4 rewrite for hashed aggregate support. If the transition data type
is pass-by-reference, the transValue must be pfreed when starting a new
group boundary, else we have a one-value-per-group leakage. Thanks to
Rae Steining for providing a reproducible test case.
+extern Oid SPI_getargtypeid(void *plan, int argIndex);
+extern int SPI_getargcount(void *plan);
+extern bool SPI_is_cursor_plan(void *plan);
Thomas Hallgren
Make btree index creation and initial validation of foreign-key constraints
use maintenance_work_mem rather than work_mem as their memory limit.
Add some code to guc.c to allow these variables to be referenced by their
old names in SHOW and SET commands, for backwards compatibility.
when scanning a table that we need all the columns from. In case of
SELECT INTO, we have to check that the hasoids flag matches the desired
output type, too. Per report from Mike Mascari.
for sure...). Rather than relying on the query context of a rangetable
entry to identify what permissions it wants checked, store a full AclMode
mask in each RTE, and check exactly those bits. This allows an RTE
specifying, say, INSERT privilege on a view to be copied into a derived
UPDATE query without changing meaning. Per recent discussion thread.
initdb forced due to change of stored rule representation.
intended to allow application authors to insulate themselves from
changes to the default value of 'default_with_oids' in future releases
of PostgreSQL.
This patch also fixes a bug in the earlier implementation of the
'default_with_oids' GUC variable: code in gram.y should not examine
the value of GUC variables directly due to synchronization issues.
pointer type when it is not necessary to do so.
For future reference, casting NULL to a pointer type is only necessary
when (a) invoking a function AND either (b) the function has no prototype
OR (c) the function is a varargs function.
regular qpqual ('filter condition'), add special-purpose code to
nodeIndexscan.c to recheck them. This ends being almost no net addition
of code, because the removal of planner code balances out the extra
executor code, but it is significantly more efficient when a lossy
operator is involved in an OR indexscan. The old implementation had
to recheck the entire indexqual in such cases.
about whether it is applied before or after eval_const_expressions().
I believe there were some corner cases where the system would fail to
recognize that a partial index is applicable because of the previous
inconsistency. Store normal rather than 'implicit AND' representations
of constraints and index predicates in the catalogs.
initdb forced due to representation change of constraints/predicates.
shut down cleanly if the plan node is ReScanned before the SRFs are run
to completion. This fixes the problem for SQL-language functions, but
still need work on functions using the SRF_XXX() macros.
proposal for eventually deprecating OIDs on user tables that I posted
earlier to pgsql-hackers. pg_dump now always specifies WITH OIDS or
WITHOUT OIDS when dumping a table. The documentation has been updated.
Neil Conway
the hashclauses field of the parent HashJoin. This avoids problems with
duplicated links to SubPlans in hash clauses, as per report from
Andrew Holm-Hansen.
pghackers proposal of 8-Nov. All the existing cross-type comparison
operators (int2/int4/int8 and float4/float8) have appropriate support.
The original proposal of storing the right-hand-side datatype as part of
the primary key for pg_amop and pg_amproc got modified a bit in the event;
it is easier to store zero as the 'default' case and only store a nonzero
when the operator is actually cross-type. Along the way, remove the
long-since-defunct bigbox_ops operator class.
Remove the 'strategy map' code, which was a large amount of mechanism
that no longer had any use except reverse-mapping from procedure OID to
strategy number. Passing the strategy number to the index AM in the
first place is simpler and faster.
This is a preliminary step in planned support for cross-datatype index
operations. I'm committing it now since the ScanKeyEntryInitialize()
API change touches quite a lot of files, and I want to commit those
changes before the tree drifts under me.
discussion on pgsql-hackers: in READ COMMITTED mode we just have to force
a QuerySnapshot update in the trigger, but in SERIALIZABLE mode we have
to run the scan under a current snapshot and then complain if any rows
would be updated/deleted that are not visible in the transaction snapshot.
to allow es_snapshot to be set to SnapshotNow rather than a query snapshot.
This solves a bug reported by Wade Klaver, wherein triggers fired as a
result of RI cascade updates could misbehave.
now able to cope with assigning new relfilenode values to nailed-in-cache
indexes, so they can be reindexed using the fully crash-safe method. This
leaves only shared system indexes as special cases. Remove the 'index
deactivation' code, since it provides no useful protection in the shared-
index case. Require reindexing of shared indexes to be done in standalone
mode, but remove other restrictions on REINDEX. -P (IgnoreSystemIndexes)
now prevents using indexes for lookups, but does not disable index updates.
It is therefore safe to allow from PGOPTIONS. Upshot: reindexing system catalogs
can be done without a standalone backend for all cases except
shared catalogs.
really general fix might be difficult, I believe the only case where
AtCommit_Notify could see an uncommitted tuple is where the other guy
has just unlistened and not yet committed. The best solution seems to
be to just skip updating that tuple, on the assumption that the other
guy does not want to hear about the notification anyway. This is not
perfect --- if the other guy rolls back his unlisten instead of committing,
then he really should have gotten this notify. But to do that, we'd have
to wait to see if he commits or not, or make UNLISTEN hold exclusive lock
on pg_listener until commit. Either of these answers is deadlock-prone,
not to mention horrible for interactive performance. Do it this way
for now. (What happened to that project to do LISTEN/NOTIFY in memory
with no table, anyway?)
handling many-way scans: instead of re-evaluating all prior indexscan
quals to see if a tuple has been fetched more than once, use a hash table
indexed by tuple CTID. But fall back to the old way if the hash table
grows to exceed SortMem.
as well as the hash function (formerly the comparison function was hardwired
as memcmp()). This makes it possible to eliminate the special-purpose
hashtable management code in execGrouping.c in favor of using dynahash to
manage tuple hashtables; which is a win because dynahash knows how to expand
a hashtable when the original size estimate was too small, whereas the
special-purpose code was too stupid to do that. (See recent gripe from
Stephan Szabo about poor performance when hash table size estimate is way
off.) Free side benefit: when using string_hash, the default comparison
function is now strncmp() instead of memcmp(). This should eliminate some
part of the overhead associated with larger NAMEDATALEN values.
be anything yielding an array of the proper kind, not only sub-ARRAY[]
constructs; do subscript checking at runtime not parse time. Also,
adjust array_cat to make array || array comply with the SQL99 spec.
Joe Conway
yet, though). Avoid using nth() to fetch tlist entries; provide a
common routine get_tle_by_resno() to search a tlist for a particular
resno. This replaces a couple uses of nth() and a dozen hand-coded
search loops. Also, replace a few uses of nth(length-1, list) with
llast().
error, if any input element is NULL. This is not what we ultimately want,
but until arrays can have NULL elements, it will have to do. Patch from
Joe Conway.
It also works to create a non-polymorphic aggregate from polymorphic
functions, should you want to do that. Regression test added, docs still
lacking. By Joe Conway, with some kibitzing from Tom Lane.
ANYELEMENT. The effect is to postpone typechecking of the function
body until runtime. Documentation is still lacking.
Original patch by Joe Conway, modified to postpone type checking
by Tom Lane.
'scalar op ALL (array)', where the operator is applied between the
lefthand scalar and each element of the array. The operator must
yield boolean; the result of the construct is the OR or AND of the
per-element results, respectively.
Original coding by Joe Conway, after an idea of Peter's. Rewritten
by Tom to keep the implementation strictly separate from subqueries.
comparison functions), replacing the highly bogus bitwise array_eq. Create
a btree index opclass for ANYARRAY --- it is now possible to create indexes
on array columns.
Arrange to cache the results of catalog lookups across multiple array
operations, instead of repeating the lookups on every call.
Add string_to_array and array_to_string functions.
Remove singleton_array, array_accum, array_assign, and array_subscript
functions, since these were for proof-of-concept and not intended to become
supported functions.
Minor adjustments to behavior in some corner cases with empty or
zero-dimensional arrays.
Joe Conway (with some editorializing by Tom Lane).
specific hash functions used by hash indexes, rather than the old
not-datatype-aware ComputeHashFunc routine. This makes it safe to do
hash joining on several datatypes that previously couldn't use hashing.
The sets of datatypes that are hash indexable and hash joinable are now
exactly the same, whereas before each had some that weren't in the other.
extensions to support our historical behavior. An aggregate belongs
to the closest query level of any of the variables in its argument,
or the current query level if there are no variables (e.g., COUNT(*)).
The implementation involves adding an agglevelsup field to Aggref,
and treating outer aggregates like outer variables at planning time.
when the plan is ReScanned, we don't have to rebuild the hash table
if there is no parameter change for its child node. This idea has
been used for a long time in Sort and Material nodes, but was not in
the hash code till now.
introducing new 'FastList' list-construction subroutines to use in
hot spots. This avoids the O(N^2) behavior of repeated lappend's
by keeping a tail pointer, while not changing behavior by reversing
list order as the lcons() method would do.
of an index can now be a computed expression instead of a simple variable.
Restrictions on expressions are the same as for predicates (only immutable
functions, no sub-selects). This fixes problems recently introduced with
inlining SQL functions, because the inlining transformation is applied to
both expression trees so the planner can still match them up. Along the
way, improve efficiency of handling index predicates (both predicates and
index expressions are now cached by the relcache) and fix 7.3 oversight
that didn't record dependencies of predicate expressions.
handle multiple 'formats' for data I/O. Restructure CommandDest and
DestReceiver stuff one more time (it's finally starting to look a bit
clean though). Code now matches latest 3.0 protocol document as far
as message formats go --- but there is no support for binary I/O yet.
DestReceiver pointers instead of just CommandDest values. The DestReceiver
is made at the point where the destination is selected, rather than
deep inside the executor. This cleans up the original kluge implementation
of tstoreReceiver.c, and makes it easy to support retrieving results
from utility statements inside portals. Thus, you can now do fun things
like Bind and Execute a FETCH or EXPLAIN command, and it'll all work
as expected (e.g., you can Describe the portal, or use Execute's count
parameter to suspend the output partway through). Implementation involves
stuffing the utility command's output into a Tuplestore, which would be
kind of annoying for huge output sets, but should be quite acceptable
for typical uses of utility commands.
the column by table OID and column number, if it's a simple column
reference. Along the way, get rid of reskey/reskeyop fields in Resdoms.
Turns out that representation was not convenient for either the planner
or the executor; we can make the planner deliver exactly what the
executor wants with no more effort.
initdb forced due to change in stored rule representation.
which does the same thing. Perhaps at one time there was a reason to
allow plan nodes to store their result types in different places, but
AFAICT that's been unnecessary for a good while.
Both plannable queries and utility commands are now always executed
within Portals, which have been revamped so that they can handle the
load (they used to be good only for single SELECT queries). Restructure
code to push command-completion-tag selection logic out of postgres.c,
so that it won't have to be duplicated between simple and extended queries.
initdb forced due to addition of a field to Query nodes.
that the types of untyped string-literal constants are deduced (ie,
when coerce_type is applied to 'em, that's what the type must be).
Remove the ancient hack of storing the input Param-types array as a
global variable, and put the info into ParseState instead. This touches
a lot of files because of adjustment of routine parameter lists, but
it's really not a large patch. Note: PREPARE statement still insists on
exact specification of parameter types, but that could easily be relaxed
now, if we wanted to do so.
I had inadvertently omitted it while rearranging things to support
length-counted incoming messages. Also, change the parser's API back
to accepting a 'char *' query string instead of 'StringInfo', as the
latter wasn't buying us anything except overhead. (I think when I put
it in I had some notion of making the parser API 8-bit-clean, but
seeing that flex depends on null-terminated input, that's not really
ever gonna happen.)
rewritten and the protocol is changed, but most elog calls are still
elog calls. Also, we need to contemplate mechanisms for controlling
all this functionality --- eg, how much stuff should appear in the
postmaster log? And what API should libpq expose for it?
expressions, ARRAY(sub-SELECT) expressions, some array functions.
Polymorphic functions using ANYARRAY/ANYELEMENT argument and return
types. Some regression tests in place, documentation is lacking.
Joe Conway, with some kibitzing from Tom Lane.
(materialization into a tuple store) discussed on pgsql-hackers earlier.
I've updated the documentation and the regression tests.
Notes on the implementation:
- I needed to change the tuple store API slightly -- it assumes that it
won't be used to hold data across transaction boundaries, so the temp
files that it uses for on-disk storage are automatically reclaimed at
end-of-transaction. I added a flag to tuplestore_begin_heap() to control
this behavior. Is changing the tuple store API in this fashion OK?
- in order to store executor results in a tuple store, I added a new
CommandDest. This works well for the most part, with one exception: the
current DestFunction API doesn't provide enough information to allow the
Executor to store results into an arbitrary tuple store (where the
particular tuple store to use is chosen by the call site of
ExecutorRun). To workaround this, I've temporarily hacked up a solution
that works, but is not ideal: since the receiveTuple DestFunction is
passed the portal name, we can use that to lookup the Portal data
structure for the cursor and then use that to get at the tuple store the
Portal is using. This unnecessarily ties the Portal code with the
tupleReceiver code, but it works...
The proper fix for this is probably to change the DestFunction API --
Tom suggested passing the full QueryDesc to the receiveTuple function.
In that case, callers of ExecutorRun could "subclass" QueryDesc to add
any additional fields that their particular CommandDest needed to get
access to. This approach would work, but I'd like to think about it for
a little bit longer before deciding which route to go. In the mean time,
the code works fine, so I don't think a fix is urgent.
- (semi-related) I added a NO SCROLL keyword to DECLARE CURSOR, and
adjusted the behavior of SCROLL in accordance with the discussion on
-hackers.
- (unrelated) Cleaned up some SGML markup in sql.sgml, copy.sgml
Neil Conway
utility statement (DeclareCursorStmt) with a SELECT query dangling from
it, rather than a SELECT query with a few unusual fields in it. Add
code to determine whether a planned query can safely be run backwards.
If DECLARE CURSOR specifies SCROLL, ensure that the plan can be run
backwards by adding a Materialize plan node if it can't. Without SCROLL,
you get an error if you try to fetch backwards from a cursor that can't
handle it. (There is still some discussion about what the exact
behavior should be, but this is necessary infrastructure in any case.)
Along the way, make EXPLAIN DECLARE CURSOR work.
entire contents of the subplan into the tuplestore before we can return
any tuples. Instead, the tuplestore holds what we've already read, and
we fetch additional rows from the subplan as needed. Random access to
the previously-read rows works with the tuplestore, and doesn't affect
the state of the partially-read subplan. This is a step towards fixing
the problems with cursors over complex queries --- we don't want to
stick in Materialize nodes if they'll prevent quick startup for a cursor.
rid of the assumption that sizeof(Oid)==sizeof(int). This is one small
step towards someday supporting 8-byte OIDs. For the moment, it doesn't
do much except get rid of a lot of unsightly casts.
locParam lists can be converted to bitmapsets to speed updating. Also,
replace 'locParam' with 'allParam', which contains all the paramIDs
relevant to the node (i.e., the union of extParam and locParam); this
saves a step during SetChangedParamList() without costing anything
elsewhere.
startup, not in the parser; this allows ALTER DOMAIN to work correctly
with domain constraint operations stored in rules. Rod Taylor;
code review by Tom Lane.
nodes where it's not really necessary. In many cases where the scan node
is not the topmost plan node (eg, joins, aggregation), it's possible to
just return the table tuple directly instead of generating an intermediate
projection tuple. In preliminary testing, this reduced the CPU time
needed for 'SELECT COUNT(*) FROM foo' by about 10%.
Try to model the effect of rescanning input tuples in mergejoins;
account for JOIN_IN short-circuiting where appropriate. Also, recognize
that mergejoin and hashjoin clauses may now be more than single operator
calls, so we have to charge appropriate execution costs.
that's selecting into a RECORD variable returns zero rows, make it
assign an all-nulls row to the RECORD; this is consistent with what
happens when the SELECT INTO target is not a RECORD. In support of
this, tweak the SPI code so that a valid tuple descriptor is returned
even when a SPI select returns no rows.
There are two implementation techniques: the executor understands a new
JOIN_IN jointype, which emits at most one matching row per left-hand row,
or the result of the IN's sub-select can be fed through a DISTINCT filter
and then joined as an ordinary relation.
Along the way, some minor code cleanup in the optimizer; notably, break
out most of the jointree-rearrangement preprocessing in planner.c and
put it in a new file prep/prepjointree.c.
Simplify SubLink by storing just a List of operator OIDs, instead of
a list of incomplete OpExprs --- that was a bizarre and bulky choice,
with no redeeming social value since we have to build new OpExprs
anyway when forming the plan tree.
'NOT (x IN (subselect))', that is 'NOT (x = ANY (subselect))',
rather than 'x <> ALL (subselect)' as we formerly did. This
opens the door to optimizing NOT IN the same way as IN, whereas
there's no hope of optimizing the expression using <>. Also,
convert 'x <> ALL (subselect)' to the NOT(IN) style, so that
the optimization will be available when processing rules dumped
by older Postgres versions.
initdb forced due to small change in SubLink node representation.
computation: reduce the bucket number mod nbatch. This changes the
association between original bucket numbers and batches, but that
doesn't matter. Minor other cleanups in hashjoin code to help
centralize decisions.
given any malloc block until something is first allocated in it; but
thereafter, MemoryContextReset won't release that first malloc block.
This preserves the quick-reset property of the original policy, without
forcing 8K to be allocated to every context whether any of it is ever
used or not. Also, remove some more no-longer-needed explicit freeing
during ExecEndPlan.
a per-query memory context created by CreateExecutorState --- and destroyed
by FreeExecutorState. This provides a final solution to the longstanding
problem of memory leaked by various ExecEndNode calls.
in the planned representation of a subplan at all any more, only SubPlan.
This means subselect.c doesn't scribble on its input anymore, which seems
like a good thing; and there are no longer three different possible
interpretations of a SubLink. Simplify node naming and improve comments
in primnodes.h. No change to stored rules, though.
execution state trees, and ExecEvalExpr takes an expression state tree
not an expression plan tree. The plan tree is now read-only as far as
the executor is concerned. Next step is to begin actually exploiting
this property.
make VALUE a non-reserved word again, use less invasive method of passing
ConstraintTestValue into transformExpr, fix problems with nested constraint
testing, do correct thing with NULL result from a constraint expression,
remove memory leak. Domain checks still need much more work if we are going
to allow ALTER DOMAIN, however.
so that all executable expression nodes inherit from a common supertype
Expr. This is somewhat of an exercise in code purity rather than any
real functional advance, but getting rid of the extra Oper or Func node
formerly used in each operator or function call should provide at least
a little space and speed improvement.
initdb forced by changes in stored-rules representation.
to plan nodes, not vice-versa. All executor state nodes now inherit from
struct PlanState. Copying of plan trees has been simplified by not
storing a list of SubPlans in Plan nodes (eliminating duplicate links).
The executor still needs such a list, but it can build it during
ExecutorStart since it has to scan the plan tree anyway.
No initdb forced since no stored-on-disk structures changed, but you
will need a full recompile because of node-numbering changes.
well as function calls. This is needed for cases where the planner has
constant-folded or inlined the original function call. Possibly we should
back-patch this change into 7.3 branch as well.
logic, dissuade planner from thinking that 'x IS DISTINCT FROM 42' may
be optimized into 'x = 42' (!!), cause dependency on = operator to be
recorded correctly, minor other improvements.
instead of only one. This should speed up planning (only one hash path
to consider for a given pair of relations) as well as allow more effective
hashing, when there are multiple hashable joinclauses.
operations: make sure we use operators that are compatible, as determined
by a mergejoin link in pg_operator. Also, add code to planner to ensure
we don't try to use hashed grouping when the grouping operators aren't
marked hashable.
sublink results and COPY's domain constraint checking. A Const that
isn't really constant is just a Bad Idea(tm). Remove hacks in
parse_coerce and other places that were needed because of the former
klugery.
-hackers a couple days ago.
Notes/caveats:
- added regression tests for the new functionality, all
regression tests pass on my machine
- added pg_dump support
- updated PL/PgSQL to support per-statement triggers; didn't
look at the other procedural languages.
- there's (even) more code duplication in trigger.c than there
was previously. Any suggestions on how to refactor the
ExecXXXTriggers() functions to reuse more code would be
welcome -- I took a brief look at it, but couldn't see an
easy way to do it (there are several subtly-different
versions of the code in question)
- updated the documentation. I also took the liberty of
removing a big chunk of duplicated syntax documentation in
the Programmer's Guide on triggers, and moving that
information to the CREATE TRIGGER reference page.
- I also included some spelling fixes and similar small
cleanups I noticed while making the changes. If you'd like
me to split those into a separate patch, let me know.
Neil Conway
one more row from the subplan than the COUNT would appear to require.
This costs a little more logic but a number of people have complained
about the old implementation.
of groups produced by GROUP BY. This improves the accuracy of planning
estimates for grouped subselects, and is needed to check whether a
hashed aggregation plan risks memory overflow.
before commit, not after :-( --- the original coding is not only unsafe
if an error occurs while it's processing, but it generates an invalid
sequence of WAL entries. Resurrect 7.2 logic for deleting items when
no longer needed. Use an enum instead of random macros. Editorialize
on names used for routines and constants. Teach backend/nodes routines
about new field in CreateTable struct. Add a regression test.
node now does its own grouping of the input rows, and has no need for a
preceding GROUP node in the plan pipeline. This allows elimination of
the misnamed tuplePerGroup option for GROUP, and actually saves more code
in nodeGroup.c than it costs in nodeAgg.c, as well as being presumably
faster. Restructure the API of query_planner so that we do not commit to
using a sorted or unsorted plan in query_planner; instead grouping_planner
makes the decision. (Right now it isn't any smarter than query_planner
was, but that will change as soon as it has the option to select a hash-
based aggregation step.) Despite all the hackery, no initdb needed since
only in-memory node types changed.
command status at the interactive level. SPI_processed, etc are set
in the same way as the returned command status would have been set if
the same querystring were issued interactively. Per gripe from
Michael Paesold 25-Sep-02.
query that uses it. This ensures that triggers will be applied consistently
throughout a query even if someone commits changes to the relation's
pg_class.reltriggers field meanwhile. Per crash report from Laurette Cisneros.
While at it, simplify memory management in relcache.c, which no longer
needs the old hack to try to keep trigger info in the same place over
a relcache entry rebuild. (Should try to fix rd_att and rewrite-rule
access similarly, someday.) And make RelationBuildTriggers simpler and
more robust by making it build the trigdesc in working memory and then
CopyTriggerDesc() into cache memory.
just the significant fields of FunctionCallInfoData, rather than MemSet'ing
the whole struct to zero. Unused positions in the arg[] array will
thereby contain garbage rather than zeroes. This buys back some of the
performance hit from increasing FUNC_MAX_ARGS. Also tweak tuplesort.c
code for more speed by marking some routines 'inline'. All together
these changes speed up simple sorts, like count(distinct int4column),
by about 25% on a P4 running RH Linux 7.2.
executor should not return the tuple as successfully marked, because in
fact it's been deleted. Not clear that this case has ever been seen
in practice (I think you'd have to write a SELECT FOR UPDATE that calls
a function that deletes some row the SELECT will visit later...) but we
should be consistent. Also add comments to several other places that
got it right but didn't explain what they were doing.
to be flexible about assignment casts without introducing ambiguity in
operator/function resolution. Introduce a well-defined promotion hierarchy
for numeric datatypes (int2->int4->int8->numeric->float4->float8).
Change make_const to initially label numeric literals as int4, int8, or
numeric (never float8 anymore).
Explicitly mark Func and RelabelType nodes to indicate whether they came
from a function call, explicit cast, or implicit cast; use this to do
reverse-listing more accurately and without so many heuristics.
Explicit casts to char, varchar, bit, varbit will truncate or pad without
raising an error (the pre-7.2 behavior), while assigning to a column without
any explicit cast will still raise an error for wrong-length data like 7.3.
This more nearly follows the SQL spec than 7.2 behavior (we should be
reporting a 'completion condition' in the explicit-cast cases, but we have
no mechanism for that, so just do silent truncation).
Fix some problems with enforcement of typmod for array elements;
it didn't work at all in 'UPDATE ... SET array[n] = foo', for example.
Provide a generalized array_length_coerce() function to replace the
specialized per-array-type functions that used to be needed (and were
missing for NUMERIC as well as all the datetime types).
Add missing conversions int8<->float4, text<->numeric, oid<->int8.
initdb forced.
(overlaying low byte of page size) and add HEAP_HASOID bit to t_infomask,
per earlier discussion. Simplify scheme for overlaying fields in tuple
header (no need for cmax to live in more than one place). Don't try to
clear infomask status bits in tqual.c --- not safe to do it there. Don't
try to force output table of a SELECT INTO to have OIDs, either. Get rid
of unnecessarily complex three-state scheme for TupleDesc.tdhasoids, which
has already caused one recent failure. Improve documentation.
type for runtime constraint checks, instead of misusing the parse-time
Constraint node for the purpose. Fix some damage introduced into type
coercion logic; in particular ensure that a coerced expression tree will
read out the correct result type when inspected (patch had broken some
RelabelType cases). Enforce domain NOT NULL constraints against columns
that are omitted from an INSERT.
to the table function, thus preventing memory leakage accumulation across
calls. This means that SRFs need to be careful to distinguish permanent
and local storage; adjust code and documentation accordingly. Patch by
Joe Conway, very minor tweaks by Tom Lane.
should be pretty safe in practice, but it's probably better to be safe
than sorry.
I was actually looking for cases where NAMEDATALEN is assumed to be
32, but only found one. That's fixed too, as well as a few bits of
code cleanup.
Neil Conway
array header, and to compute sizing and alignment of array elements
the same way normal tuple access operations do --- viz, using the
tupmacs.h macros att_addlength and att_align. This makes the world
safe for arrays of cstrings or intervals, and should make it much
easier to write array-type-polymorphic functions; as examples see
the cleanups of array_out and contrib/array_iterator. By Joe Conway
and Tom Lane.
value '-2' is used to indicate a variable-width type whose width is
computed as strlen(datum)+1. Everything that looks at typlen is updated
except for array support, which Joe Conway is working on; at the moment
it wouldn't work to try to create an array of cstring.
> There's no longer a separate call to heap_storage_create in that routine
> --- the right place to make the test is now in the storage_create
> boolean parameter being passed to heap_create. A simple change, but
> it passeth patch's understanding ...
Thanks.
Attached is a patch against cvs tip as of 8:30 PM PST or so. Turned out
that even after fixing the failed hunks, there was a new spot in
bufmgr.c which needed to be fixed (related to temp relations;
RelationUpdateNumberOfBlocks). But thankfully the regression test code
caught it :-)
Joe Conway
The local buffer manager is no longer used for newly-created relations
(unless they are TEMP); a new non-TEMP relation goes through the shared
bufmgr and thus will participate normally in checkpoints. But TEMP relations
use the local buffer manager throughout their lifespan. Also, operations
in TEMP relations are not logged in WAL, thus improving performance.
Since it's no longer necessary to fsync relations as they move out of the
local buffers into shared buffers, quite a lot of smgr.c/md.c/fd.c code
is no longer needed and has been removed: there's no concept of a dirty
relation anymore in md.c/fd.c, and we never fsync anything but WAL.
Still TODO: improve local buffer management algorithms so that it would
be reasonable to increase NLocBuffer.
of functions returning domain types, update documentation for typtype,
move get_typtype to lsyscache.c (actually, resurrect the old version),
add defense against creating pseudo-typed table columns, fix some
bogus list-parsing in grammar. Issues remain with respect to alias
handling and type checking; Joe is on those.
types for Table Functions, as previously proposed on HACKERS. Here is a
brief explanation:
1. Creates a new pg_type typtype: 'p' for pseudo type (currently either
'b' for base or 'c' for catalog, i.e. a class).
2. Creates new builtin type of typtype='p' named RECORD. This is the
first of potentially several pseudo types.
3. Modify FROM clause grammer to accept:
SELECT * FROM my_func() AS m(colname1 type1, colname2 type1, ...)
where m is the table alias, colname1, etc are the column names, and
type1, etc are the column types.
4. When typtype == 'p' and the function return type is RECORD, a list
of column defs is required, and when typtype != 'p', it is
disallowed.
5. A check was added to ensure that the tupdesc provide via the parser
and the actual return tupdesc match in number and type of
attributes.
When creating a function you can do:
CREATE FUNCTION foo(text) RETURNS setof RECORD ...
When using it you can do:
SELECT * from foo(sqlstmt) AS (f1 int, f2 text, f3 timestamp)
or
SELECT * from foo(sqlstmt) AS f(f1 int, f2 text, f3 timestamp)
or
SELECT * from foo(sqlstmt) f(f1 int, f2 text, f3 timestamp)
Included in the patches are adjustments to the regression test sql and
expected files, and documentation.
p.s.
This potentially solves (or at least improves) the issue of builtin
Table Functions. They can be bootstrapped as returning RECORD, and
we can wrap system views around them with properly specified column
defs. For example:
CREATE VIEW pg_settings AS
SELECT s.name, s.setting
FROM show_all_settings()AS s(name text, setting text);
Then we can also add the UPDATE RULE that I previously posted to
pg_settings, and have pg_settings act like a virtual table, allowing
settings to be queried and set.
Joe Conway
ERROR: ExecInsert: rejected due to CHECK constraint insert_con
To be like this:
ERROR: ExecInsert: rejected due to CHECK constraint "insert_con" on
"insert_tbl"
Updated regression tests to match.
I got sick of seeing 'rejected due to CHECK constraint "$1" in my log and
not being able to find the bug in our website code...
Christopher Kings-Lynne
> submitted on July 9:
>
> http://archives.postgresql.org/pgsql-patches/2002-07/msg00056.php
>
> Please disregard that one *if* this one is applied. If this one is
> rejected please go ahead with the July 9th patch.
The July 9th Table Function API patch mentioned above is now in CVS, so
here is an updated version of the guc patch which should apply cleanly
against CVS tip.
Joe Conway
bitmap, if present).
Per Tom Lane's suggestion the information whether a tuple has an oid
or not is carried in the tuple descriptor. For debugging reasons
tdhasoid is of type char, not bool. There are predefined values for
WITHOID, WITHOUTOID and UNDEFOID.
This patch has been generated against a cvs snapshot from last week
and I don't expect it to apply cleanly to current sources. While I
post it here for public review, I'm working on a new version against a
current snapshot. (There's been heavy activity recently; hope to
catch up some day ...)
This is a long patch; if it is too hard to swallow, I can provide it
in smaller pieces:
Part 1: Accessor macros
Part 2: tdhasoid in TupDesc
Part 3: Regression test
Part 4: Parameter withoid to heap_addheader
Part 5: Eliminate t_oid from HeapTupleHeader
Part 2 is the most hairy part because of changes in the executor and
even in the parser; the other parts are straightforward.
Up to part 4 the patched postmaster stays binary compatible to
databases created with an unpatched version. Part 5 is small (100
lines) and finally breaks compatibility.
Manfred Koizar
Implements between (symmetric / asymmetric) as a node.
Executes the left or right expression once, makes a Const out of the
resulting Datum and executes the >=, <= portions out of the Const sets.
Of course, the parser does a fair amount of preparatory work for this to
happen.
Rod Taylor
Conway (BuildTupleFromCStrings sets NULL for pass-by-value types when
intended value is 0). It also implements some other improvements
suggested by Neil.
Joe Conway
are managed as per request.
Moved from merging with table attributes to applying themselves during
coerce_type() and coerce_type_typmod.
Regression tests altered to test the cast() scenarios.
Rod Taylor
Reused the Expr node to hold DISTINCT which strongly resembles
the existing OP info. Define DISTINCT_EXPR which strongly resembles
the existing OPER_EXPR opType, but with handling for NULLs required
by SQL99.
We have explicit support for single-element DISTINCT comparisons
all the way through to the executor. But, multi-element DISTINCTs
are handled by expanding into a comparison tree in gram.y as is done for
other row comparisons. Per discussions, it might be desirable to move
this into one or more purpose-built nodes to be handled in the backend.
Define the optional ROW keyword and token per SQL99.
This allows single-element row constructs, which were formerly disallowed
due to shift/reduce conflicts with parenthesized a_expr clauses.
Define the SQL99 TREAT() function. Currently, use as a synonym for CAST().