When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.
This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.
Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
When default_text_search_config, default_tablespace, or temp_tablespaces
setting is set per-user or per-database, with an "ALTER USER/DATABASE SET
..." statement, don't throw an error if the text search configuration or
tablespace does not exist. In case of text search configuration, even if
it doesn't exist in the current database, it might exist in another
database, where the setting is intended to have its effect. This behavior
is now the same as search_path's.
Tablespaces are cluster-wide, so the same argument doesn't hold for
tablespaces, but there's a problem with pg_dumpall: it dumps "ALTER USER
SET ..." statements before the "CREATE TABLESPACE" statements. Arguably
that's pg_dumpall's fault - it should dump the statements in such an order
that the tablespace is created first and then the "ALTER USER SET
default_tablespace ..." statements after that - but it seems better to be
consistent with search_path and default_text_search_config anyway. Besides,
you could still create a dump that throws an error, by creating the
tablespace, running "ALTER USER SET default_tablespace", then dropping the
tablespace and running pg_dumpall on that.
Backpatch to all supported versions.
btcostestimate() makes an estimate of the number of index tuples that will
be visited based on knowledge of which index clauses can actually bound the
scan within nbtree. However, it forgot to account for partial indexes in
this calculation, with the result that the cost of the index scan could be
significantly overestimated for a partial index. Fix that by merging the
predicate with the abbreviated indexclause list, in the same way as we do
with the full list to estimate how many heap tuples will be visited.
Also, slightly increase the "fudge factor" that's meant to give preference
to smaller indexes over larger ones. While this is applied to all indexes,
it's most important for partial indexes since it can be the only factor
that makes a partial index look cheaper than a similar full index.
Experimentation shows that the existing value is so small as to easily get
swamped by noise such as page-boundary-roundoff behavior. I'm tempted to
kick it up more than this, but will refrain for now.
Per report from Ruben Blanco. These are long-standing issues, but given
the lack of prior complaints I'm not going to risk changing planner
behavior in back branches by back-patching.
In commit 57664ed25e, I made the planner
wrap non-simple-variable outputs of appendrel children (IOW, child SELECTs
of UNION ALL subqueries) inside PlaceHolderVars, in order to solve some
issues with EquivalenceClass processing. However, this means that any
upper-level WHERE clauses mentioning such outputs will now contain
PlaceHolderVars after they're pushed down into the appendrel child,
and that prevents indxpath.c from recognizing that they could be matched
to index expressions. To fix, add explicit stripping of PlaceHolderVars
from index operands, same as we have long done for RelabelType nodes.
Add a regression test covering both this and the plain-UNION case (which
is a totally different code path, but should also be able to do it).
Per bug #6416 from Matteo Beccati. Back-patch to 9.1, same as the
previous change.
Formerly we passed an empty list to each per-child-table invocation of
grouping_planner, and then merged the results into the global list.
However, that fails if there's a CTE attached to the statement, because
create_ctescan_plan uses the list to find the plan referenced by a CTE
reference; so it was unable to find any CTEs attached to the outer UPDATE
or DELETE. But there's no real reason not to use the same list throughout
the process, and doing so is simpler and faster anyway.
Per report from Josh Berkus of "could not find plan for CTE" failures.
Back-patch to 9.1 where we added support for WITH attached to UPDATE or
DELETE. Add some regression test cases, too.
Much more could be done here, but at least now we have *some* automated
test coverage of that mechanism. In particular this tests the writable-CTE
case reported by Phil Sorber.
In passing, remove isolationtester's arbitrary restriction on the number of
steps in a permutation list. I used this so that a single spec file could
be used to run several related test scenarios, but there are other possible
reasons to want a step series that's not exactly a permutation. Improve
documentation and fix a couple other nits as well.
We can't just skip initializing such subplans, because the referencing CTE
node will expect to find the subplan available when it initializes. That
in turn means that ExecInitModifyTable must allow the case (which actually
it needed to do anyway, since there's no guarantee that ModifyTable is
exactly at the top of the CTE plan tree). So move the complaint about not
being allowed in EvalPlanQual mode to execution instead of initialization.
Testing turned up yet another problem, which is that we'd try to
re-initialize the result relation's index list, leading to leaks and
dangling pointers.
Per report from Phil Sorber. Back-patch to 9.1 where data-modifying CTEs
were introduced.
After the planner was fixed to convert some IN/EXISTS subqueries into
semijoins or antijoins, we had to prevent it from doing that in some
cases where the plans risked getting much worse. The reason the plans
got worse was that in the unoptimized implementation, subqueries could
reference parameters from the outer query at any join level, and so
full table scans could be avoided even if they were one or more levels
of join below where the semi/anti join would be. Now that we have
sufficient mechanism in the planner to handle such cases properly,
it should no longer be necessary to play dumb here.
This reverts commits 07b9936a0f and
cd1f0d04bf. The latter was a stopgap
fix that wasn't really sufficiently analyzed at the time. Rather
than just restricting ourselves to cases where the new join can be
stacked on the right-hand input, we should also consider whether it
can be stacked on the left-hand input.
This patch fixes the planner so that it can generate nestloop-with-
inner-indexscan plans even with one or more levels of joining between
the indexscan and the nestloop join that is supplying the parameter.
The executor was fixed to handle such cases some time ago, but the
planner was not ready. This should improve our plans in many situations
where join ordering restrictions formerly forced complete table scans.
There is probably a fair amount of tuning work yet to be done, because
of various heuristics that have been added to limit the number of
parameterized paths considered. However, we are not going to find out
what needs to be adjusted until the code gets some real-world use, so
it's time to get it in there where it can be tested easily.
Note API change for index AM amcostestimate functions. I'm not aware of
any non-core index AMs, but if there are any, they will need minor
adjustments.
Hitherto, the information schema only showed explicitly granted
privileges that were visible in the *acl catalog columns. If no
privileges had been granted, the implicit privileges were not shown.
To fix that, add an SQL-accessible version of the acldefault()
function, and use that inside the aclexplode() calls to substitute the
catalog-specific default privilege set for null values.
reviewed by Abhijit Menon-Sen
In e5e2fc842c, blank lines were removed
after a comment block, which now looks as though the comment refers to
the immediately following code, but it actually refers to the
preceding code. So put the blank lines back.
This has been the behavior already in most cases, but through
omission, ALTER DOMAIN / OWNER TO and ALTER DOMAIN / SET SCHEMA would
silently work on non-domain types as well.
Those fields only appear in the structs so that genbki.pl can create
the BKI bootstrap files for the catalogs. But they are not actually
usable from C. So hiding them can prevent coding mistakes, saves
stack space, and can help the compiler.
In certain catalogs, the first variable-length field has been kept
visible after manual inspection. These exceptions are noted in C
comments.
reviewed by Tom Lane
Normally, accessing variable-length members of catalog structures past
the first one doesn't work at all. Here, it happened to work because
indnatts was checked to be 1, and so the defined FormData_pg_index
layout, using int2vector[1] and oidvector[1] for variable-length
arrays, happened to match the actual memory layout. But it's a very
fragile assumption, and it's not in a performance-critical path, so
code it properly using heap_getattr() instead.
bug analysis by Tom Lane
Parallel dump will need to repeat these steps for each new connection,
so it's better to have this logic in its own function.
Extracted (with some changes) from a much larger patch
by Joachim Wieland.
Our own qsort_arg() implementation doesn't have the defect previously
observed to affect only QNX 4, so it seems sufficiently to assert that
it isn't broken rather than retesting. Also, update a few comments to
clarify why it's valuable to retain a tie-break rule based on CTID
during index builds.
Peter Geoghegan, with slight tweaks by me.
We now use the same error message for ALTER TABLE .. ADD COLUMN or
ALTER TABLE .. RENAME COLUMN that we do for CREATE TABLE. The old
message was accurate, but might be confusing to users not aware of our
system columns.
Vik Reykja, with some changes by me, and further proofreading by Tom Lane
To make it wake up promptly when activity starts again, backends nudge it
by setting a latch in MarkBufferDirty(). The latch is kept set while
bgwriter is active, so there is very little overhead from that when the
system is busy. It is only armed before going into longer sleep.
Peter Geoghegan, with some changes by me.
This doesn't do anything useful just yet, but is intended as supporting
infrastructure for allowing sepgsql to sensibly check DROP permissions.
KaiGai Kohei and Robert Haas
Add counters for number and size of temporary files used
for spill-to-disk queries for each database to the
pg_stat_database view.
Tomas Vondra, review by Magnus Hagander
Rip out a regression test that doesn't play well with settings put in
place by the build farm, and rewrite the code in CheckIndexCompatible
in a hopefully more transparent style.
This enables a bunch of features, notably ON_ERROR_ROLLBACK. It also
makes COPY failure (either in the server or psql) as a whole behave more
sanely in psql.
Additionally, having more commands in the same command line as COPY
works better (though since psql splits lines at semicolons, this doesn't
matter much unless you're using -c).
Also tighten a couple of switches on PQresultStatus() to add
PGRES_COPY_BOTH support and stop assuming that unknown statuses received
are errors; have those print diagnostics where warranted.
Author: Noah Misch
This gives up the "don't rewrite the index" behavior in a couple of
relatively unimportant cases, such as changing between an array type
and an unconstrained domain over that array type, in return for
making this code more future-proof.
Noah Misch
Base backup follows recommended procedure, plus goes to great
lengths to ensure that partial page writes are avoided.
Jun Ishizuka and Fujii Masao, with minor modifications
This reports the depth level of triggers currently in execution, or zero
if not called from inside a trigger.
No catversion bump in this patch, but you have to initdb if you want
access to the new function.
Author: Kevin Grittner
Replication occurs only to memory on standby, not to disk,
so provides additional performance if user wishes to
reduce durability level slightly. Adds concept of multiple
independent sync rep queues.
Fujii Masao and Simon Riggs
Drop the role we create, so regression tests pass even when run more
than once against the same cluster, a problem noted by Tom Lane and
Jeff Janes. Also, rename the temporary role so that it starts with
"regress_", to make it unlikely that we'll collide with an existing
role name while running "make installcheck", per further gripe from
Tom Lane.
We log AccessExclusiveLocks for replay onto standby nodes,
but because of timing issues on ProcArray it is possible to
log a lock that is still held by a just committed transaction
that is very soon to be removed. To avoid any timing issue we
avoid applying locks made by transactions with InvalidXid.
Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee
This separates the state (running/idle/idleintransaction etc) into
it's own field ("state"), and leaves the query field containing just
query text.
The query text will now mean "current query" when a query is running
and "last query" in other states. Accordingly,the field has been
renamed from current_query to query.
Since backwards compatibility was broken anyway to make that, the procpid
field has also been renamed to pid - along with the same field in
pg_stat_replication for consistency.
Scott Mead and Magnus Hagander, review work from Greg Smith
That avoids errors when the functions are used in queries like "SELECT
pg_relation_size(oid) FROM pg_class", and a table is dropped concurrently.
Phil Sorber
When the only remaining active transactions are READ ONLY, we do a "partial
cleanup" of committed transactions because certain types of conflicts
aren't possible anymore. For committed r/w transactions, we release the
SIREAD locks but keep the SERIALIZABLEXACT. However, for committed r/o
transactions, we can go further and release the SERIALIZABLEXACT too. The
problem was with the latter case: we were returning the SERIALIZABLEXACT to
the free list without removing it from the finished list.
The only real change in the patch is the SHMQueueDelete line, but I also
reworked some of the surrounding code to make it obvious that r/o and r/w
transactions are handled differently -- the existing code felt a bit too
clever.
Dan Ports
This prevents the postmaster from unexpectedly croaking if postgresql.conf
contains something like:
include 'invalid_directory_name'
Noah Misch. Reviewed by Tom Lane and myself.
When creating a child table, or when attaching an existing table as
child of another, we must not allow inheritable constraints to be
merged with non-inheritable ones, because then grandchildren would not
properly get the constraint. This would violate the grandparent's
expectations.
Bugs noted by Robert Haas.
Author: Nikhil Sontakke
In the previous coding, it was possible for a relation to be created
via CREATE TABLE, CREATE VIEW, CREATE SEQUENCE, CREATE FOREIGN TABLE,
etc. in a schema while that schema was meanwhile being concurrently
dropped. This led to a pg_class entry with an invalid relnamespace
value. The same problem could occur if a relation was moved using
ALTER .. SET SCHEMA while the target schema was being concurrently
dropped. This patch prevents both of those scenarios by locking the
schema to which the relation is being added using AccessShareLock,
which conflicts with the AccessExclusiveLock taken by DROP.
As a desirable side effect, this also prevents the use of CREATE OR
REPLACE VIEW to queue for an AccessExclusiveLock on a relation on which
you have no rights: that will now fail immediately with a permissions
error, before trying to obtain a lock.
We need similar protection for all other object types, but as everything
other than relations uses a slightly different set of code paths, I'm
leaving that for a separate commit.
Original complaint (as far as I could find) about CREATE by Nikhil
Sontakke; risk for ALTER .. SET SCHEMA pointed out by Tom Lane;
further details by Dan Farina; patch by me; review by Hitoshi Harada.
When the remote end of the pipe is closed, select() reports the fd as
readable, but poll() has a separate POLLHUP return code for that.
Spotted by Peter Geoghegan.
Allows a user to use pg_cancel_queries() to cancel queries in
other backends if they are running under the same role.
pg_terminate_backend() still requires superuser permissoins.
Short patch, many authors working on the bikeshed: Magnus Hagander,
Josh Kupershmidt, Edward Muller, Greg Smith.
isolationtester is now able to continue running other permutations when
it detects that one of them is invalid, which is useful during initial
development of spec files.
Author: Alexander Shulgin
superuser doesn't have doesn't make much sense, as a superuser can do
whatever he wants through other means, anyway. So instead of granting
replication privilege to superusers in CREATE USER time by default, allow
replication connection from superusers whether or not they have the
replication privilege.
Patch by Noah Misch, per discussion on bug report #6264
As noted by Tom Lane, the previous coding in this area, which I
introduced in commit bbb6e559c4, was
poorly tested and caused the vacuum's second heap to go into what would
have been an infinite loop but for the fact that it eventually caused a
memory allocation failure. This version seems to work better.
Previously we used ReadRecPtr rather than EndRecPtr, which was
not a serious error but caused pg_stat_replication to report
incorrect replay_location until at least one WAL record is replayed.
Fujii Masao
In commit 7b0d0e9356, I made CLUSTER and
VACUUM FULL try to preserve toast value OIDs from the original toast table
to the new one. However, if we have to copy both live and recently-dead
versions of a row that has a toasted column, those versions may well
reference the same toast value with the same OID. The patch then led to
duplicate-key failures as we tried to insert the toast value twice with the
same OID. (The previous behavior was not very desirable either, since it
would have silently inserted the same value twice with different OIDs.
That wastes space, but what's worse is that the toast values inserted for
already-dead heap rows would not be reclaimed by subsequent ordinary
VACUUMs, since they go into the new toast table marked live not deleted.)
To fix, check if the copied OID already exists in the new toast table, and
if so, assume that it stores the desired value. This is reasonably safe
since the only case where we will copy an OID from a previous toast pointer
is when toast_insert_or_update was given that toast pointer and so we just
pulled the data from the old table; if we got two different values that way
then we have big problems anyway. We do have to assume that no other
backend is inserting items into the new toast table concurrently, but
that's surely safe for CLUSTER and VACUUM FULL.
Per bug #6393 from Maxim Boguk. Back-patch to 9.0, same as the previous
patch.
The originally-chosen test case gives different results in es_EC locale
because of unusual rule for sorting strings beginning with "LL". Adjust
the comparison value to avoid that, while hopefully not introducing new
locale dependencies elsewhere. Per report from Jaime Casanova.
A permutation that specifies more steps than defined causes
isolationtester to crash, so avoid that. Using less steps than defined
should probably not be a problem, but no spec currently does that.
constructed before acquiring WALInsertLock, which slightly reduces the time
the lock is held. Although I could not measure any benefit in benchmarks,
the code is more readable this way.
Teach pg_basebackup in streaming mode to deal with keepalive messages.
Also change the order of checks to complain at the message rather than
block size when a new message is introduced.
In passing, switch to using sizeof() instead of hardcoded sizes for
WAL protocol structs.
The original implementation of this interpreted it as a kind of
"inheritance" facility and named all the internal structures
accordingly. This turned out to be very confusing, because it has
nothing to do with the INHERITS feature. So rename all the internal
parser infrastructure, update the comments, adjust the error messages,
and split up the regression tests.
Historically we've used the SWPB instruction for TAS() on ARM, but this
is deprecated and not available on ARMv6 and later. Instead, make use
of a GCC builtin if available. We'll still fall back to SWPB if not,
so as not to break existing ports using older GCC versions.
Eventually we might want to try using __sync_lock_test_and_set() on some
other architectures too, but for now that seems to present only risk and
not reward.
Back-patch to all supported versions, since people might want to use any
of them on more recent ARM chips.
Martin Pitt
This squeezes out a bunch of alignment padding, reducing the size
from 72 to 56 bytes on my machine. At least in my testing, this
didn't produce any measurable performance improvement, but the space
savings seem like enough justification.
Andres Freund
ALTER TABLE (and ALTER VIEW, ALTER SEQUENCE, etc.) now use a
RangeVarGetRelid callback to check permissions before acquiring a table
lock. We also now use the same callback for all forms of ALTER TABLE,
rather than having separate, almost-identical callbacks for ALTER TABLE
.. SET SCHEMA and ALTER TABLE .. RENAME, and no callback at all for
everything else.
I went ahead and changed the code so that no form of ALTER TABLE works
on foreign tables; you must use ALTER FOREIGN TABLE instead. In 9.1,
it was possible to use ALTER TABLE .. SET SCHEMA or ALTER TABLE ..
RENAME on a foreign table, but not any other form of ALTER TABLE, which
did not seem terribly useful or consistent.
Patch by me; review by Noah Misch.
Previously, this was hardcoded: we always had 8. Performance testing
shows that isn't enough, especially on big SMP systems, so we allow it
to scale up as high as 32 when there's adequate memory. On the flip
side, when shared_buffers is very small, drop the number of CLOG buffers
down to as little as 4, so that we can start the postmaster even
when very little shared memory is available.
Per extensive discussion with Simon Riggs, Tom Lane, and others on
pgsql-hackers.
In commit 6545a901aa, I removed the mini SQL
lexer that was in pg_backup_db.c, thinking that it had no real purpose
beyond separating COPY data from SQL commands, which purpose had been
obsoleted by long-ago fixes in pg_dump's archive file format.
Unfortunately this was in error: that code was also used to identify
command boundaries in INSERT-style table data, which is run together as a
single string in the archive file for better compressibility. As a result,
direct-to-database restores from archive files made with --inserts or
--column-inserts fail in our latest releases, as reported by Dick Visser.
To fix, restore the mini SQL lexer, but simplify it by adjusting the
calling logic so that it's only required to cope with INSERT-style table
data, not arbitrary SQL commands. This allows us to not have to deal with
SQL comments, E'' strings, or dollar-quoted strings, none of which have
ever been emitted by dumpTableData_insert.
Also, fix the lexer to cope with standard-conforming strings, which was the
actual bug that the previous patch was meant to solve.
Back-patch to all supported branches. The previous patch went back to 8.2,
which unfortunately means that the EOL release of 8.2 contains this bug,
but I don't think we're doing another 8.2 release just because of that.
As noted by Heikki Linnakangas, the previous coding confused the "flags"
variable with the "mask" variable. The affect of this appears to be that
unlogged buffers would get written out at every checkpoint rather than
only at shutdown time. Although that's arguably an acceptable failure
mode, I'm back-patching this change, since it seems like a poor idea to
rely on this happening to work.
pg_dump sorts operators by name, but operators with the same name come
out in random order. Now operators with the same name are dumped in
the order prefix, postfix, infix. (This is consistent with functions,
which are dumped in increasing number of argument order.)
ALTER DOMAIN / DROP CONSTRAINT on a nonexistent constraint name did
not report any error. Now it reports an error. The IF EXISTS option
was added to get the usual behavior of ignoring nonexistent objects to
drop.
Certain things like typeglobs or readonly things like $^V cause
perl's SvPVutf8() to die nastily and crash the backend. To avoid
that bug we make a copy of the object, which will subsequently be
garbage collected.
Back patched to 9.1 where we first started using SvPVutf8().
Per -hackers discussion. Original problem reported by David Wheeler.
As previously coded, the QueryDesc's dest pointer was left dangling
(pointing at an already-freed receiver object) after ExecutorEnd. It's a
bit astonishing that it took us this long to notice, and I'm not sure that
the known problem case with SQL functions is the only one. Fix it by
saving and restoring the original receiver pointer, which seems the most
bulletproof way of ensuring any related bugs are also covered.
Per bug #6379 from Paul Ramsey. Back-patch to 8.4 where the current
handling of SELECT INTO was introduced.
Further testing convinces me that this is helpful at sufficiently high
contention levels, though it's still worrisome that it loses slightly
at lower contention levels.
Per Manabu Ori.
Because coerce_type recurses into the argument of a CollateExpr,
coerce_to_target_type's longstanding code for detecting whether coerce_type
had actually done anything (to wit, returned a different node than it
passed in) was broken in 9.1. This resulted in unexpected failures in
hide_coercion_node; which was not the latter's fault, since it's critical
that we never call it on anything that wasn't inserted by coerce_type.
(Else we might decide to "hide" a user-written function call.)
Fix by removing and replacing the CollateExpr in coerce_to_target_type
itself. This is all pretty ugly but I don't immediately see a way to make
it nicer.
Per report from Jean-Yves F. Barbier.
This is allegedly a win, at least on some PPC implementations, according
to the PPC ISA documents. However, as with LWARX hints, some PPC
platforms give an illegal-instruction failure. Use the same trick as
before of assuming that PPC64 platforms will accept it; we might need to
refine that based on experience, but there are other projects doing
likewise according to google.
I did not add an assembler compatibility test because LWSYNC has been
around much longer than hint bits, and it seems unlikely that any
toolchains currently in use don't recognize it.
Previously we defined slock_t as 8 bytes on PPC64, but the TAS assembly
code uses word-wide operations regardless, so that the second word was
just wasted space. There doesn't appear to be any performance benefit
in adding the second word, so get rid of it to simplify the code.
The hint bit makes for a small but measurable performance improvement
in access to contended spinlocks.
On the other hand, some PPC chips give an illegal-instruction failure.
There doesn't seem to be a completely bulletproof way to tell whether the
hint bit will cause an illegal-instruction failure other than by trying
it; but most if not all 64-bit PPC machines should accept it, so follow
the Linux kernel's lead and assume it's okay to use it in 64-bit builds.
Of course we must also check whether the assembler accepts the command,
since even with a recent CPU the toolchain could be old.
Patch by Manabu Ori, significantly modified by me.
This reverts commit ff68b256a5.
The recent change to use -fexcess-precision=standard should make those
Asserts safe, and does fix a test case that formerly crashed for me,
so I think there's no need to have a cross-version difference in the
code here.
The original test cases gave varying results depending on whether the
locale sorts digits before or after letters. Since that's not really
what we wish to test here, adjust the test data to not contain any strings
beginning with digits. Per report from Pavel Stehule.
All supported platforms support the C89 standard function atexit()
(SunOS 4 probably being the last one not to), and supporting both
makes the code clumsy.
In commit e2c2c2e8b1 I made use of nested
list structures to show which clauses went with which index columns, but
on reflection that's a data structure that only an old-line Lisp hacker
could love. Worse, it adds unnecessary complication to the many places
that don't much care which clauses go with which index columns. Revert
to the previous arrangement of flat lists of clauses, and instead add a
parallel integer list of column numbers. The places that care about the
pairing can chase both lists with forboth(), while the places that don't
care just examine one list the same as before.
The only real downside to this is that there are now two more lists that
need to be passed to amcostestimate functions in case they care about
column matching (which btcostestimate does, so not passing the info is not
an option). Rather than deal with 11-argument amcostestimate functions,
pass just the IndexPath and expect the functions to extract fields from it.
That gets us down to 7 arguments which is better than 11, and it seems
more future-proof against likely additions to the information we keep
about an index path.
It's potentially useful for an index to repeat the same indexable column
or expression in multiple index columns, if the columns have different
opclasses. (If they share opclasses too, the duplicate column is pretty
useless, but nonetheless we've allowed such cases since 9.0.) However,
the planner failed to cope with this, because createplan.c was relying on
simple equal() matching to figure out which index column each index qual
is intended for. We do have that information available upstream in
indxpath.c, though, so the fix is to not flatten the multi-level indexquals
list when putting it into an IndexPath. Then we can rely on the sublist
structure to identify target index columns in createplan.c. There's a
similar issue for index ORDER BYs (the KNNGIST feature), so introduce a
multi-level-list representation for that too. This adds a bit more
representational overhead, but we might more or less buy that back by not
having to search for matching index columns anymore in createplan.c;
likewise btcostestimate saves some cycles.
Per bug #6351 from Christian Rudolph. Likely symptoms include the "btree
index keys must be ordered by attribute" failure shown there, as well as
"operator MMMM is not a member of opfamily NNNN".
Although this is a pre-existing problem that can be demonstrated in 9.0 and
9.1, I'm not going to back-patch it, because the API changes in the planner
seem likely to break things such as index plugins. The corner cases where
this matters seem too narrow to justify possibly breaking things in a minor
release.
When a view is marked as a security barrier, it will not be pulled up
into the containing query, and no quals will be pushed down into it,
so that no function or operator chosen by the user can be applied to
rows not exposed by the view. Views not configured with this
option cannot provide robust row-level security, but will perform far
better.
Patch by KaiGai Kohei; original problem report by Heikki Linnakangas
(in October 2009!). Review (in earlier versions) by Noah Misch and
others. Design advice by Tom Lane and myself. Further review and
cleanup by me.
You could already rename domains using ALTER TYPE, but with this new
command it is more consistent with how other commands treat domains as
a subcategory of types.
This has been broken just about forever (or more specifically, commit
7f4981f4af) and nobody noticed until
Richard Huxton reported it recently. Analysis and fix by Ross
Reedstrom, although I didn't use his patch. This doesn't seem
important enough to back-patch and is mildly backward incompatible, so
I'm just doing this in master.
We forgot to modify column ACLs, so privileges were still shown as having
been granted by the old owner. This meant that neither the new owner nor
a superuser could revoke the now-untraceable-to-table-owner permissions.
Per bug #6350 from Marc Balmer.
This has been wrong since column ACLs were added, so back-patch to 8.4.
In the previous coding, a user could queue up for an AccessExclusiveLock
on a table they did not have permission to cluster, thus potentially
interfering with access by authorized users who got stuck waiting behind
the AccessExclusiveLock. This approach avoids that. cluster() has the
same permissions-checking requirements as REINDEX TABLE, so this commit
moves the now-shared callback to tablecmds.c and renames it, per
discussion with Noah Misch.
When a PORTAL_ONE_SELECT query is executed, we can opportunistically
reuse the parse/plan shot for the execution phase. This cuts down the
number of snapshots per simple query from 2 to 1 for the simple
protocol, and 3 to 2 for the extended protocol. Since we are only
reusing a snapshot taken early in the processing of the same protocol
message, the change shouldn't be user-visible, except that the remote
possibility of the planning and execution snapshots being different is
eliminated.
Note that this change does not make it safe to assume that the parse/plan
snapshot will certainly be reused; that will currently only happen if
PortalStart() decides to use the PORTAL_ONE_SELECT strategy. It might
be worth trying to provide some stronger guarantees here in the future,
but for now we don't.
Patch by me; review by Dimitri Fontaine.
The original coding of this function overlooked the possibility that
it could be passed anything except simple OpExpr indexquals. But
ScalarArrayOpExpr is possible too, and the code would probably crash
(and surely give ridiculous answers) in such a case. Add logic to try
to estimate sanely for such cases.
In passing, fix the treatment of inner-indexscan cost estimation: it was
failing to scale up properly for multiple iterations of a nestloop.
(I think somebody might've thought that index_pages_fetched() is linear,
but of course it's not.)
Report, diagnosis, and preliminary patch by Marti Raudsepp; I refactored
it a bit and fixed the cost estimation.
Back-patch into 9.1 where the bogus code was introduced.
smgrdounlink takes care to not throw an ERROR if it fails to unlink
something, but that caution was rendered useless by commit
3396000684, which put an smgrexists call in
front of it; smgrexists *does* throw error if anything looks funny, such
as getting a permissions error from trying to open the file. If that
happens post-commit, you get a PANIC, and what's worse the same logic
appears in the WAL replay code, so the database even fails to restart.
Restore the intended behavior by removing the smgrexists call --- it isn't
accomplishing anything that we can't do better by adjusting mdunlink's
ideas of whether it ought to warn about ENOENT or not.
Per report from Joseph Shraibman of unrecoverable crash after trying to
drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions.
Backpatch to 8.4, where the bogus coding was introduced.
This adds support for the more or less SQL-conforming USAGE privilege
on types and domains. The intent is to be able restrict which users
can create dependencies on types, which restricts the way in which
owners can alter types.
reviewed by Yeb Havinga
On reflection, the original name seems way too generic for a global
symbol. A quick check shows this is the only exported function name
in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the
rest of them seem all right.
This makes them enforceable only on the parent table, not on children
tables. This is useful in various situations, per discussion involving
people bitten by the restrictive behavior introduced in 8.4.
Message-Id:
8762mp93iw.fsf@comcast.netCAFaPBrSMMpubkGf4zcRL_YL-AERUbYF_-ZNNYfb3CVwwEqc9TQ@mail.gmail.com
Authors: Nikhil Sontakke, Alex Hunsaker
Reviewed by Robert Haas and myself
Operator classes can specify whether or not they support this; this
preserves the flexibility to use lossy representations within an index.
In passing, move constant data about a given index into the rd_amcache
cache area, instead of doing fresh lookups each time we start an index
operation. This is mainly to try to make sure that spgcanreturn() has
insignificant cost; I still don't have any proof that it matters for
actual index accesses. Also, get rid of useless copying of FmgrInfo
pointers; we can perfectly well use the relcache's versions in-place.
The need for this was debated when we put in the index-only-scan feature,
but at the time we had no near-term expectation of having AMs that could
support such scans for only some indexes; so we kept it simple. However,
the SP-GiST AM forces the issue, so let's fix it.
This patch only installs the new API; no behavior actually changes.
This moves the code around from one huge file into hopefully logical
and more manageable modules. For the most part, the code itself was
not touched, except: PLy_function_handler and PLy_trigger_handler were
renamed to PLy_exec_function and PLy_exec_trigger, because they were
not actually handlers in the PL handler sense, and it makes the naming
more similar to the way PL/pgSQL is organized. The initialization of
the procedure caches was separated into a new function
init_procedure_caches to keep the hash tables private to
plpy_procedures.c.
Jan Urbański and Peter Eisentraut
Ever since we introduced real prepared statements this should work for
different connections. The old solution just emulating prepared statements,
though, wasn't able to handle this.
Closes: #6309
These entries could never be matched to an index clause because they don't
have the index datatype on the left-hand side of the operator. (Their
commutators are in the opclass, which is sensible, but that doesn't mean
these operators should be.) Spotted by a test that I recently added to
opr_sanity to catch exactly this type of thinko. AFAICT there is no code
in gistproc.c that is specifically meant to cover these cases, so nothing
to remove at that level.
SP-GiST is comparable to GiST in flexibility, but supports non-balanced
partitioned search structures rather than balanced trees. As described at
PGCon 2011, this new indexing structure can beat GiST in both index build
time and query speed for search problems that it is well matched to.
There are a number of areas that could still use improvement, but at this
point the code seems committable.
Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
Heikki Linnakangas had the idea of rearranging GetSnapshotData to
avoid checking for sub-XIDs when no top-level XID is present. This
patch does that plus further a bit of further, related rearrangement.
Benchmarking show a significant improvement on unlogged tables at
higher concurrency levels, and mostly indifferent result on permanent
tables (which are presumably bottlenecked elsewhere). Most of the
benefit seems to come from using the new NormalTransactionIdPrecedes()
macro rather than the function call TransactionIdPrecedes().
Valid values are --pre-data, data and post-data. The option can be
given more than once. --schema-only is equivalent to
--section=pre-data --section=post-data. --data-only is equivalent
to --section=data.
Andrew Dunstan, reviewed by Joachim Wieland and Josh Berkus.
This works the same as include, except that an error is not thrown
if the file is missing. Instead the fact that it's missing is
logged.
Greg Smith, reviewed by Euler Taveira de Oliveira.
If the referrent of a name changes while we're waiting for the lock,
we must recheck permissons. We also now check the relkind before
locking, since it's easy to do that long the way.
Patch by me; review by Noah Misch.
Previously, renaming a table, sequence, view, index, foreign table,
column, or trigger checked permissions before locking the object, which
meant that if permissions were revoked during the lock wait, we would
still allow the operation. Similarly, if the original object is dropped
and a new one with the same name is created, the operation will be allowed
if we had permissions on the old object; the permissions on the new
object don't matter. All this is now fixed.
Along the way, attempting to rename a trigger on a foreign table now gives
the same error message as trying to create one there in the first place
(i.e. that it's not a table or view) rather than simply stating that no
trigger by that name exists.
Patch by me; review by Noah Misch.
Andrew Dunstan, reviewed by Josh Berkus, Robert Haas and Peter Geoghegan.
This allows dumping of a table definition but not its data, on a per table basis.
Table name patterns are supported just as for --exclude-table.
Removing this bit from xl_info allows us to restore the old limit of four
(not three) separate pages touched by a WAL record, which is needed for the
upcoming SP-GiST feature, and will likely be useful elsewhere in future.
When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that
because no special WAL-visible action was taken when starting a backup.
However, now we force a segment switch when starting a backup, so a
compressing WAL archiver (such as pglesslog) that uses the state shown in
the current page header will not be fooled as to removability of backup
blocks. The only downside is that the archiver will not return to
compressing mode for up to one WAL page after the backup is over, which is
a small price to pay for getting back the extra xl_info bit. In any case
the archiver could look for XLOG_BACKUP_END records if it thought it was
worth the trouble to do so.
Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
I forgot to change the functions to use the PG_GETARG_INET_PP() macro,
when I changed DatumGetInetP() to unpack the datum, like Datum*P macros
usually do. Also, I screwed up the definition of the PG_GETARG_INET_PP()
macro, and didn't notice because it wasn't used.
This fixes the memory leak when sorting inet values, as reported
by Jochen Erwied and debugged by Andres Freund. Backpatch to 8.3, like
the previous patch that broke it.
Original patch by Lars Kanis, reviewed by Nishiyama Tomoaki and tweaked some by me.
This compiler, or at least the latest version of it, is currently broken, and
only passes the regression tests if built with -O0.
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.
In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
lost. The only way we detect that at the moment is when write() fails when
we try to write to the socket.
Florian Pflug with small changes by me, reviewed by Greg Jaskiewicz.
Make sure all calls are protected by HAVE_READLINK, and get the buffer
overflow tests right. Be a bit more paranoid about string length in
_tarWriteHeader(), too.
We don't have any such platforms now, but might in the future.
Also, detect cases when a tablespace symlink points to a path that
is longer than we can handle, and give a warning.
Instead, add a function pg_tablespace_location(oid) used to return
the same information, and do this by reading the symbolic link.
Doing it this way makes it possible to relocate a tablespace when the
database is down by simply changing the symbolic link.
This patch creates an API whereby a btree index opclass can optionally
provide non-SQL-callable support functions for sorting. In the initial
patch, we only use this to provide a directly-callable comparator function,
which can be invoked with a bit less overhead than the traditional
SQL-callable comparator. While that should be of value in itself, the real
reason for doing this is to provide a datatype-extensible framework for
more aggressive optimizations, as in Peter Geoghegan's recent work.
Robert Haas and Tom Lane
If unable to connect to "postgres", try "template1". This allows things to
work more smoothly in the case where the postgres database has been
dropped. And just in case that's not good enough, also allow the user to
specify a maintenance database to be used for the initial connection, to
cover the case where neither postgres nor template1 is suitable.
While logically correct, these two Asserts could fail depending on the
vagaries of floating-point arithmetic. In particular, on machines with
floating-point registers wider than standard "double" values, it was
possible for the compiler to compare a rounded-to-double value already
stored in memory with an unrounded long double value still in a register.
Given the preceding checks, these assertions aren't adding much, so let's
just get rid of them rather than try to find a compiler-proof fix.
Per report from Pavel Stehule.
Given the lack of previous complaints, and the fact that only developers
would be likely to trip over it, I'm only going to change this in HEAD,
even though the code has been like this for a long time.
Add a function plpy.cursor that is similar to plpy.execute but uses an
SPI cursor to avoid fetching the entire result set into memory.
Jan Urbański, reviewed by Steve Singer
This can be used to set (or unset) environment variables that will
affect programs called by psql (such as the PAGER), probably most
usefully in a .psqlrc file.
Andrew Dunstan, reviewed by Josh Kupershmidt.
This makes it possible to use a libpq app with home directory set
to /dev/null, for example - treating it the same as if the file
doesn't exist (which it doesn't).
Per bug #6302, reported by Diego Elio Petteno
invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.
Also refrain from creating a restartpoint if there are invalid-page entries
in the hash table. Restarting recovery from such a restartpoint would not
see the invalid references, and wouldn't be able to cross-check them when
consistency is reached. That wouldn't matter when things are going smoothly,
but the more sanity checks you have the better.
Fujii Masao
Since record[] uses array_in, it needs to have its element type passed
as typioparam. In HEAD and 9.1, this fix essentially reverts commit
9bc933b212, which was a hack that is no
longer needed since domains don't set their typelem anymore. Before
that, adjust the logic so that only domains are excluded from being
treated like arrays, rather than assuming that only base types should
be included. Add a regression test to demonstrate the need for this.
Per report from Maxim Boguk.
Back-patch to 8.4, where type record[] was added.
In the previous coding, callers were faced with an awkward choice:
look up the name, do permissions checks, and then lock the table; or
look up the name, lock the table, and then do permissions checks.
The first choice was wrong because the results of the name lookup
and permissions checks might be out-of-date by the time the table
lock was acquired, while the second allowed a user with no privileges
to interfere with access to a table by users who do have privileges
(e.g. if a malicious backend queues up for an AccessExclusiveLock on
a table on which AccessShareLock is already held, further attempts
to access the table will be blocked until the AccessExclusiveLock
is obtained and the malicious backend's transaction rolls back).
To fix, allow callers of RangeVarGetRelid() to pass a callback which
gets executed after performing the name lookup but before acquiring
the relation lock. If the name lookup is retried (because
invalidation messages are received), the callback will be re-executed
as well, so we get the best of both worlds. RangeVarGetRelid() is
renamed to RangeVarGetRelidExtended(); callers not wishing to supply
a callback can continue to invoke it as RangeVarGetRelid(), which is
now a macro. Since the only one caller that uses nowait = true now
passes a callback anyway, the RangeVarGetRelid() macro defaults nowait
as well. The callback can also be used for supplemental locking - for
example, REINDEX INDEX needs to acquire the table lock before the index
lock to reduce deadlock possibilities.
There's a lot more work to be done here to fix all the cases where this
can be a problem, but this commit provides the general infrastructure
and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE,
LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE.
Per discussion with Noah Misch and Alvaro Herrera.
On a platform that isn't supplying __FILE__, previous coding would either
crash or give a stale result for the filename string. Not sure how likely
that is, but the original code catered for it, so let's keep doing so.
In vpath builds, the __FILE__ macro that is used in verbose error
reports contains the full absolute file name, which makes the error
messages excessively verbose. So keep only the base name, thus
matching the behavior of non-vpath builds.
Force the transaction isolation level to READ COMMITTED in autovacuum
worker and launcher processes. There is no benefit to using a higher
isolation level, and doing so could result in delaying foreground
transactions (or maybe even causing unnecessary serialization failures?).
Noted by Dan Ports.
Also, make sure we disable zero_damaged_pages and statement_timeout in
the autovac launcher, not only workers. Now that the launcher can run
transactions, these settings could affect its behavior, and it seems
like the same arguments apply to the launcher as the workers.
This should make it easier to identify which row is problematic when an
insert or update is processing many rows.
The formatting is similar to that for unique-index violation messages,
except that we limit field widths to 64 bytes since otherwise the message
could get unreasonably long. (In particular, there's currently no attempt
to quote or escape field values that contain commas etc.)
Jan Kundrát, reviewed by Royce Ausburn, somewhat rewritten by me.
The old expression sed 's,$(srcdir),python3,' would normally resolve
as sed 's,.,python3,', which is not really what we wanted. While it
doesn't actually break anything right now, it's still wrong, so put in
a bit more work to make it more robust.
Moving the code two full tab stops to the right requires rethinking of
cosmetic code layout choices, which pgindent isn't really able to do for
us. Whitespace and comment adjustments only, no code changes.
While the deletion in itself wouldn't break things, any further creation
of objects in the script would result in dangling pg_depend entries being
added by recordDependencyOnCurrentExtension(). An example from Phil
Sorber convinced me that this is just barely likely enough to be worth
expending a couple lines of code to defend against. The resulting error
message might be confusing, but it's better than leaving corrupted catalog
contents for the user to deal with.
This function has now grown enough cases that a switch seems appropriate.
This results in a measurable speed improvement on some platforms, and
should certainly not hurt. The code's in need of a pgindent run now,
though.
Andres Freund
The server name for a foreign table was not quoted at need, as per report
from Ronan Dunklau. Also, queries related to FDW options were inadequately
schema-qualified in places where the search path isn't just pg_catalog, and
were inconsistently formatted everywhere, and we didn't always check that
we got the expected number of rows from them.
The EvalPlanQual machinery assumes that whole-row Vars generated for the
outputs of non-table RTEs will be of composite types. However, for the
case where the RTE is a function call returning a scalar type, we were
doing the wrong thing, as a result of sharing code with a parser case
where the function's scalar output is wanted. (Or at least, that's what
that case has done historically; it does seem a bit inconsistent.)
To fix, extend makeWholeRowVar's API so that it can support both use-cases.
This fixes Belinda Cussen's report of crashes during concurrent execution
of UPDATEs involving joins to the result of UNNEST() --- in READ COMMITTED
mode, we'd run the EvalPlanQual machinery after a conflicting row update
commits, and it was expecting to get a HeapTuple not a scalar datum from
the "wholerowN" variable referencing the function RTE.
Back-patch to 9.0 where the current EvalPlanQual implementation appeared.
In 9.1 and up, this patch also fixes failure to attach the correct
collation to the Var generated for a scalar-result case. An example:
regression=# select upper(x.*) from textcat('ab', 'cd') x;
ERROR: could not determine which collation to use for upper() function
This fixes a longstanding but up to now benign bug in the way pg_dumpall
was built. The bug was exposed by recent code adjustments. The Makefile
does not use $(OBJS) to build pg_dumpall, so this fix removes their source
files from the pg_dumpall object and adds in the one source file it
consequently needs.
Use of a randomly chosen large value was never exactly graceful, and
now that there are penalty functions that are intentionally using infinity,
it doesn't seem like a good idea for null-vs-not-null to be using something
less.
In the original implementation, a range-contained-by search had to scan
the entire index because an empty range could be lurking anywhere.
Improve that by adding a flag to upper GiST entries that says whether the
represented subtree contains any empty ranges.
Also, make a simple mod to the penalty function to discourage empty ranges
from getting pushed into subtrees without any. This needs more work, and
the picksplit function should be taught about it too, but that code can be
improved without causing an on-disk compatibility break; so we'll leave it
for another day.
Since we're breaking on-disk compatibility of range values anyway, I took
the opportunity to reorganize the range flags bits; the unused
RANGE_xB_NULL bits are now adjacent, which might open the door for using
them in some other way later.
In passing, remove the GiST range opclass entry for <>, which doesn't seem
like it can really be indexed usefully.
Alexander Korotkov, with some editorializing by Tom
It runs the regression tests, runs pg_upgrade on the populated
database, and compares the before and after dumps. While not actually
a cross-version upgrade, this does detect omissions and bugs in the
involved tools from time to time. It's also possible to do a
cross-version upgrade by manually supplying parameters.
The original coding was
var->value = (Datum) state;
which is bogus, and then in commit 2f0f7b4bce
it was "corrected" to
var->value = PointerGetDatum(state);
which is a faithful translation but still wrong.
This seems purely cosmetic, though, so no need for a back-patch.
Pavel Stehule
distro version of perl.
David Wheeler and Alex Hunsaker.
Backpatch to 9.1 where it applies cleanly. A simple workaround is available for earlier
branches, and further effort doesn't seem warranted.
In the cases where the result of the called proc is negated, we should
explicitly test both inputs for empty, to ensure we'll never return "true"
for an unsatisfiable query. In other cases we can rely on the called proc
to say the right thing.
Same bug as reported by Thom Brown for check constraints on tables: the
constraint must be dumped separately from the domain, otherwise it is
restored before the data and thus prevents potentially-violating data
from being loaded in the first place.
Per Dean Rasheed
This adds some I/O stats to the logging of autovacuum (when the
operation takes long enough that log_autovacuum_min_duration causes it
to be logged), so that it is easier to tune. Notably, it adds buffer
I/O counts (hits, misses, dirtied) and read and write rate.
Authors: Greg Smith and Noah Misch
A simple thinko in ginRedoUpdateMetapage, namely failing to increment a
loop counter, led to inserting records into the last pending-list page in
the wrong order (the opposite of that intended). So far as I can tell,
this would not upset the code that eventually flushes pending items into
the main part of the GIN index. But it did break the code that searched
the pending list for matches, resulting in transient failure to find
matching entries during index lookups, as illustrated in bug #6307 from
Maksym Boguk.
Back-patch to 8.4 where the incorrect code was introduced.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order. These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.
Pavan Deolasee, Heikki Linnakangas, Robert Haas
The WITH [NO] DATA option was not supported, nor the ability to specify
replacement column names; the former limitation wasn't even documented, as
per recent complaint from Naoya Anzai. Fix by moving the responsibility
for supporting these options into the executor. It actually takes less
code this way ...
catversion bump due to change in representation of IntoClause, which might
affect stored rules.
exception handler. This was a regression in 9.1, when the capability
to catch specific SPI errors was added, so backpatch to 9.1.
Mika Eloranta, with some editing by Jan Urbański.
The original coding would not work for discrete ranges in which the
canonicalization rule is to produce symmetric boundaries (either [] or ()
style), as noted by Jeff Davis. Florian Pflug pointed out that we could
fix that by invoking the canonicalization function to see if the range
"between" the two given ranges normalizes to empty. This implementation
of Florian's idea is a tad slower than the original code, but only in the
case where there actually is a canonicalization function --- if not, it's
essentially the same logic as before.
Since range types can be created by non-superusers, we need to consider
their permissions. Ideally we'd check this when the type is used, not
when it's created, but that seems like much more trouble than it's worth.
The existing restriction that the support functions be immutable already
prevents most cases where an unauthorized call to a function might be
thought a security issue, and the fact that the user has no access to
the results of the system's calls to subtype_diff closes off the other
plausible reason for concern. So this check is basically pro-forma,
but let's make it anyway.
It's not clear that a per-datatype typanalyze function would be any more
useful than a generic typanalyze for ranges. What *is* clear is that
letting unprivileged users select typanalyze functions is a crash risk or
worse. So remove the option from CREATE TYPE AS RANGE, and instead put in
a generic typanalyze function for ranges. The generic function does
nothing as yet, but hopefully we'll improve that before 9.2 release.
Per discussion, the zero-argument forms aren't really worth the catalog
space (just write 'empty' instead). The one-argument forms have some use,
but they also have a serious problem with looking too much like functional
cast notation; to the point where in many real use-cases, the parser would
misinterpret what was wanted.
Committing this as a separate patch, with the thought that we might want
to revert part or all of it if we can think of some way around the cast
ambiguity.
Implement these tests directly instead of constructing a singleton range
and then applying range-contains. This saves a range serialize/deserialize
cycle as well as a couple of redundant bound-comparison steps, and adds
very little code on net.
Remove elem_contained_by_range from the GiST opclass: it doesn't belong
there because there is no way to use it in an index clause (where the
indexed column would have to be on the left). Its commutator is in the
opclass, and that's what counts.
In the normal course of events, this matters only if ALTER DEFAULT
PRIVILEGES has been used to revoke default INSERT permission. Whether
or not the new behavior is more or less likely to be what the user wants
when dealing only with the built-in privilege facilities is arguable,
but it's clearly better when using a loadable module such as sepgsql
that may use the hook in ExecCheckRTPerms to enforce additional
permissions checks.
KaiGai Kohei, reviewed by Albe Laurenz
Per discussion, relax the range input/construction rules so that the
only hard error is lower bound > upper bound. Cases where the lower
bound is <= upper bound, but the range nonetheless normalizes to empty,
are now permitted.
Fix core dump in range_adjacent when bounds are infinite. Marginal
cleanup of regression test cases, some more code commenting.
Fix up some infelicitous coding in DefineRange, and add some missing error
checks. Rearrange operator strategy number assignments for GiST anyrange
opclass so that they don't make such a mess of opr_sanity's table of
operator names associated with different strategy numbers. Assign
hopefully-temporary selectivity estimators to range operators that didn't
have one --- poor as the estimates are, they're still a lot better than the
default 0.5 estimate, and they'll shut up the opr_sanity test that wants to
see selectivity estimators on all built-in operators.
When the system is idle for awhile after activity, the "smoothed_alloc"
state variable in BgBufferSync converges slowly to zero. With standard
IEEE float arithmetic this results in several iterations with denormalized
values, which causes kernel traps and annoying log messages on some
poorly-designed platforms. There's no real need to track such small values
of smoothed_alloc, so we can prevent the kernel traps by forcing it to zero
as soon as it's too small to be interesting for our purposes. This issue
is purely cosmetic, since the iterations don't happen fast enough for the
kernel traps to pose any meaningful performance problem, but still it seems
worth shutting up the log messages.
The kernel log messages were previously reported by a number of people,
but kudos to Greg Matthews for tracking down exactly where they were coming
from.
When wal_level = 'hot_standby' we touched the last page of the
relation during a VACUUM, even if nothing else had happened.
That would alter the LSN of the last block and set the mtime
of the relation file unnecessarily. Noted by Thom Brown.
This gets rid of an impressive amount of duplicative code, with only
minimal behavior changes. DROP FOREIGN DATA WRAPPER now requires object
ownership rather than superuser privileges, matching the documentation
we already have. We also eliminate the historical warning about dropping
a built-in function as unuseful. All operations are now performed in the
same order for all object types handled by dropcmds.c.
KaiGai Kohei, with minor revisions by me
Use of anynonarray was a crude hack to get around ambiguity versus the
array inclusion operators of the same names. My previous patch to extend
the parser's type resolution heuristics makes that unnecessary, so use
the more general declaration instead. This eliminates a wart that these
operators couldn't be used with ranges over arrays, which are otherwise
supported just fine.
Also, mark range_before and range_after as commutator operators,
per discussion with Jeff Davis.
For a very long time, one of the parser's heuristics for resolving
ambiguous operator calls has been to assume that unknown-type literals are
of the same type as the other input (if it's known). However, this was
only used in the first step of quickly checking for an exact-types match,
and thus did not help in resolving matches that require coercion, such as
matches to polymorphic operators. As we add more polymorphic operators,
this becomes more of a problem. This patch adds another use of the same
heuristic as a last-ditch check before failing to resolve an ambiguous
operator or function call. In particular this will let us define the range
inclusion operator in a less limited way (to come in a follow-on patch).
A very long time ago, language names were specified as literals rather
than identifiers, so this code was added to do case-folding. But that
style has ben deprecated for many years so this isn't needed any more.
Language names will still be downcased when specified as unquoted
identifiers, but quoted identifiers or the old style using string
literals will be left as-is.
This gives a much better error message when the object of interest is
concurrently dropped and avoids needlessly failing when the object of
interest is concurrently dropped and recreated. It also improves the
behavior of two concurrent DROP IF EXISTS operations targeted at the
same object; as before, one will drop the object, but now the other
will emit the usual NOTICE indicating that the object does not exist,
instead of rolling back. As a fringe benefit, it's also slightly
less code.
Fix assorted infelicities, such as dependency on OIDs that aren't
hardwired, as well as outright misdeclaration of daterange_canonical(),
which resulted in crashes if you invoked it directly. Add some more
regression tests to try to catch similar mistakes in future.
This can change the meaning of queries, if the blank line happens to
occur in the middle of a quoted literal, as per complaint from Tomas Vondra.
Back-patch to all supported branches.
Move the responsibility for caching specialized information about range
types into the type cache, so that the catalog lookups only have to occur
once per session. Rearrange APIs a bit so that fn_extra caching is
actually effective in the GiST support code. (Use of OidFunctionCallN is
bad enough for performance in itself, but it also prevents the function
from exploiting fn_extra caching.)
The range I/O functions are still not very bright about caching repeated
lookups, but that seems like material for a separate patch.
Also, avoid unnecessary use of memcpy to fetch/store the range type OID and
flags, and don't use the full range_deserialize machinery when all we need
to see is the flags value.
Also fix API error in range_gist_penalty --- it was failing to set *penalty
for any case involving an empty range.
A range type whose element type has 'd' alignment must have 'd' alignment
itself, else there is no guarantee that the element value can be used
in-place. (Because range_deserialize uses att_align_pointer which forcibly
aligns the given pointer, violations of this rule did not lead to SIGBUS
but rather to garbage data being extracted, as in one of the added
regression test cases.)
Also, you can't put a toast pointer inside a range datum, since the
referenced value could disappear with the range datum still present.
For consistency with the handling of arrays and records, I also forced
decompression of in-line-compressed bound values. It would work to store
them as-is, but our policy is to avoid situations that might result in
double compression.
Add assorted regression tests for this, and bump catversion because of
fixes to built-in pg_type entries.
Also some marginal cleanup of inconsistent/unnecessary error checks.
Change range_lower and range_upper to return NULL rather than throwing an
error when the input range is empty or the relevant bound is infinite. Per
discussion, throwing an error seems likely to be unduly hard to work with.
Also, this is more consistent with the behavior of the constructors, which
treat NULL as meaning an infinite bound.
Change range_before, range_after, range_adjacent to return false rather
than throwing an error when one or both input ranges are empty.
The original definition is unnecessarily difficult to use, and also can
result in undesirable planner failures since the planner could try to
compare an empty range to something else while deriving statistical
estimates. (This was, in fact, the cause of repeatable regression test
failures on buildfarm member jaguar, as well as intermittent failures
elsewhere.)
Also tweak rangetypes regression test to not drop all the objects it
creates, so that the final state of the regression database contains
some rangetype objects for pg_dump testing.
No functional changes in this commit (except I could not resist the
temptation to re-word a couple of error messages). This is just manual
cleanup after pgindent to make the code look reasonably like other PG
code, in preparation for more detailed code review to come.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
This seems to have been just an oversight in previous foreign-table work.
A quick grep didn't turn up any other places where RELKIND_FOREIGN_TABLE
was obviously omitted.
One change noted by Alexander Soudakov, the other by me.
Back-patch to 9.1.
This adds the "auto" option to the \x command, which switches to the
expanded mode when the normal output would be wider than the screen.
reviewed by Noah Misch
If it turns out we've locked the wrong OID, release the old lock. In
most cases, it's pretty harmless to retain the extra lock, but this
seems tidier and avoids using lock table slots unnecessarily.
Per discussion with Tom Lane.
Previously, you'd get "function pg_catalog.pg_get_functiondef(integer) does
not exist", which is at best rather unprofessional-looking. Back-patch
to 8.4 where \ef was introduced.
Josh Kupershmidt
This reverts commit 0180bd6180.
contrib/userlock is gone, but user-level locking still exists,
and is exposed via the pg_advisory* family of functions.
If malloc(0) returns NULL, the binary search in findSecLabels() will
probably go into an infinite loop when there are no security labels,
because NULL-1 is greater than NULL after wraparound.
(We've seen this pathology before ... I wonder whether there's a way to
detect the class of bugs automatically?)
Diagnosis and patch by Steve Singer, cosmetic adjustments by me
Forgot to call RestoreBkpBlocks() in the redo-function, as pointed out by
Simon Riggs. In redo of a regular heap insert, it's taken care of in
heap_redo(), but this new record type uses the heap2 RM, and heap2_redo()
does not take care of that for you.
Also, failed to reset the vmbuffer and all_visibile_cleared local variables
after switching to a new buffer.
It used to be cleaned in maintainer-clean, but that is inconsistent
with other cleaning of NLS files in nls-global.mk, and it's also wrong
overall, because it's not part of the distribution tarball, which is
the base definition of the maintainer-clean target.
This greatly reduces the WAL volume, especially when the table is narrow.
The overhead of locking the heap page is also reduced. Reduced WAL traffic
also makes it scale a lot better, if you run multiple COPY processes at
the same time.
In particular, my previous patch expected the create_index test to run
before the inherit test; but this was only true in the serial schedule.
Rearrange this portion of the schedules to be more consistent.
Per buildfarm results.
Add PlaceHolderVar wrappers as needed to make UNION ALL sub-select output
expressions appear non-constant and distinct from each other. This makes
the world safe for add_child_rel_equivalences to do what it does. Before,
it was possible for that function to add identical expressions to different
EquivalenceClasses, which logically should imply merging such ECs, which
would be wrong; or to improperly add a constant to an EquivalenceClass,
drastically changing its behavior. Per report from Teodor Sigaev.
The only currently known consequence of this bug is "MergeAppend child's
targetlist doesn't match MergeAppend" planner failures in 9.1 and later.
I am suspicious that there may be other failure modes that could affect
older release branches; but in the absence of any hard evidence, I'll
refrain from back-patching further than 9.1.
a new macro, DatumGetInetPP(), that does not. This brings these macros
in line with other DatumGet*P() macros.
Backpatch to 8.3, where 1-byte header varlenas were introduced.
In a regular VACUUM, it's OK to skip pages for which a cleanup lock
isn't immediately available; the next VACUUM will deal with them. If
we're scanning the entire relation to advance relfrozenxid, we might
need to wait, but only if there are tuples on the page that actually
require freezing. These changes should greatly reduce the incidence
of of vacuum processes getting "stuck".
Simon Riggs and Robert Haas
Further experimentation reveals that my previous change didn't fix the
issue entirely: these tests would still fail at the spring-forward DST
transition. There doesn't seem to be any great value in testing this
specific issue for both timestamp and timestamptz, so just lose the
latter tests.
It was inadvertently changed to 201111111, which is a wrong date. Change it
to current date, and remove the comment that was supposed to remind me to
fix it before committing.
This assumption can be wrong when the toaster is passed a raw on-disk
tuple, because the tuple might pre-date an ALTER TABLE ADD COLUMN operation
that added columns without rewriting the table. In such a case the tuple's
natts value is smaller than what we expect from the tuple descriptor, and
so its t_hoff value could be smaller too. In fact, the tuple might not
have a null bitmap at all, and yet our current opinion of it is that it
contains some trailing nulls.
In such a situation, toast_insert_or_update did the wrong thing, because
to save a few lines of code it would use the old t_hoff value as the offset
where heap_fill_tuple should start filling data. This did not leave enough
room for the new nulls bitmap, with the result that the first few bytes of
data could be overwritten with null flag bits, as in a recent report from
Hubert Depesz Lubaczewski.
The particular case reported requires ALTER TABLE ADD COLUMN followed by
CREATE TABLE AS SELECT * FROM ... or INSERT ... SELECT * FROM ..., and
further requires that there be some out-of-line toasted fields in one of
the tuples to be copied; else we'll not reach the troublesome code.
The problem can only manifest in this form in 8.4 and later, because
before commit a77eaa6a95, CREATE TABLE AS or
INSERT/SELECT wouldn't result in raw disk tuples getting passed directly
to heap_insert --- there would always have been at least a junkfilter in
between, and that would reconstitute the tuple header with an up-to-date
t_natts and hence t_hoff. But I'm backpatching the tuptoaster change all
the way anyway, because I'm not convinced there are no older code paths
that present a similar risk.