Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.
Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
(I'm not entirely sure that we've finished bikeshedding the syntax details,
but the functionality seems OK.)
Pavel Stehule, reviewed by Stephen Frost and Tom Lane
They are expected to be used by extension modules like file_fdw.
There are no user-visible changes.
Itagaki Takahiro
Reviewed and tested by Kevin Grittner and Noah Misch.
Recent releases had a check on rel->rd_refcnt in heap_drop_with_catalog,
but failed to cover the possibility of pending trigger events at DROP time.
(Before 8.4 we didn't even check the refcnt.) When the trigger events were
eventually fired, you'd get "could not open relation with OID nnn" errors,
as in recent report from strk. Better to throw a suitable error when the
DROP is attempted.
Also add a similar check in DROP INDEX.
Back-patch to all supported branches.
though must not update the last transaction timestamp.
Plus comment and message cleanup for recent named restore point.
Fujii Masao, minor changes by me
The original design of pg_available_extensions did not consider the
possibility of version-specific control files. Split it into two views:
pg_available_extensions shows information that is generic about an
extension, while pg_available_extension_versions shows all available
versions together with information that could be version-dependent.
Also, add an SRF pg_extension_update_paths() to assist in checking that
a collection of update scripts provide sane update path sequences.
This allows us to have an unambiguous rule for deconstructing the names
of script files and secondary control files, without having to forbid
extension and version names from containing any dashes. We do have to
forbid them from containing double dashes or leading/trailing dashes,
but neither restriction is likely to bother anyone in practice.
Per discussion, this seems like a better solution overall than the
original design.
This change causes a multi-step update sequence to behave exactly as if the
updates had been commanded one at a time, including updating the "requires"
dependencies afresh at each step. The initial implementation took the
shortcut of examining only the final target version's "requires" and
changing the catalog entry but once. But on reflection that's a bad idea,
since it could lead to executing old update scripts under conditions
different than they were designed/tested for. Better to expend a few extra
cycles and avoid any surprises.
In the same spirit, if a CREATE EXTENSION FROM operation involves applying
a series of update files, it will act as though the CREATE had first been
done using the initial script's target version and then the additional
scripts were invoked with ALTER EXTENSION UPDATE.
I also removed the restriction about not changing encoding in secondary
control files. The new rule is that a script is assumed to be in whatever
encoding the control file(s) specify for its target version. Since this
reimplementation causes us to read each intermediate version's control
file, there's no longer any uncertainty about which encoding setting would
get applied.
relative, by creating a function path_is_relative_and_below_cwd() to
check for specific requirements. It is unclear if this fixes a security
problem or not but the new code is more robust.
- collowner field
- CREATE COLLATION
- ALTER COLLATION
- DROP COLLATION
- COMMENT ON COLLATION
- integration with extensions
- pg_dump support for the above
- dependency management
- psql tab completion
- psql \dO command
When the old type is binary coercible to the new type and the using
clause does not change the column contents, we can avoid a full table
rewrite, though any indexes on the affected columns will still need
to be rebuilt. This applies, for example, when changing a varchar
column to be of type text.
The prior coding assumed that the set of operations that force a
rewrite is identical to the set of operations that must be propagated
to tables making use of the affected table's rowtype. This is
no longer true: even though the tuples in those tables wouldn't
need to be modified, the data type change invalidate indexes built
using those composite type columns. Indexes on the table we're
actually modifying can be invalidated too, of course, but the
existing machinery is sufficient to handle that case.
Along the way, add some debugging messages that make it possible
to understand what operations ALTER TABLE is actually performing
in these cases.
Noah Misch and Robert Haas
Arrange for the control files to be in $SHAREDIR/extension not
$SHAREDIR/contrib, since we're generally trying to deprecate the term
"contrib" and this is a once-in-many-moons opportunity to get rid of it in
install paths. Fix PGXS to install the $EXTENSION file into that directory
no matter what MODULEDIR is set to; a nondefault MODULEDIR should only
affect the script and secondary extension files. Fix the control file
directory parameter to be interpreted relative to $SHAREDIR, to avoid a
surprising disconnect between how you specify that and what you set
MODULEDIR to.
Per discussion with David Wheeler.
This follows recent discussions, so it's quite a bit different from
Dimitri's original. There will probably be more changes once we get a bit
of experience with it, but let's get it in and start playing with it.
This is still just core code. I'll start converting contrib modules
shortly.
Dimitri Fontaine and Tom Lane
Per discussion with Noah Misch, the previous coding, introduced by
my commit 65377e0b9c on 2011-02-06,
was really an abuse of RELKIND_COMPOSITE_TYPE, since the caller in
typecmds.c is actually passing the name of a domain. So go back
having a type name argument, but make the first argument a Relation
rather than just a string so we can tell whether it's a table or
a foreign table and emit the proper error message.
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.
Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
Tracks one counter for each database, which is reset whenever
the statistics for any individual object inside the database is
reset, and one counter for the background writer.
Tomas Vondra, reviewed by Greg Smith
Flattening of subquery range tables during setrefs.c could lead to the
rangetable indexes in PlanRowMark nodes not matching up with the column
names previously assigned to the corresponding resjunk ctid (resp. tableoid
or wholerow) columns. Typical symptom would be either a "cannot extract
system attribute from virtual tuple" error or an Assert failure. This
wasn't a problem before 9.0 because we didn't support FOR UPDATE below the
top query level, and so the final flattening could never renumber an RTE
that was relevant to FOR UPDATE. Fix by using a plan-tree-wide unique
number for each PlanRowMark to label the associated resjunk columns, so
that the number need not change during flattening.
Per report from David Johnston (though I'm darned if I can see how this got
past initial testing of the relevant code). Back-patch to 9.0.
This follows my proposal of yesterday, namely that we try to recreate the
previous state of the extension exactly, instead of allowing CREATE
EXTENSION to run a SQL script that might create some entirely-incompatible
on-disk state. In --binary-upgrade mode, pg_dump won't issue CREATE
EXTENSION at all, but instead uses a kluge function provided by
pg_upgrade_support to recreate the pg_extension row (and extension-level
pg_depend entries) without creating any member objects. The member objects
are then restored in the same way as if they weren't members, in particular
using pg_upgrade's normal hacks to preserve OIDs that need to be preserved.
Then, for each member object, ALTER EXTENSION ADD is issued to recreate the
pg_depend entry that marks it as an extension member.
In passing, fix breakage in pg_upgrade's enum-type support: somebody didn't
fix it when the noise word VALUE got added to ALTER TYPE ADD. Also,
rationalize parsetree representation of COMMENT ON DOMAIN and fix
get_object_address() to allow OBJECT_DOMAIN.
This is an essential component of making the extension feature usable;
first because it's needed in the process of converting an existing
installation containing "loose" objects of an old contrib module into
the extension-based world, and second because we'll have to use it
in pg_dump --binary-upgrade, as per recent discussion.
Loosely based on part of Dimitri Fontaine's ALTER EXTENSION UPGRADE
patch.
Specifying this option makes the server not wait for the
xlog to be archived, or emit a warning that it can't,
instead leaving the responsibility with the client.
This is useful when the log is being streamed using
the streaming protocol in parallel with the backup,
without having log archiving enabled.
Older versions of gcc tend to throw "variable might be clobbered by
`longjmp' or `vfork'" warnings whenever a variable is assigned in more than
one place and then used after the end of a PG_TRY block. That's reasonably
easy to work around in execute_extension_script, and the overhead of
unconditionally saving/restoring the GUC variables seems unlikely to be a
serious concern.
Also clean up logic in ATExecValidateConstraint to make it easier to read
and less likely to provoke "variable might be used uninitialized in this
function" warnings.
This patch adds the server infrastructure to support extensions.
There is still one significant loose end, namely how to make it play nice
with pg_upgrade, so I am not yet committing the changes that would make
all the contrib modules depend on this feature.
In passing, fix a disturbingly large amount of breakage in
AlterObjectNamespace() and callers.
Dimitri Fontaine, reviewed by Anssi Kääriäinen,
Itagaki Takahiro, Tom Lane, and numerous others
This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.
Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
new recovery.conf parameter recovery_target_name allows PITR to
specify named points as recovery targets.
Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits
If the standby was streaming when trigger file arrives, check also in the
archive for additional WAL files. This is a corner case since it is
unlikely that we would trigger a failover while the master is still
available and sending data to standby, while at the same time running in
archive mode and also while the streaming standby has fallen behind archive.
Someone would eventually be unlucky; we must plug all gaps however small.
Fujii Masao
FK constraints that are marked NOT VALID may later be VALIDATED, which uses an
ShareUpdateExclusiveLock on constraint table and RowShareLock on referenced
table. Significantly reduces lock strength and duration when adding FKs.
New state visible from psql.
Simon Riggs, with reviews from Marko Tiikkaja and Robert Haas
Waiting for relation locks can lead to starvation - it pins down an
autovacuum worker for as long as the lock is held. But if we're doing
an anti-wraparound vacuum, then we still wait; maintenance can no longer
be put off.
To assist with troubleshooting, if log_autovacuum_min_duration >= 0,
we log whenever an autovacuum or autoanalyze is skipped for this reason.
Per a gripe by Josh Berkus, and ensuing discussion.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
If the foreign table's rowtype is being used as the type of a column in
another table, we can't just up and change its data type. This was
already checked for composite types and ordinary tables, but we
previously failed to enforce it for foreign tables.
Make sure it's clear that the prohibition on adding a column with a default
when the rowtype is used elsewhere is intentional, and be a bit more
explicit about the other cases where we perform this check.
This fixes make distprep, and seems more robust in other ways as well.
Some special handling is required because errcodes.txt is needed by
some stuff in src/port, but just by src/backend as is the case for the
other generated headers.
While I'm at it, fix a few other things that were overlooked in the
original patch.
src/pl/plpgsql/src/plerrcodes.h, src/include/utils/errcodes.h, and a
big chunk of errcodes.sgml are now automatically generated from a single
file, src/backend/utils/errcodes.txt.
Jan Urbański, reviewed by Tom Lane.
Add the current xlog insert location to the response of
IDENTIFY_SYSTEM, and adds result sets containing start
and stop location of backups to BASE_BACKUP responses.
Prior to 9.0, restartpoints never created, deleted, or recycled WAL
files, but now they can. This code makes log_checkpoints treat
checkpoints and restartpoints symmetrically. It also adjusts up
the documentation of the parameter to mention restartpoints.
Fujii Masao. Docs by me, as suggested by Itagaki Takahiro.
Previously reported as ERRCODE_ADMIN_SHUTDOWN, this case is now
reported as ERRCODE_T_R_DATABASE_DROPPED. No message text change.
Unlikely to happen on most servers, so low impact change to allow
session poolers to correctly handle this situation.
Tatsuo Ishii, edits by me, review by Robert Haas
All retryable conflict errors now have an error code that indicates that
a retry is possible, correcting my incomplete fix of 2010/05/12
Tatsuo Ishii and Simon Riggs, input from Robert Haas and Florian Pflug
With this patch, pg_basebackup doesn't write a backup_label file in the
data directory, so it doesn't interfere with a pg_start/stop_backup() based
backup anymore. backup_label is still included in the backup, but it is
injected directly into the tar stream.
Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.
reduce_outer_joins() mistakenly treated a semijoin like a left join for
purposes of deciding whether not-null constraints created by the join's
quals could be passed down into the join's left-hand side (possibly
resulting in outer-join simplification there). Actually, semijoin works
like inner join for this purpose, ie, we do not need to see any rows that
can't possibly satisfy the quals. Hence, two-line fix to treat semi and
inner joins alike. Per observation by Andres Freund about a performance
gripe from Yazan Suleiman.
Back-patch to 8.4, since this oversight has been there since the current
handling of semijoins was implemented.
When included, this makes the base backup a complete working
"clone" of the initial database, ready to have a postmaster
started against it without the need to set up any log archiving
or similar.
Magnus Hagander, reviewed by Fujii Masao and Heikki Linnakangas
When we need to insert a new entry and the queue is full, compact the
entire queue in the hopes of making room for the new entry. Doing this
on every insertion might worsen contention on BgWriterCommLock, but
when the queue it's full, it's far better than allowing the backend to
perform its own fsync, per testing by Greg Smith as reported in
http://archives.postgresql.org/pgsql-hackers/2011-01/msg02665.php
Original idea from Greg Smith. Patch by me. Review by Chris Browne
and Greg Smith
We only need that header when compiling with icc, since the gcc variant of
ia64_get_bsp() uses in-line assembly code. Per report from Frank Brendel,
the header doesn't exist on all IA64 platforms; so don't include it unless
we need it.
This reverts commit a06e41deeb of 2011-01-26.
Per discussion, this behavior is not wanted, as it would need to change if
we ever made composite types support DEFAULT.
In the case where the initial call of systable_getnext_ordered() returned
NULL, this function would nonetheless call it again. That's undefined
behavior that only by chance failed to not give visibly incorrect results.
Put an if-test around the final loop to prevent that, and in passing
improve some comments. No back-patch since there's no actual failure.
Per report from YAMAMOTO Takashi.
The previous coding prevented ALTER TABLE .. ADD COLUMN from being used
with a non-NULL default in situations where the table's rowtype was being
used elsewhere. But this is a completely arbitrary restriction since
you could do the same operation in multiple steps (add the column, add
the default, update the table).
Inspired by a patch from Noah Misch, though I didn't use his code.
There isn't any need to track this state on a table-wide basis, and trying
to do so introduces undesirable semantic fuzziness. Move the flag to
pg_index, where it clearly describes just a single index and can be
immutable after index creation.
This feature allows a unique or pkey constraint to be created using an
already-existing unique index. While the constraint isn't very
functionally different from the bare index, it's nice to be able to do that
for documentation purposes. The main advantage over just issuing a plain
ALTER TABLE ADD UNIQUE/PRIMARY KEY is that the index can be created with
CREATE INDEX CONCURRENTLY, so that there is not a long interval where the
table is locked against updates.
On the way, refactor some of the code in DefineIndex() and index_create()
so that we don't have to pass through those functions in order to create
the index constraint's catalog entries. Also, in parse_utilcmd.c, pass
around the ParseState pointer in struct CreateStmtContext to save on
notation, and add error location pointers to some error reports that didn't
have one before.
Gurjeet Singh, reviewed by Steve Singer and Tom Lane
While doing this, also move base backup options into
a struct instead of increasing the number of parameters
to multiple functions for each new option.
Include the lefttype/righttype columns explicitly (instead of assuming
the reader can deduce them from the operator or function description),
and move the operator or function description to the end of the string,
to make it clearer that it's a referenced object and not the amop or
amproc item itself. Per extensive discussion of Andreas Karlsson's
original patch.
Andreas Karlsson, Tom Lane
This tool makes it possible to do the pg_start_backup/
copy files/pg_stop_backup step in a single command.
There are still some steps to be done before this is a
complete backup solution, such as the ability to stream
the required WAL logs, but it's still usable, and
could do with some buildfarm coverage.
In passing, make the checkpoint request optionally
fast instead of hardcoding it.
Magnus Hagander, reviewed by Fujii Masao and Dimitri Fontaine
As in commit fb4c5d2798 on 2011-01-21,
this avoids spurious debug messages and allows idempotent changes at
any time. Along the way, make assign_XactIsoLevel allow idempotent
changes even when not within a subtransaction, to be consistent with
the new coding of assign_transaction_read_only and because there's
no compelling reason to do otherwise.
Kevin Grittner, with some adjustments.
If wal_buffers is initially set to -1 (which is now the default), it's
replaced by 1/32nd of shared_buffers, with a minimum of 8 (the old default)
and a maximum of the XLOG segment size. The allowed range for manual
settings is still from 4 up to whatever will fit in shared memory.
Greg Smith, with implementation correction by me.
The previous coding treated anything that wasn't an autovacuum launcher
as a normal backend, which is wrong now that we also have WAL senders.
Fujii Masao, reviewed by Robert Haas, Alvaro Herrera, Tom Lane,
and Bernd Helmle.
The new coding avoids a spurious debug message when a transaction
that has changed the isolation level has been rolled back. It also
allows the property to be freely changed to the current value within
a subtransaction.
Kevin Grittner, with one small change by me.
Failure to do so can lead to constraint violations. This was broken by
commit 1ddc2703a9 on 2010-02-07, so
back-patch to 9.0.
Noah Misch. Regression test by me.
We can get the length of a compressed or out-of-line datum without actually
detoasting it. If the lengths of two strings are unequal, we can then
conclude they are unequal without detoasting. That saves considerable work
in an admittedly less-common case, without costing anything much when the
optimization doesn't apply.
Noah Misch
If the slice to be assigned to was before the existing array lower bound
(requiring at least one null element to spring into existence to fill the
gap), the code miscalculated how many entries needed to be copied from
the old array's null bitmap. This could result in trashing the array's
data area (as seen in bug #5840 from Karsten Loesing), or worse.
This has been broken since we first allowed the behavior of assigning to
non-adjacent slices, in 8.2. Back-patch to all affected versions.
Otherwise WAL recovery will replay the un-flushed WAL after walreceiver has
exited, which can lead to a non-recoverable standby if the system crashes hard
at that point.
This closes a race condition where if a tablespace was created
after the enumeration happened but before the do_pg_start_backup()
was called, the backup would be incomplete. Now that it's done
while we are in backup mode, WAL replay will recreate it during
restore.
Noted by Heikki.
backend, as far as the postmaster shutdown logic is concerned. That means,
fast shutdown will wait for WAL sender processes to exit before signaling
bgwriter to finish. This avoids race conditions between a base backup stopping
or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't
want e.g the end-of-backup WAL record to be written after the shutdown
checkpoint.
Makes it easier to parse mainly the BASE_BACKUP command
with it's options, and avoids having to manually deal
with quoted identifiers in the label (previously broken),
and makes it easier to add new commands and options in
the future.
In passing, refactor the case statement in the walsender
to put each command in it's own function.
When the exit waits until the whole backup completes, it may take
a very long time.
In passing, add back an error check in the main loop so we detect
clients that disconnect much earlier if the backup is large.
Fix broken test for pre-existing postmaster, caused by wrong code for
appending lines to the lockfile; don't write a failed listen_address
setting into the lockfile; don't arbitrarily change the location of the
data directory in the lockfile compared to previous releases; provide more
consistent and useful definitions of the socket path and listen_address
entries; avoid assuming that pg_ctl has the same DEFAULT_PGSOCKET_DIR as
the postmaster; assorted code style improvements.
This reverts commit d1001a78ce of 2010-12-05,
which was broken as reported by Jeff Davis. The problem is that the
individual planning steps may have side-effects on substructures of
PlannerGlobal, not only the current PlannerInfo root. Arranging to keep
all such side effects in the main planning context is probably possible,
but it would change this from a quick local hack into a wide-ranging and
rather fragile endeavor. Which it's not worth.
that can be read without blocking. It used to conclude that there isn't, even
though there was data in the socket receive buffer. That lead walreceiver to
flush the WAL after every received chunk, potentially causing big performance
issues.
Backpatch to 9.0, because the performance impact can be very significant.
Changing a file two directory levels deep under src/backend/ would not
cause the postgres binary to be rebuilt. This change fixes it, but no
one knows why.
In an inherited UPDATE/DELETE, each target table has its own subplan,
because it might have a column set different from other targets. This
means that the resjunk columns we add to support EvalPlanQual might be
at different physical column numbers in each subplan. The EvalPlanQual
rewrite I did for 9.0 failed to account for this, resulting in possible
misbehavior or even crashes during concurrent updates to the same row,
as seen in a recent report from Gordon Shannon. Revise the data structure
so that we track resjunk column numbers separately for each subplan.
I also chose to move responsibility for identifying the physical column
numbers back to executor startup, instead of assuming that numbers derived
during preprocess_targetlist would stay valid throughout subsequent
massaging of the plan. That's a bit slower, so we might want to consider
undoing it someday; but it would complicate the patch considerably and
didn't seem justifiable in a bug fix that has to be back-patched to 9.0.
Some versions of gcc complain about "variable `tablespaces' might be
clobbered by `longjmp' or `vfork'" with the original coding. Fix by
moving the PG_TRY block into a separate subroutine.
Per my note of a couple days ago, create_index_paths would refuse to
consider any path at all for GIN indexes if the selectivity estimate came
out as 1.0; not even if you tried to force it with enable_seqscan. While
this isn't really a bad outcome in practice, it could be annoying for
testing purposes. Adjust the test for "is this path only useful for
sorting" so that it doesn't fire on paths with nil pathkeys, which will
include all GIN paths.
When in streaming mode we can never get out, so it will never
be required, but after a base backup (or other operations)
we can get back to the loop, so the title needs to be cleared.
Add BASE_BACKUP command to walsender, allowing it to stream a
base backup to the client (in tar format). The syntax is still
far from ideal, that will be fixed in the switch to use a proper
grammar for walsender.
No client included yet, will come as a separate commit.
Magnus Hagander and Heikki Linnakangas
Move the actual functionality into a separate function that's
easier to call internally, and change the SQL-callable function
to be a wrapper calling this.
Also create a pg_abort_backup() function, only callable internally,
that does only the most vital parts of pg_stop_backup(), making it
safe(r) to call from error handlers.
This applies the fix for bug #5784 to remaining places where we wish
to reject nulls in user-supplied arrays. In all these places, there's
no reason not to allow a null bitmap to be present, so long as none of
the current elements are actually null.
I did not change some other places where we are looking at system catalog
entries or aggregate transition values, as the presence of a null bitmap
in such an array would be suspicious.
This will support fixing contrib/intarray (and probably other places)
so that they don't have to fail on arrays that contain a null bitmap
but no live null entries.
The only reason this wasn't crashing while testing the core anyarray
operators was that it was disabled for those cases because of passing the
wrong type information to get_opfamily_proc :-(. So fix that too, and make
it insist on finding the support proc --- in hindsight, silently doing
nothing is not as sane a coping mechanism as all that.
The only use we have had for amindexnulls is in determining whether an
index is safe to cluster on; but since the addition of the amclusterable
flag, that usage is pretty redundant.
In passing, clean up assorted sloppiness from the last patch that touched
pg_am.h: Natts_pg_am was wrong, and ambuildempty was not documented.
The original coding could combine duplicate entries only when they
originated from the same qual condition. In particular it could not
combine cases where multiple qual conditions all give rise to full-index
scan requests, which is an expensive case well worth optimizing. Refactor
so that duplicates are recognized across all the quals.
which is stored in pg_largeobject_metadata.
No backpatch to 9.0 because you can't migrate from 9.0 to 9.0 with the
same catversion (because of tablespace conflict), and a pre-9.0
migration to 9.0 has not large object permissions to migrate.
Toast tables have identical pg_class.oid and pg_class.relfilenode, but
for clarity it is good to preserve the pg_class.oid.
Update comments regarding what is preserved, and do some
variable/function renaming for clarity.
Per my recent proposal(s). Null key datums can now be returned by
extractValue and extractQuery functions, and will be stored in the index.
Also, placeholder entries are made for indexable items that are NULL or
contain no keys according to extractValue. This means that the index is
now always complete, having at least one entry for every indexed heap TID,
and so we can get rid of the prohibition on full-index scans. A full-index
scan is implemented much the same way as partial-match scans were already:
we build a bitmap representing all the TIDs found in the index, and then
drive the results off that.
Also, introduce a concept of a "search mode" that can be requested by
extractQuery when the operator requires matching to empty items (this is
just as cheap as matching to a single key) or requires a full index scan
(which is not so cheap, but it sure beats failing or giving wrong answers).
The behavior remains backward compatible for opclasses that don't return
any null keys or request a non-default search mode.
Using these features, we can now make the GIN index opclass for anyarray
behave in a way that matches the actual anyarray operators for &&, <@, @>,
and = ... which it failed to do before in assorted corner cases.
This commit fixes the core GIN code and ginarrayprocs.c, updates the
documentation, and adds some simple regression test cases for the new
behaviors using the array operators. The tsearch and contrib GIN opclass
support functions still need to be looked over and probably fixed.
Another thing I intend to fix separately is that this is pretty inefficient
for cases where more than one scan condition needs a full-index search:
we'll run duplicate GinScanEntrys, each one of which builds a large bitmap.
There is some existing logic to merge duplicate GinScanEntrys but it needs
refactoring to make it work for entries belonging to different scan keys.
Note that most of gin.h has been split out into a new file gin_private.h,
so that gin.h doesn't export anything that's not supposed to be used by GIN
opclasses or the rest of the backend. I did quite a bit of other code
beautification work as well, mostly fixing comments and choosing more
appropriate names for things.
This can be overriden by using NOREPLICATION on the CREATE ROLE
statement, but by default they will have it, making it backwards
compatible and "less surprising" (given that superusers normally
override all checks).
Add new function pg_sequence_parameters that returns a sequence's start,
minimum, maximum, increment, and cycle values, and use that in the view.
(bug #5662; design suggestion by Tom Lane)
Also slightly adjust the view's column order and permissions after review of
SQL standard.
Foreign tables are a core component of SQL/MED. This commit does
not provide a working SQL/MED infrastructure, because foreign tables
cannot yet be queried. Support for foreign table scans will need to
be added in a future patch. However, this patch creates the necessary
system catalog structure, syntax support, and support for ancillary
operations such as COMMENT and SECURITY LABEL.
Shigeru Hanada, heavily revised by Robert Haas
This is advantageous first because it allows us to hash the smaller table
regardless of the outer-join type, and second because hash join can be more
flexible than merge join in dealing with arbitrary join quals in a FULL
join. For merge join all the join quals have to be mergejoinable, but hash
join will work so long as there's at least one hashjoinable qual --- the
others can be any condition. (This is true essentially because we don't
keep per-inner-tuple match flags in merge join, while hash join can do so.)
To do this, we need a has-it-been-matched flag for each tuple in the
hashtable, not just one for the current outer tuple. The key idea that
makes this practical is that we can store the match flag in the tuple's
infomask, since there are lots of bits there that are of no interest for a
MinimalTuple. So we aren't increasing the size of the hashtable at all for
the feature.
To write this without turning the hash code into even more of a pile of
spaghetti than it already was, I rewrote ExecHashJoin in a state-machine
style, similar to ExecMergeJoin. Other than that decision, it was pretty
straightforward.
Instead, declare a public wrapper of the sole function using it for
external callers, so that they don't have to always pass a NULL
argument.
Author: Kevin Grittner
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery. Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
This privilege is required to do Streaming Replication, instead of
superuser, making it possible to set up a SR slave that doesn't
have write permissions on the master.
Superuser privileges do NOT override this check, so in order to
use the default superuser account for replication it must be
explicitly granted the REPLICATION permissions. This is backwards
incompatible change, in the interest of higher default security.
The "date" type supports a wider range of dates than int64 timestamps do.
However, there is pre-int64-timestamp code in the planner that assumes that
all date values can be converted to timestamp with impunity. Fortunately,
what we really need out of the conversion is always a double (float8)
value; so even when the date is out of timestamp's range it's possible to
produce a sane answer. All we need is a code path that doesn't try to
force the result into int64. Per trouble report from David Rericha.
Back-patch to all supported versions. Although this is surely a corner
case, there's not much point in advertising a date range wider than
timestamp's if we will choke on such values in unexpected places.
This is to avoid use of the C++ keywords "bitand" and "bitor" in
the header file utils/varbit.h. Note the functions' SQL-level
names are not changed, only their C-level names.
In passing, make some comments in varbit.c conform to project-standard
layout.
"private" is a keyword in C++, so this breaks the poorly-enforced policy
that header files should be include-able in C++ code. Per report from
Craig Ringer and some investigation with cpluspluscheck.