of this are to centralise the conflict code to allow further change,
as well as to allow passing through the full reason for the conflict
through to the conflicting backends. Backend state alters how we
can handle different types of conflict so this is now required.
As originally suggested by Heikki, no longer optional.
indexscans would do the wrong thing if index_rescan() was called with a
NULL instead of a new set of scankeys and the index was DESC order,
because sk_strategy would not get flipped a second time. I think
that those provisions for a NULL argument are dead code now as far as the
core backend goes, but possibly somebody somewhere is still using it.
In any case, this refactoring seems clearer, and it's definitely shorter.
to be just a minor extension of the previous patch that made "x IS NULL"
indexable, because we can treat the IS NOT NULL condition as if it were
"x < NULL" or "x > NULL" (depending on the index's NULLS FIRST/LAST option),
just like IS NULL is treated like "x = NULL". Aside from any possible
usefulness in its own right, this is an important improvement for
index-optimized MAX/MIN aggregates: it is now reliably possible to get
a column's min or max value cheaply, even when there are a lot of nulls
cluttering the interesting end of the index.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
friends). This code has all been ifdef'd out for many years, and doesn't
seem to have any prospect of becoming any more useful in the future.
EXPLAIN ANALYZE is what people use in practice, and I think if we did want
process-wide counters we'd be more likely to put in dtrace events for that
than try to resurrect this code. Get rid of it so as to have one less detail
to worry about while refactoring execMain.c.
tuple size limit. Improve the error message for index-tuple-too-large
so that it includes the actual size, the limit, and the index name.
Sync with the btree occurrences of the same error.
Back-patch to 8.4 because it appears that the out-of-sync problem
is occurring in the field.
Teodor and Tom
values being complained of.
In passing, also remove the arbitrary length limitation in the similar
error detail message for foreign key violations.
Itagaki Takahiro
The current implementation fires an AFTER ROW trigger for each tuple that
looks like it might be non-unique according to the index contents at the
time of insertion. This works well as long as there aren't many conflicts,
but won't scale to massive unique-key reassignments. Improving that case
is a TODO item.
Dean Rasheed
behavior in cases where we don't know the heap tuple count accurately; in
particular partial vacuum, but this also makes the API a bit more useful
for ANALYZE. This patch adds "estimated_count" flags to both structs so
that an approximate count can be flagged as such, and adjusts the logic
so that approximate counts are not used for updating pg_class.reltuples.
This fixes my previous complaint that VACUUM was putting ridiculous values
into pg_class.reltuples for indexes. The actual impact of that bug is
limited, because the planner only pays attention to reltuples for an index
if the index is partial; which probably explains why beta testers hadn't
noticed a degradation in plan quality from it. But it needs to be fixed.
The whole thing is a bit messy and should be redesigned in future, because
reltuples now has the potential to drift quite far away from reality when
a long period elapses with no non-partial vacuums. But this is as good as
it's going to get for 8.4.
points where we step right or left to the next page. This should ensure
reasonable response time to a query cancel request during an unsuccessful
index scan, as seen in recent gripe from Marc Cousin. It's a bit trickier
than it might seem at first glance, because CHECK_FOR_INTERRUPTS() is a no-op
if executed while holding a buffer lock. So we have to do it just at the
point where we've dropped one page lock and not yet acquired the next.
Remove CHECK_FOR_INTERRUPTS calls at the top level of btgetbitmap and
hashgetbitmap, since they're pointless given the added checks.
I think that GIST is okay already --- at least, there's a CHECK_FOR_INTERRUPTS
at a plausible-looking place in gistnext(). I don't claim to know GIN well
enough to try to poke it for this, if indeed it has a problem at all.
This is a pre-existing issue, but in view of the lack of prior complaints
I'm not going to risk back-patching.
multiple index entries in a holding area before adding them to the main index
structure. This helps because bulk insert is (usually) significantly faster
than retail insert for GIN.
This patch also removes GIN support for amgettuple-style index scans. The
API defined for amgettuple is difficult to support with fastupdate, and
the previously committed partial-match feature didn't really work with
it either. We might eventually figure a way to put back amgettuple
support, but it won't happen for 8.4.
catversion bumped because of change in GIN's pg_am entry, and because
the format of GIN indexes changed on-disk (there's a metapage now,
and possibly a pending list).
Teodor Sigaev
be used instead of the normal exclusive lock, and make WAL redo functions
responsible for calling RestoreBkpBlocks(). They know better what kind of a
lock they need.
At the moment, this just moves things around with no functional change, but
makes the hot standby patch that's under review cleaner.
a more complete framework for writing custom option processing routines
by user-defined access methods.
Catalog version bumped due to the general API changes, which are going to
affect user-defined "amoptions" routines.
truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To
make that cleaner from modularity point of view, move the WAL-logging one
level up to RelationTruncate, and move RelationTruncate and all the
related WAL-logging to new src/backend/catalog/storage.c file. Introduce
new RelationCreateStorage and RelationDropStorage functions that are used
instead of calling smgrcreate/smgrscheduleunlink directly. Move the
pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new
functions. This leaves smgr.c as a thin wrapper around md.c; all the
transactional stuff is now in storage.c.
This will make it easier to add new forks with similar truncation logic,
like the visibility map.
for inserting tuples in increasing TID order. It's not clear whether this
fully explains Ivan Sergio Borgonovo's complaint, but simple testing
confirms that a scan that doesn't start at block 0 can slow GIN build by
a factor of three or four.
Backpatch to 8.3. Sync scan didn't exist before that.
functions into one ReadBufferExtended function, that takes the strategy
and mode as argument. There's three modes, RBM_NORMAL which is the default
used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and
a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages
without throwing an error. The FSM needs the new mode to recover from
corrupt pages, which could happend if we crash after extending an FSM file,
and the new page is "torn".
Add fork number to some error messages in bufmgr.c, that still lacked it.
free space information is stored in a dedicated FSM relation fork, with each
relation (except for hash indexes; they don't use FSM).
This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
trace of them from the backend, initdb, and documentation.
Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
introduce a new variant of the get_raw_page(regclass, int4, int4) function in
contrib/pageinspect that let's you to return pages from any relation fork, and
a new fsm_page_contents() function to inspect the new FSM pages.
of multiple forks, and each fork can be created and grown separately.
The bulk of this patch is about changing the smgr API to include an extra
ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
smgrdounlink no longer implicitly call smgrclose, because other forks might
still exist after unlinking one. The callers of those functions have been
modified to call smgrclose instead.
This patch in itself doesn't have any user-visible effect, but provides the
infrastructure needed for upcoming patches. The additional forks envisioned
are a rewritten FSM implementation that doesn't rely on a fixed-size shared
memory block, and a visibility map to allow skipping portions of a table in
VACUUM that have no dead tuples.
SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that
makes the code clearer, and avoid casting between Page and PageHeader where
possible. Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.
I did not apply the parts of the proposed patch that would have resulted in
slightly changing the on-disk format of hash indexes; it seems to me that's
not a win as long as there's any chance of having in-place upgrade for 8.4.
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
forks. XLogOpenRelation() and the associated light-weight relation cache in
xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument,
instead of Relation.
For functions that still need a Relation struct during WAL replay, there's a
new function called CreateFakeRelcacheEntry() that returns a fake entry like
XLogOpenRelation() used to.
CacheInvalidateRelcache() crashes if called in WAL recovery, because the
invalidation infrastructure hasn't been initialized yet.
Back-patch to 8.2, where the bug was introduced.
the associated datatype as their equality member. This means that these
opclasses can now support plain equality comparisons along with LIKE tests,
thus avoiding the need for an extra index in some applications. This
optimization was not possible when the pattern opclasses were first introduced,
because we didn't insist that text equality meant bitwise equality; but we
do now, so there is no semantic difference between regular and pattern
equality operators.
I removed the name_pattern_ops opclass altogether, since it's really useless:
name's regular comparisons are just strcmp() and are unlikely to become
something different. Instead teach indxpath.c that btree name_ops can be
used for LIKE whether or not the locale is C. This might lead to a useful
speedup in LIKE queries on the system catalogs in non-C locales.
The ~=~ and ~<>~ operators are gone altogether. (It would have been nice to
keep them for backward compatibility's sake, but since the pg_amop structure
doesn't allow multiple equality operators per opclass, there's no way.)
A not-immediately-obvious incompatibility is that the sort order within
bpchar_pattern_ops indexes changes --- it had been identical to plain
strcmp, but is now trailing-blank-insensitive. This will impact
in-place upgrades, if those ever happen.
Per discussions a couple months ago.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
corrupted. (Neither is very important if SIGTERM is used to shut down the
whole database cluster together, but there's a problem if someone tries to
SIGTERM individual backends.) To do this, introduce new infrastructure
macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care
of transiently pushing an on_shmem_exit cleanup hook. Also use this method
for createdb cleanup --- that wasn't a shared-memory-corruption problem,
but SIGTERM abort of createdb could leave orphaned files lying around.
Backpatch as far as 8.2. The shmem corruption cases don't exist in 8.1,
and the createdb usage doesn't seem important enough to risk backpatching
further.
instead of plan time. Extend the amgettuple API so that the index AM returns
a boolean indicating whether the indexquals need to be rechecked, and make
that rechecking happen in nodeIndexscan.c (currently the only place where
it's expected to be needed; other callers of index_getnext are just erroring
out for now). For the moment, GIN and GIST have stub logic that just always
sets the recheck flag to TRUE --- I'm hoping to get Teodor to handle pushing
that control down to the opclass consistent() functions. The planner no
longer pays any attention to amopreqcheck, and that catalog column will go
away in due course.
indexscan always occurs in one call, and the results are returned in a
TIDBitmap instead of a limited-size array of TIDs. This should improve
speed a little by reducing AM entry/exit overhead, and it is necessary
infrastructure if we are ever to support bitmap indexes.
In an only slightly related change, add support for TIDBitmaps to preserve
(somewhat lossily) the knowledge that particular TIDs reported by an index
need to have their quals rechecked when the heap is visited. This facility
is not really used yet; we'll need to extend the forced-recheck feature to
plain indexscans before it's useful, and that hasn't been coded yet.
The intent is to use it to clean up 8.3's horrid @@@ kluge for text search
with weighted queries. There might be other uses in future, but that one
alone is sufficient reason.
Heikki Linnakangas, with some adjustments by me.
snapmgmt.c file for the former. The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.
This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.
Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
bucket number, so as to ensure locality of access to the index during the
insertion step. Without this, building an index significantly larger than
available RAM takes a very long time because of thrashing. On the other
hand, sorting is just useless overhead when the index does fit in RAM.
We choose to sort when the initial index size exceeds effective_cache_size.
This is a revised version of work by Tom Raney and Shreya Bhargava.
since these seem to happen after all in corrupted indexes. Make sure we
supply the index name in all cases, and provide relevant block numbers where
available. Also consistently identify the index name as such.
Back-patch to 8.2, in hopes that this might help Mason Hale figure out his
problem.
it failed for splits of non-leaf pages because in such pages the first
data key on a page is suppressed, and so we can't just copy the first
key from the right page to reconstitute the left page's high key.
Problem found by Koichi Suzuki, patch by Heikki.
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
than two independent bits (one of which was never used in heap pages anyway,
or at least hadn't been in a very long time). This gives us flexibility to
add the HOT notions of redirected and dead item pointers without requiring
anything so klugy as magic values of lp_off and lp_len. The state values
are chosen so that for the states currently in use (pre-HOT) there is no
change in the physical representation.
buffers, rather than blowing out the whole shared-buffer arena. Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer. Those flushes will now occur only once per
ring-ful. The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done. The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.
This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it. To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.
Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.
Original patch by Simon, reworked by Heikki and again by Tom.
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.
Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures. And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
WAL records that shows whether it is safe to remove full-page images
(ie, whether or not an on-line backup was in progress when the WAL entry
was made). Also make provision for an XLOG_NOOP record type that can be
used to fill in the extra space when decompressing the data for restore.
This is the portion of Koichi Suzuki's "full page writes" patch that
has to go into the core database. The remainder of that work is two
external compression and decompression programs, which for the time being
will undergo separate development on pgfoundry. Per discussion.
Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be
possible to compress them (the previous coding caused essential info
to be omitted). The other commonly-used record types seem OK already,
with the possible exception of GIN and GIST WAL records, which I don't
understand well enough to opine on.
(original code *always* created a full-page image for the left page, thus
leaving the intended savings unrealized), avoid risk of not having enough room
on the page during xlog restore, squeeze out another couple bytes in the xlog
record, clean up neglected comments.
index types can be reliably distinguished by examining the special space
on an index page. Per my earlier proposal, plus the realization that
there's no need for btree's vacuum cycle ID to cycle through every possible
16-bit value. Restricting its range a little costs nearly nothing and
eliminates the possibility of collisions.
Memo to self: remember to make bitmap indexes play along with this scheme,
assuming that patch ever gets accepted.
will be released by transaction abort before _bt_end_vacuum gets called.
If either of these "can't happen" errors actually happened, we'd freeze up
trying to acquire an already-held lock. Latest word is that this does
not explain Martin Pitt's trouble report, but it still looks like a bug.
pointer" in every Snapshot struct. This allows removal of the case-by-case
tests in HeapTupleSatisfiesVisibility, which should make it a bit faster
(I didn't try any performance tests though). More importantly, we are no
longer violating portable C practices by assuming that small integers are
distinct from all pointer values, and HeapTupleSatisfiesDirty no longer
has a non-reentrant API involving side-effects on a global variable.
There were a couple of places calling HeapTupleSatisfiesXXX routines
directly rather than through the HeapTupleSatisfiesVisibility macro.
Since these places had to be changed anyway, I chose to make them go
through the macro for uniformity.
Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC
to emphasize that it's only used with MVCC-type snapshots. I was sorely
tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot,
but forebore for the moment to avoid confusion and reduce the likelihood that
this patch breaks some of the pending patches. Might want to reconsider
doing that later.
I refactored findsplitloc and checksplitloc so that the division of
labor is more clear IMO. I pushed all the space calculation inside the
loop to checksplitloc.
I also fixed the off by 4 in free space calculation caused by
PageGetFreeSpace subtracting sizeof(ItemIdData), even though it was
harmless, because it was distracting and I felt it might come back to
bite us in the future if we change the page layout or alignments.
There's now a new function PageGetExactFreeSpace that doesn't do the
subtraction.
findsplitloc now tries the "just the new item to right page" split as
well. If people don't like the refactoring, I can write a patch to just
add that.
Heikki Linnakangas
> Currently, an index split writes all the data on the split page to
> WAL. That's a lot of WAL traffic. The tuples that are copied to the
> right page need to be WAL logged, but the tuples that stay on the
> original page don't.
Heikki Linnakangas
Standard English uses "may", "can", and "might" in different ways:
may - permission, "You may borrow my rake."
can - ability, "I can lift that log."
might - possibility, "It might rain today."
Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice. Similarly, "It may crash" is better stated, "It might crash".
exactly at the point where we need to insert a new item, the calculation used
the wrong size for the "high key" of the new left page. This could lead to
choosing an unworkable split, resulting in "PANIC: failed to add item to the
left sibling" (or "right sibling") failure. Although this bug has been there
a long time, it's very difficult to trigger a failure before 8.2, since there
was generally a lot of free space on both sides of a chosen split. In 8.2,
where the user-selected fill factor determines how much free space the code
tries to leave, an unworkable split is much more likely. Report by Joe
Conway, diagnosis and fix by Heikki Linnakangas.
hold true for operators in a btree operator family. This is mostly to
clarify my own thinking about what the planner can assume for optimization
purposes. (blowing dust off an old abstract-algebra textbook...)
per-column options for btree indexes. The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options. The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.
Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass. This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
having md.c return a success/failure boolean to smgr.c, which was just going
to elog anyway, let md.c issue the elog messages itself. This allows better
error reporting, particularly in cases such as "short read" or "short write"
which Peter was complaining of. Also, remove the kluge of allowing mdread()
to return zeroes from a read-beyond-EOF: this is now an error condition
except when InRecovery or zero_damaged_pages = true. (Hash indexes used to
require that behavior, but no more.) Also, enforce that mdwrite() is to be
used for rewriting existing blocks while mdextend() is to be used for
extending the relation EOF. This restriction lets us get rid of the old
ad-hoc defense against creating huge files by an accidental reference to
a bogus block number: we'll only create new segments in mdextend() not
mdwrite() or mdread(). (Again, when InRecovery we allow it anyway, since
we need to allow updates of blocks that were later truncated away.)
Also, clean up the original makeshift patch for bug #2737: move the
responsibility for padding relation segments to full length into md.c.
or contradictory keys even in cross-data-type scenarios. This is another
benefit of the opfamily rewrite: we can find the needed comparison
operators now.
cases. Operator classes now exist within "operator families". While most
families are equivalent to a single class, related classes can be grouped
into one family to represent the fact that they are semantically compatible.
Cross-type operators are now naturally adjunct parts of a family, without
having to wedge them into a particular opclass as we had done originally.
This commit restructures the catalogs and cleans up enough of the fallout so
that everything still works at least as well as before, but most of the work
needed to actually improve the planner's behavior will come later. Also,
there are not yet CREATE/DROP/ALTER OPERATOR FAMILY commands; the only way
to create a new family right now is to allow CREATE OPERATOR CLASS to make
one by default. I owe some more documentation work, too. But that can all
be done in smaller pieces once this infrastructure is in place.
deletion code to avoid the case where an upper-level btree page remains "half
dead" for a significant period of time, and to block insertions into a key
range that is in process of being re-assigned to the right sibling of the
deleted page's parent. This prevents the scenario reported by Ed L. wherein
index keys could become out-of-order in the grandparent index level.
Since this is a moderately invasive fix, I'm applying it only to HEAD.
The bug exists back to 7.4, but the back branches will get a different patch.
even when a single relation requires more than max_fsm_pages pages. Also,
make VACUUM emit a warning in this case, since it likely means that VACUUM
FULL or other drastic corrective measure is needed. Per reports from Jeff
Frost and others of unexpected changes in the claimed max_fsm_pages need.
on the same index page; we can avoid data copying as well as buffer refcount
manipulations in this common case. Makes for a small but noticeable
improvement in mergejoin speed.
Heikki Linnakangas
operation every so often. This improves the usefulness of PITR log
shipping for hot standby: formerly, if the standby server crashed, it
was necessary to restart it from the last base backup and replay all
the WAL since then. Now it will only need to reread about the same
amount of WAL as the master server would. The behavior might also
come in handy during a long PITR replay sequence. Simon Riggs,
with some editorialization by Tom Lane.
When we are about to split an index page to do an insertion, first look
to see if any entries marked LP_DELETE exist on the page, and if so remove
them to try to make enough space for the desired insert. This should reduce
index bloat in heavily-updated tables, although of course you still need
VACUUM eventually to clean up the heap.
Junji Teramoto
it can handle small fillfactors for ordinary-sized index entries without
failing on large ones; fix nbtinsert.c to distinguish leaf and nonleaf
pages; change the minimum fillfactor to 10% for all index types.
discussion (including making def_arg allow reserved words), add missed
opt_definition for UNIQUE case. Put the reloptions support code in a less
random place (I chose to make a new file access/common/reloptions.c).
Eliminate header inclusion creep. Make the index options functions safely
user-callable (seems like client apps might like to be able to test validity
of options before trying to make an index). Reduce overhead for normal case
with no options by allowing rd_options to be NULL. Fix some unmaintainably
klugy code, including getting rid of Natts_pg_class_fixed at long last.
Some stylistic cleanup too, and pay attention to keeping comments in sync
with code.
Documentation still needs work, though I did fix the omissions in
catalogs.sgml and indexam.sgml.
per-tuple space overhead for sorts in memory. I chose to replace the
previous patch that tried to write out the bare minimum amount of data
when sorting on disk; instead, just dump the MinimalTuples as-is. This
wastes 3 to 10 bytes per tuple depending on architecture and null-bitmap
length, but the simplification in the writetup/readtup routines seems
worth it.
(relpages/reltuples). To do this, create formal support in heapam.c for
"overwrite" tuple updates (including xlog replay capability) and use that
instead of the ad-hoc overwrites we'd been using in VACUUM and CREATE INDEX.
Take the responsibility for updating stats during CREATE INDEX out of the
individual index AMs, and do it where it belongs, in catalog/index.c. Aside
from being more modular, this avoids having to update the same tuple twice in
some paths through CREATE INDEX. It's probably not measurably faster, but
for sure it's a lot cleaner than before.
into a single mostly-physical-order scan of the index. This requires some
ticklish interlocking considerations, but should create no material
performance impact on normal index operations (at least given the
already-committed changes to make scans work a page at a time). VACUUM
itself should get significantly faster in any index that's degenerated to a
very nonlinear page order. Also, we save one pass over the index entirely,
except in the case where there were no deletions to do and so only one pass
happened anyway.
Original patch by Heikki Linnakangas, rework by Tom Lane.
btgettuple and btgetmulti). This eliminates the problem of "re-finding" the
exact stopping point, since the stopping point is effectively always a page
boundary, and index items are never moved across pre-existing page boundaries.
A small penalty is that the keys_are_unique optimization is effectively
disabled (and, therefore, is removed in this patch), causing us to apply
_bt_checkkeys() to at least one more tuple than necessary when looking up a
unique key. However, the advantages for non-unique cases seem great enough to
accept this tradeoff. Aside from simplifying and (sometimes) speeding up the
indexscan code, this will allow us to reimplement btbulkdelete as a largely
sequential scan instead of index-order traversal, thereby significantly
reducing the cost of VACUUM. Those changes will come in a separate patch.
Original patch by Heikki Linnakangas, rework by Tom Lane.
This formulation requires every AM to provide amvacuumcleanup, unlike before,
but it's surely a whole lot cleaner. Also, add an 'amstorage' column to
pg_am so that we can get rid of hardwired knowledge in DefineOpClass().
thereby saving a visit to the metapage in most index searches/updates.
This wouldn't actually save any I/O (since in the old regime the metapage
generally stayed in cache anyway), but it does provide a useful decrease
in bufmgr traffic in high-contention scenarios. Per my recent proposal.
upper-level insertion completes a previously-seen split, we cannot simply grab
the downlink block number out of the buffer, because the buffer could contain
a later state of the page --- or perhaps the page doesn't even exist at all
any more, due to relation truncation. These possibilities have been masked up
to now because the use of full_page_writes effectively ensured that no xlog
replay routine ever actually saw a page state newer than its own change.
Since we're deprecating full_page_writes in 8.1.*, there's no need to fix this
in existing release branches, but we need a fix in HEAD if we want to have any
hope of re-allowing full_page_writes. Accordingly, adjust the contents of
btree WAL records so that we can always get the downlink block number from the
WAL record rather than having to depend on buffer contents. Per report from
Kevin Grittner and Peter Brant.
Improve a few comments in related code while at it.
incrementally by successive inserts rather than by sorting the data.
We were only using the slow path during bootstrap, apparently because
when first written it failed during bootstrap --- but it works fine now
AFAICT. Removing it saves a hundred or so lines of code and produces
noticeably (~10%) smaller initial states of the system catalog indexes.
While that won't make much difference for heavily-modified catalogs,
for the more static ones there may be a useful long-term performance
improvement.
misleadingly-named WriteBuffer routine, and instead require routines that
change buffer pages to call MarkBufferDirty (which does exactly what it says).
We also require that they do so before calling XLogInsert; this takes care of
the synchronization requirement documented in SyncOneBuffer. Note that
because bufmgr takes the buffer content lock (in shared mode) while writing
out any buffer, it doesn't matter whether MarkBufferDirty is executed before
the buffer content change is complete, so long as the content change is
completed before releasing exclusive lock on the buffer. So it's OK to set
the dirtybit before we fill in the LSN.
This eliminates the former kluge of needing to set the dirtybit in LockBuffer.
Aside from making the code more transparent, we can also add some new
debugging assertions, in particular that the caller of MarkBufferDirty must
hold the buffer content lock, not merely a pin.
This commit doesn't make much functional change, but it does eliminate some
duplicated code --- for instance, PageIsNew tests are now done inside
XLogReadBuffer rather than by each caller.
The GIST xlog code still needs a lot of love, but I'll worry about that
separately.
passed extend = true whenever we are reading a page we intend to reinitialize
completely, even if we think the page "should exist". This is because it
might indeed not exist, if the relation got truncated sometime after the
current xlog record was made and before the crash we're trying to recover
from. These two thinkos appear to explain both of the old bug reports
discussed here:
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php
when an error occurs during xlog replay. Also, replace the former risky
'write into a fixed-size buffer with no overflow detection' API for XLOG
record description routines; use an expansible StringInfo instead. (The
latter accounts for most of the patch bulk.)
Qingqing Zhou
we are not holding a buffer content lock; where it was, InterruptHoldoffCount
is positive and so we'd not respond to cancel signals as intended. Also
add missing vacuum_delay_point() call in btvacuumcleanup. This should fix
complaint from Evgeny Gridasov about failure to respond to SIGINT/SIGTERM
in a timely fashion (bug #2257).
during the vacuumcleanup scan that we're going to do anyway. Should
save a few cycles (one calculation per page, not per tuple) as well
as not having to depend on assumptions about heap and index being
in step.
I think this could probably be made to work for GIST too, but that
code looks messy enough that I'm disinclined to try right now.
partial. None of the existing AMs do anything useful except counting
tuples when there's nothing to delete, and we can get a tuple count
from the heap as long as it's not a partial index. (hash actually can
skip anyway because it maintains a tuple count in the index metapage.)
GIST is not currently able to exploit this optimization because, due to
failure to index NULLs, GIST is always effectively partial. Possibly
we should fix that sometime.
Simon Riggs w/ some review by Tom Lane.
just refer to btree index entries as plain IndexTuples, which is what
they have been for a very long time. This is mostly just an exercise
in removing extraneous notation, but it does save a palloc/pfree cycle
per index insertion.
and non-required keys in a btree index scan, mark the required scankeys
with private flag bits SK_BT_REQFWD and/or SK_BT_REQBKWD. This seems
at least marginally clearer to me, and it eliminates a wired-into-the-
data-structure assumption that required keys are consecutive. Even though
that assumption will remain true for the foreseeable future, having it
in there makes the code seem more complex than necessary.
are two basically different kinds of scankeys, and we ought to try harder
to indicate which is used in each place in the code. I've chosen the names
"search scankey" and "insertion scankey", though you could make about
as good an argument for "operator scankey" and "comparison function
scankey".
rather than "return expr;" -- the latter style is used in most of the
tree. I kept the parentheses when they were necessary or useful because
the return expression was complex.
use it. While it normally has been opened earlier during btree index
build, testing shows that it's possible for the link to be closed again
if an sinval reset occurs while the index is being built.
_bt_checkkeys(), instead of checking it in the top-level nbtree.c routines
as formerly. This saves a little bit of loop overhead, but more importantly
it lets us skip performing the index key comparisons for dead tuples.
checks, which were once needed because PageGetMaxOffsetNumber would
fail on empty pages, but are now just redundant. Also, don't set up
local variables that aren't needed in the fast path --- most of the
time, we only need to advance offnum and not step across a page boundary.
Motivated by noticing _bt_step at the top of OProfile profile for a
pgbench run.
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
on every index page they read; in particular to catch the case of an
all-zero page, which PageHeaderIsValid allows to pass. It turns out
hash already had this idea, but it was just Assert()ing things rather
than doing a straight error check, and the Asserts were partially
redundant with PageHeaderIsValid anyway. Per recent failure example
from Jim Nasby. (gist still needs the same treatment.)
the wrong buffer dirty when trying to kill a dead index entry that's on
a page after the one it started on. No risk of data corruption, just
inefficiency, but still a bug.
generated by bitmap index scans. Along the way, simplify and speed up
the code for counting sequential and index scans; it was both confusing
and inefficient to be taking care of that in the per-tuple loops, IMHO.
initdb forced because of internal changes in pg_stat view definitions.
on a page, as suggested by ITAGAKI Takahiro. Also, change a few places
that were using some other estimates of max-items-per-page to consistently
use MaxOffsetNumber. This is conservatively large --- we could have used
the new MaxHeapTuplesPerPage macro, or a similar one for index tuples ---
but those places are simply declaring a fixed-size buffer and assuming it
will work, rather than actively testing for overrun. It seems safer to
size these buffers in a way that can't overflow even if the page is
corrupt.
scankeys arrays that it needs can never have more than INDEX_MAX_KEYS
entries, so it's reasonable to just allocate them as fixed-size local
arrays, and save the cost of palloc/pfree. Not a huge savings, but
a cycle saved is a cycle earned ...
nonconsecutive columns of a multicolumn index, as per discussion around
mid-May (pghackers thread "Best way to scan on-disk bitmaps"). This
turns out to require only minimal changes in btree, and so far as I can
see none at all in GiST. btcostestimate did need some work, but its
original assumption that index selectivity == heap selectivity was
quite bogus even before this.
up have the standard layout with unused space between pd_lower and pd_upper.
When this is set, XLogInsert will omit the unused space without bothering
to scan it to see if it's zero. That saves time in XLogInsert, and also
allows reversion of my earlier patch to make PageRepairFragmentation et al
explicitly re-zero freed space. Per suggestion by Heikki Linnakangas.
Instead of a separate CRC on each backup block, include backup blocks
in their parent WAL record's CRC; this is important to ensure that the
backup block really goes with the WAL record, ie there was not a page
tear right at the start of the backup block. Implement a simple form
of compression of backup blocks: drop any run of zeroes starting at
pd_lower, so as not to store the unused 'hole' that commonly exists in
PG heap and index pages. Tweak PageRepairFragmentation and related
routines to ensure they keep the unused space zeroed, so that the above
compression method remains effective. All per recent discussions.
methods: they all invoke UpdateStats() since they have computed the
number of heap tuples, so I created a function in catalog/index.c that
each AM now calls.
and VACUUM: in the interval between adding a new page to the relation
and formatting it, it was possible for VACUUM to come along and decide
it should format the page too. Though not harmful in itself, this would
cause data loss if a third transaction were able to insert tuples into
the vacuumed page before the original extender got control back.
which is neither needed by nor related to that header. Remove the bogus
inclusion and instead include the header in those C files that actually
need it. Also fix unnecessary inclusions and bad inclusion order in
tsearch2 files.
Essentially, we shoehorn in a lockable-object-type field by taking
a byte away from the lockmethodid, which can surely fit in one byte
instead of two. This allows less artificial definitions of all the
other fields of LOCKTAG; we can get rid of the special pg_xactlock
pseudo-relation, and also support locks on individual tuples and
general database objects (including shared objects). None of those
possibilities are actually exploited just yet, however.
I removed pg_xactlock from pg_class, but did not force initdb for
that change. At this point, relkind 's' (SPECIAL) is unused and
could be removed entirely.
change saves a great deal of space in pg_proc and its primary index,
and it eliminates the former requirement that INDEX_MAX_KEYS and
FUNC_MAX_ARGS have the same value. INDEX_MAX_KEYS is still embedded
in the on-disk representation (because it affects index tuple header
size), but FUNC_MAX_ARGS is not. I believe it would now be possible
to increase FUNC_MAX_ARGS at little cost, but haven't experimented yet.
There are still a lot of vestigial references to FUNC_MAX_ARGS, which
I will clean up in a separate pass. However, getting rid of it
altogether would require changing the FunctionCallInfoData struct,
and I'm not sure I want to buy into that.
access: define new index access method functions 'amgetmulti' that can
fetch multiple TIDs per call. (The functions exist but are totally
untested as yet.) Since I was modifying pg_am anyway, remove the
no-longer-needed 'rel' parameter from amcostestimate functions, and
also remove the vestigial amowner column that was creating useless
work for Alvaro's shared-object-dependencies project.
Initdb forced due to changes in pg_am.
PageIndexTupleDelete() with a single pass of compactification ---
logic mostly lifted from PageRepairFragmentation. I noticed while
profiling that a VACUUM that's cleaning up a whole lot of deleted
tuples would spend as much as a third of its CPU time in
PageIndexTupleDelete; not too surprising considering the loop method
was roughly O(N^2) in the number of tuples involved.
convention for isnull flags. Also, remove the useless InsertIndexResult
return struct from index AM aminsert calls --- there is no reason for
the caller to know where in the index the tuple was inserted, and we
were wasting a palloc cycle per insert to deliver this uninteresting
value (plus nontrivial complexity in some AMs).
I forced initdb because of the change in the signature of the aminsert
routines, even though nothing really looks at those pg_proc entries...
to write out data that we are about to tell the filesystem to drop.
smgr_internal_unlink already had a DropRelFileNodeBuffers call to
get rid of dead buffers without a write after it's no longer possible
to roll back the deleting transaction. Adding a similar call in
smgrtruncate simplifies callers and makes the overall division of
labor clearer. This patch removes the former behavior that VACUUM
would write all dirty buffers of a relation unconditionally.
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
a relation's number of blocks, rather than the possibly-obsolete value
in pg_class.relpages. Scale the value in pg_class.reltuples correspondingly
to arrive at a hopefully more accurate number of rows. When pg_class
contains 0/0, estimate a tuple width from the column datatypes and divide
that into current file size to estimate number of rows. This improved
methodology allows us to jettison the ancient hacks that put bogus default
values into pg_class when a table is first created. Also, per a suggestion
from Simon, make VACUUM (but not VACUUM FULL or ANALYZE) adjust the value
it puts into pg_class.reltuples to try to represent the mean tuple density
instead of the minimal density that actually prevails just after VACUUM.
These changes alter the plans selected for certain regression tests, so
update the expected files accordingly. (I removed join_1.out because
it's not clear if it still applies; we can add back any variant versions
as they are shown to be needed.)
Rather than using ReadBuffer() to increment the reference count on an
already-pinned buffer, we should use IncrBufferRefCount() as it is
faster and does not require acquiring the BufMgrLock.
in all cases when keep_buf = true. This allows ANALYZE's inner loop to
use heap_release_fetch, which saves multiple buffer lookups for the same
page and avoids overestimation of cost by the vacuum cost mechanism.
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00464.php.
This fix is intended to be permanent: it moves the responsibility for
calling SetBufferCommitInfoNeedsSave() into the tqual.c routines,
eliminating the requirement for callers to test whether t_infomask changed.
Also, tighten validity checking on buffer IDs in bufmgr.c --- several
routines were paranoid about out-of-range shared buffer numbers but not
about out-of-range local ones, which seems a tad pointless.
relcache entries. Also, change TransactionIdIsCurrentTransactionId()
so that if consulted during transaction abort, it will not say that
the aborted xact is still current. (It would be better to ensure that
it's never called at all during abort, but I'm not sure we can easily
guarantee that.) In combination, these fix a crash we have seen
occasionally during parallel regression tests of 8.0.
value of 'start' could be past the end of the page, if the page was
split by some concurrent inserting process since we visited it. In
this situation the code could look at bogus entries and possibly find
a match (since after all those entries still contain what they had
before the split). This would lead to 'specified item offset is too large'
followed by 'PANIC: failed to add item to the page', as reported by Joe
Conway for scenarios involving heavy concurrent insertion activity.
of XLogInsert had the same sort of checkpoint interlock problem as
RecordTransactionCommit, and indeed I found some. Btree index build
and ALTER TABLE SET TABLESPACE write data outside the friendly confines
of the buffer manager, and therefore they have to take their own
responsibility for checkpoint interlock. The easiest solution seems to
be to force smgrimmedsync at the end of the index build or table copy,
even when the operation is being WAL-logged. This is sufficient since
the new index or table will be of interest to no one if we don't get
as far as committing the current transaction.
recovery more manageable. Also, undo recent change to add FILE_HEADER
and WASTED_SPACE records to XLOG; instead make the XLOG page header
variable-size with extra fields in the first page of an XLOG file.
This should fix the boundary-case bugs observed by Mark Kirkwood.
initdb forced due to change of XLOG representation.
keep track of portal-related resources separately from transaction-related
resources. This allows cursors to work in a somewhat sane fashion with
nested transactions. For now, cursor behavior is non-subtransactional,
that is a cursor's state does not roll back if you abort a subtransaction
that fetched from the cursor. We might want to change that later.
shift support code into heapam.c accordingly. This is in service of
soon-to-be-committed ALTER TABLE SET TABLESPACE code that will want to
use this same record type for both heaps and indexes.
Theoretically I should have forced initdb for this, but in practice there
is no change in xlog contents because CVS tip will never really emit this
record type anyhow...
There are various things left to do: contrib dbsize and oid2name modules
need work, and so does the documentation. Also someone should think about
COMMENT ON TABLESPACE and maybe RENAME TABLESPACE. Also initlocation is
dead, it just doesn't know it yet.
Gavin Sherry and Tom Lane.
locking conflict against concurrent CHECKPOINT that was discussed a few
weeks ago. Also, if not using WAL archiving (which is always true ATM
but won't be if PITR makes it into this release), there's no need to
WAL-log the index build process; it's sufficient to force-fsync the
completed index before commit. This seems to gain about a factor of 2
in my tests, which is consistent with writing half as much data. I did
not try it with WAL on a separate drive though --- probably the gain would
be a lot less in that scenario.
rather than an error code, and does elog(ERROR) not elog(WARNING)
when it detects a problem. All callers were simply elog(ERROR)'ing on
failure return anyway, and I find it hard to envision a caller that would
not, so we may as well simplify the callers and produce the more useful
error message directly.
In the past, we used a 'Lispy' linked list implementation: a "list" was
merely a pointer to the head node of the list. The problem with that
design is that it makes lappend() and length() linear time. This patch
fixes that problem (and others) by maintaining a count of the list
length and a pointer to the tail node along with each head node pointer.
A "list" is now a pointer to a structure containing some meta-data
about the list; the head and tail pointers in that structure refer
to ListCell structures that maintain the actual linked list of nodes.
The function names of the list API have also been changed to, I hope,
be more logically consistent. By default, the old function names are
still available; they will be disabled-by-default once the rest of
the tree has been updated to use the new API names.
costing us lots more to maintain than it was worth. On shared tables
it was of exactly zero benefit because we couldn't trust it to be
up to date. On temp tables it sometimes saved an lseek, but not often
enough to be worth getting excited about. And the real problem was that
we forced an lseek on every relcache flush in order to update the field.
So all in all it seems best to lose the complexity.
the next are handled by ReleaseAndReadBuffer rather than separate
ReleaseBuffer and ReadBuffer calls. This cuts the number of acquisitions
of the BufMgrLock by a factor of 2 (possibly more, if an indexscan happens
to pull successive rows from the same heap page). Unfortunately this
doesn't seem enough to get us out of the recently discussed context-switch
storm problem, but it's surely worth doing anyway.
subroutine in src/port/pgsleep.c. Remove platform dependencies from
miscadmin.h and put them in port.h where they belong. Extend recent
vacuum cost-based-delay patch to apply to VACUUM FULL, ANALYZE, and
non-btree index vacuuming.
By the way, where is the documentation for the cost-based-delay patch?
the relcache, and so the notion of 'blind write' is gone. This should
improve efficiency in bgwriter and background checkpoint processes.
Internal restructuring in md.c to remove the not-very-useful array of
MdfdVec objects --- might as well just use pointers.
Also remove the long-dead 'persistent main memory' storage manager (mm.c),
since it seems quite unlikely to ever get resurrected.
Make btree index creation and initial validation of foreign-key constraints
use maintenance_work_mem rather than work_mem as their memory limit.
Add some code to guc.c to allow these variables to be referenced by their
old names in SHOW and SET commands, for backwards compatibility.
pointer type when it is not necessary to do so.
For future reference, casting NULL to a pointer type is only necessary
when (a) invoking a function AND either (b) the function has no prototype
OR (c) the function is a varargs function.
to step more than one entry after descending the search tree to arrive at
the correct place to start the scan. This can improve the behavior
substantially when there are many entries equal to the chosen boundary
value. Per suggestion from Dmitry Tkach, 14-Jul-03.
some concurrent changes Jan was making to the bufmgr. Here's an
updated version of the patch -- it should apply cleanly to CVS
HEAD and passes the regression tests.
This patch makes the following changes:
- remove the UnlockAndReleaseBuffer() and UnlockAndWriteBuffer()
macros, and replace uses of them with calls to the appropriate
functions.
- remove a bunch of #ifdef BMTRACE code: it is ugly & broken
(i.e. it doesn't compile)
- make BufferReplace() return a bool, not an int
- cleanup some logic in bufmgr.c; should be functionality
equivalent to the previous code, just cleaner now
- remove the BM_PRIVATE flag as it is unused
- improve a few comments, etc.
pghackers proposal of 8-Nov. All the existing cross-type comparison
operators (int2/int4/int8 and float4/float8) have appropriate support.
The original proposal of storing the right-hand-side datatype as part of
the primary key for pg_amop and pg_amproc got modified a bit in the event;
it is easier to store zero as the 'default' case and only store a nonzero
when the operator is actually cross-type. Along the way, remove the
long-since-defunct bigbox_ops operator class.
Remove the 'strategy map' code, which was a large amount of mechanism
that no longer had any use except reverse-mapping from procedure OID to
strategy number. Passing the strategy number to the index AM in the
first place is simpler and faster.
This is a preliminary step in planned support for cross-datatype index
operations. I'm committing it now since the ScanKeyEntryInitialize()
API change touches quite a lot of files, and I want to commit those
changes before the tree drifts under me.
invalid (has the wrong magic number) until the build is entirely
complete. This turns out to cost no additional writes in the normal
case, since we were rewriting the metapage at the end of the process
anyway. In normal scenarios there's no real gain in security, because
a failed index build would roll back the transaction leaving an unused
index file, but for rebuilding shared system indexes this seems to add
some useful protection.
killed items; just skip to the next item immediately. Only check for
key equality when we reach a non-killed item or the end of the index
page. This saves key comparisons when there are lots of killed items,
as for example in a heavily-updated table that's not been vacuumed lately.
Seems to be a win for pgbench anyway.
index pages: when _bt_getbuf asks the FSM for a free index page, it is
possible (and, in some cases, even moderately likely) that the answer
will be the same page that _bt_split is trying to split. _bt_getbuf
already knew that the returned page might not be free, but it wasn't
prepared for the possibility that even trying to lock the page could
be problematic. Fix by doing a conditional rather than unconditional
grab of the page lock.
bottom. Otherwise we fail to moveright when the root page was split while
we were "in flight" to it. This is not a significant problem when the root
is above the leaf level, but if the root was also a leaf (ie, a single-page
index just got split) we may return the wrong leaf page to the caller,
resulting in failure to find a key that is in fact present. Bug has existed
at least since 7.1, probably forever.
NULL key pointer, indicating that the existing scan key should be reused.
This behavior isn't used yet but will be needed for my planned fix to
the keys_are_unique code.
Adjustable threshold is gone in favor of keeping track of total requested
page storage and doling out proportional fractions to each relation
(with a minimum amount per relation, and some quantization of the results
to avoid thrashing with small changes in page counts). Provide special-
case code for indexes so as not to waste space storing useless page
free space counts. Restructure internal data storage to be a flat array
instead of list-of-chunks; this may cost a little more work in data
copying when reorganizing, but allows binary search to be used during
lookup_fsm_page_entry().
end of a btree index. This isn't super-effective, since we won't move
nondeletable pages, but it's better than nothing. Also, improve stats
displayed during VACUUM VERBOSE.
deleting multiple index entries on a single index page. This makes for
a very substantial reduction in the amount of WAL traffic during a
large delete operation.
now knows what to do upon hitting a dead page (in theory anyway, it's
untested...). Add a post-VACUUM-cleanup entry point for index AMs, to
provide a place for dead-page scavenging to happen.
Also, fix oversight that broke btpo_prev links in temporary indexes.
initdb forced due to additions in pg_am.
support btree compaction, as per proposal of a few days ago. btree index
pages no longer store parent links, instead they have a level indicator
(counting up from zero for leaf pages). The FixBTree recovery logic is
removed, and replaced by code that detects missing parent-level insertions
during WAL replay. Also, generate appropriate WAL entries when updating
btree metapage and when building a btree index from scratch. I believe
btree indexes are now completely WAL-legal for the first time.
initdb forced due to index and WAL changes.
item, if the page containing the current item is split while the indexscan
is stopped and holds no read-lock on the page. The current item might
move right onto a page that the indexscan holds no pin on. In the prior
code this would allow btbulkdelete to reach and possibly delete the item,
causing 'my bits moved right off the end of the world!' when the indexscan
finally resumes. Fix by chaining read-locks to the right during
_bt_restscan and requiring btbulkdelete to LockBufferForCleanup on every
page it scans, not only those with deletable items. Per my pghackers
message of 25-May-02. (Too bad no one could think of a better way.)
offset past the last-used-item-plus-one, since that would result in
leaving uninitialized holes in the item pointer array. AFAICT the only
place that was depending on this was btree index build, which was being
cavalier about when to fill in the P_HIKEY pointer; easily fixed.
Also a small performance improvement: shuffle itemid's by means of
memmove, not a one-at-a-time loop.
The local buffer manager is no longer used for newly-created relations
(unless they are TEMP); a new non-TEMP relation goes through the shared
bufmgr and thus will participate normally in checkpoints. But TEMP relations
use the local buffer manager throughout their lifespan. Also, operations
in TEMP relations are not logged in WAL, thus improving performance.
Since it's no longer necessary to fsync relations as they move out of the
local buffers into shared buffers, quite a lot of smgr.c/md.c/fd.c code
is no longer needed and has been removed: there's no concept of a dirty
relation anymore in md.c/fd.c, and we never fsync anything but WAL.
Still TODO: improve local buffer management algorithms so that it would
be reasonable to increase NLocBuffer.
all places, where pd_linp is accessed. Also introduce new macros
SizeOfPageHeaderData and BTMaxItemSize. This is just source code
cosmetic, no behaviour changed.
Manfred Koizar
transaction, so as to avoid returning them out of the index AM. Saves
repeated heap_fetch operations on frequently-updated rows. Also detect
queries on unique keys (equality to all columns of a unique index), and
don't bother continuing scan once we have found first match.
Killing is implemented in the btree and hash AMs, but not yet in rtree
or gist, because there isn't an equally convenient place to do it in
those AMs (the outer amgetnext routine can't do it without re-pinning
the index page).
Did some small cleanup on APIs of HeapTupleSatisfies, heap_fetch, and
index_insert to make this a little easier.
yesterday's proposal to pghackers. Also remove unnecessary parameters
to heap_beginscan, heap_rescan. I modified pg_proc.h to reflect the
new numbers of parameters for the AM interface routines, but did not
force an initdb because nothing actually looks at those fields.
o Change all current CVS messages of NOTICE to WARNING. We were going
to do this just before 7.3 beta but it has to be done now, as you will
see below.
o Change current INFO messages that should be controlled by
client_min_messages to NOTICE.
o Force remaining INFO messages, like from EXPLAIN, VACUUM VERBOSE, etc.
to always go to the client.
o Remove INFO from the client_min_messages options and add NOTICE.
Seems we do need three non-ERROR elog levels to handle the various
behaviors we need for these messages.
Regression passed.
now just below FATAL in server_min_messages. Added more text to
highlight ordering difference between it and client_min_messages.
---------------------------------------------------------------------------
REALLYFATAL => PANIC
STOP => PANIC
New INFO level the prints to client by default
New LOG level the prints to server log by default
Cause VACUUM information to print only to the client
NOTICE => INFO where purely information messages are sent
DEBUG => LOG for purely server status messages
DEBUG removed, kept as backward compatible
DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1 added
DebugLvl removed in favor of new DEBUG[1-5] symbols
New server_min_messages GUC parameter with values:
DEBUG[5-1], INFO, NOTICE, ERROR, LOG, FATAL, PANIC
New client_min_messages GUC parameter with values:
DEBUG[5-1], LOG, INFO, NOTICE, ERROR, FATAL, PANIC
Server startup now logged with LOG instead of DEBUG
Remove debug_level GUC parameter
elog() numbers now start at 10
Add test to print error message if older elog() values are passed to elog()
Bootstrap mode now has a -d that requires an argument, like postmaster
to prevent spreading of corruption when page header pointers are bad.
Merge PageZero into PageInit, since it was never used separately, and
remove separate memset calls used at most other PageInit call points.
Remove IndexPageCleanup, which wasn't used at all.
to insert the same key into a supposedly unique index. The bug is of
low probability, and may not explain any of the recent reports of
duplicated rows; but a bug is a bug.
where rightmost index page splits while we are waiting to obtain exclusive
lock on it. Not clear this would actually hurt (probably the callback
would always fail), but better safe than sorry.
Also, improve comments describing concurrency considerations in this code.
lookup info in the relcache for index access method support functions.
This makes a huge difference for dynamically loaded support functions,
and should save a few cycles even for built-in ones. Also tweak dfmgr.c
so that load_external_function is called only once, not twice, when
doing fmgr_info for a dynamically loaded function. All per performance
gripe from Teodor Sigaev, 5-Oct-01.
rightmost on its tree level, we split 2/3 to the left and 1/3 to the
new right page, rather than the even split we use elsewhere. The idea
is that when faced with a steadily increasing series of inserted keys
(such as sequence or timestamp values), we'll end up with a btree that's
about 2/3ds full not 1/2 full, which is much closer to the desired
steady-state load for a btree. Per suggestion from Ann Harrison of
IBPhoenix.
per previous discussion on pghackers. Most of the duplicate code in
different AMs' ambuild routines has been moved out to a common routine
in index.c; this means that all index types now do the right things about
inserting recently-dead tuples, etc. (I also removed support for EXTEND
INDEX in the ambuild routines, since that's about to go away anyway, and
it cluttered the code a lot.) The retail indextuple deletion routines have
been replaced by a "bulk delete" routine in which the indexscan is inside
the access method. I haven't pushed this change as far as it should go yet,
but it should allow considerable simplification of the internal bookkeeping
for deletions. Also, add flag columns to pg_am to eliminate various
hardcoded tests on AM OIDs, and remove unused pg_am columns.
Fix rtree and gist index types to not attempt to store NULLs; before this,
gist usually crashed, while rtree managed not to crash but computed wacko
bounding boxes for NULL entries (which might have had something to do with
the performance problems we've heard about occasionally).
Add AtEOXact routines to hash, rtree, and gist, all of which have static
state that needs to be reset after an error. We discovered this need long
ago for btree, but missed the other guys.
Oh, one more thing: concurrent VACUUM is now the default.
do anything yet, but it has the necessary connections to initialization
and so forth. Make some gestures towards allowing number of blocks in
a relation to be BlockNumber, ie, unsigned int, rather than signed int.
(I doubt I got all the places that are sloppy about it, yet.) On the
way, replace the hardwired NLOCKS_PER_XACT fudge factor with a GUC
variable.
a separate statement (though it can still be invoked as part of VACUUM, too).
pg_statistic redesigned to be more flexible about what statistics are
stored. ANALYZE now collects a list of several of the most common values,
not just one, plus a histogram (not just the min and max values). Random
sampling is used to make the process reasonably fast even on very large
tables. The number of values and histogram bins collected is now
user-settable via an ALTER TABLE command.
There is more still to do; the new stats are not being used everywhere
they could be in the planner. But the remaining changes for this project
should be localized, and the behavior is already better than before.
A not-very-related change is that sorting now makes use of btree comparison
routines if it can find one, rather than invoking '<' twice.
give consistent results for all datatypes. Types float4, float8, and
numeric were broken for NaN values; abstime, timestamp, and interval
were broken for INVALID values; timetz was just plain broken (some
possible pairs of values were neither < nor = nor >). Also clean up
text, bpchar, varchar, and bit/varbit to eliminate duplicate code and
thereby reduce the probability of similar inconsistencies arising in
the future.
allocated by plan nodes are not leaked at end of query. This doesn't
really matter for normal queries, but it sure does for queries invoked
repetitively inside SQL functions. Clean up some other grotty code
associated with tupdescs, and fix a few other memory leaks exposed by
tests with simple SQL functions.
and new root page if old root one was splitted but new root page
wasn't created.
New code is protected by FixBTree bool flag setted to FALSE, so
nothing should be affected by this untested approach.
are treated more like 'cancel' interrupts: the signal handler sets a
flag that is examined at well-defined spots, rather than trying to cope
with an interrupt that might happen anywhere. See pghackers discussion
of 1/12/01.
are now critical sections, so as to ensure die() won't interrupt us while
we are munging shared-memory data structures. Avoid insecure intermediate
states in some code that proc_exit will call, like palloc/pfree. Rename
START/END_CRIT_CODE to START/END_CRIT_SECTION, since that seems to be
what people tend to call them anyway, and make them be called with () like
a function call, in hopes of not confusing pg_indent.
I doubt that this is sufficient to make SIGTERM safe anywhere; there's
just too much code that could get invoked during proc_exit().
included by everything that includes bufmgr.h --- it's supposed to be
internals, after all, not part of the API! This fixes the conflict
against FreeBSD headers reported by Rosenman, by making it unnecessary
for s_lock.h to be included by plperl.c.
Context diff this time.
Remove -m486 compile args for FreeBSD-i386, compile -O2 on i386.
Compile with only -O on alpha for codegen safety.
Make the port use the TEST_AND_SET for alpha and i386 on FreeBSD.
Fix a lot of bogus string formats for outputting pointers (cast to int
and %u/%x replaced with no cast and %p), and 'Size'(size_t) are now
cast to 'unsigned long' and output with %lu/
Remove an unused variable.
Alfred Perlstein
(WAL logging for this is not done yet, however.) Clean up a number of really
crufty things that are no longer needed now that DROP behaves nicely. Make
temp table mapper do the right things when drop or rename affecting a temp
table is rolled back. Also, remove "relation modified while in use" error
check, in favor of locking tables at first reference and holding that lock
throughout the statement.
actually, but who could understand it with no comments? Fix bug
while at it: _bt_orderkeys would try to invoke comparisons on
NULL inputs, given the right sort of redundant quals.
left keys during bottom-up index build, and leave some free space
instead of packing the pages to the brim (so as to avoid vast numbers
of page splits during the first interactive insertions).
duplicate keys by letting search go to the left rather than right when an
equal key is seen at an upper tree level. Fix poor choice of page split
point (leading to insertion failures) that was forced by chaining logic.
Don't store leftmost key in non-leaf pages, since it's not necessary.
Don't create root page until something is first stored in the index, so an
unused index is now 8K not 16K. (Doesn't seem to be as easy to get rid of
the metadata page, unfortunately.) Massive cleanup of unreadable code,
fix poor, obsolete, and just plain wrong documentation and comments.
See src/backend/access/nbtree/README for the gory details.
pass-by-ref data types --- eg, an index on lower(textfield) --- no longer
leak memory during index creation or update. Clean up a lot of redundant
code ... did you know that copy, vacuum, truncate, reindex, extend index,
and bootstrap each basically duplicated the main executor's logic for
extracting information about an index and preparing index entries?
Functional indexes should be a little faster now too, due to removal
of repeated function lookups.
CREATE INDEX 'opt_type' clause is deimplemented by these changes,
but I haven't removed it from the parser yet (need to merge with
Thomas' latest change set first).
memory contexts. Currently, only leaks in expressions executed as
quals or projections are handled. Clean up some old dead cruft in
executor while at it --- unused fields in state nodes, that sort of thing.
entries now for int8 and network hash indexes. int24_ops and int42_ops
are gone. pg_opclass no longer contains multiple entries claiming to be
the default opclass for the same datatype. opr_sanity regress test
extended to catch errors like these in the future.
passing the index-is-unique flag to index build routines (duh! ...
why wasn't it done this way to begin with?). Aside from eliminating
an eyesore, this should save a few milliseconds in btree index creation
because a full scan of pg_index is not needed any more.
--- ie, they're only called for side-effects. Add a PG_RETURN_VOID()
macro and use it where appropriate. This probably doesn't change the
machine code by a single bit ... it's just for documentation.
inputs have been converted to newstyle. This should go a long way towards
fixing our portability problems with platforms where char and short
parameters are passed differently from int-width parameters. Still
more to do for the Alpha port however.
That means you can now set your options in either or all of $PGDATA/configuration,
some postmaster option (--enable-fsync=off), or set a SET command. The list of
options is in backend/utils/misc/guc.c, documentation will be written post haste.
pg_options is gone, so is that pq_geqo config file. Also removed were backend -K,
-Q, and -T options (no longer applicable, although -d0 does the same as -Q).
Added to configure an --enable-syslog option.
changed all callers from TPRINTF to elog(DEBUG)
running gcc and HP's cc with warnings cranked way up. Signed vs unsigned
comparisons, routines declared static and then defined not-static,
that kind of thing. Tedious, but perhaps useful...
appropriate btree three-way comparison routine. Not clear why the
three-way comparison routines were being used in some paths and not
others in btree --- incomplete changes by someone long ago, maybe?
Anyway, this makes for a nice speedup in CREATE INDEX.
from a constraint condition does not violate the constraint (cf. discussion
on pghackers 12/9/99). Implemented by adding a parameter to ExecQual,
specifying whether to return TRUE or FALSE when the qual result is
really NULL in three-valued boolean logic. Currently, ExecRelCheck is
the only caller that asks for TRUE, but if we find any other places that
have the wrong response to NULL, it'll be easy to fix them.
Make all system indexes unique.
Make all cache loads use system indexes.
Rename *rel to *relid in inheritance tables.
Rename cache names to be clearer.
a generalized module 'tuplesort.c' that can sort either HeapTuples or
IndexTuples, and is not tied to execution of a Sort node. Clean up
memory leakages in sorting, and replace nbtsort.c's private implementation
of mergesorting with calls to tuplesort.c.
is used to find start scan position of Indexscan-s.
To speed up finding scan start position,I have changed
_bt_first() to use as many keys as possible.
I'll attach the patch here.
Regards.
Hiroshi Inoue
additional argument specifying the kind of lock to acquire/release (or
'NoLock' to do no lock processing). Ensure that all relations are locked
with some appropriate lock level before being examined --- this ensures
that relevant shared-inval messages have been processed and should prevent
problems caused by concurrent VACUUM. Fix several bugs having to do with
mismatched increment/decrement of relation ref count and mismatched
heap_open/close (which amounts to the same thing). A bogus ref count on
a relation doesn't matter much *unless* a SI Inval message happens to
arrive at the wrong time, which is probably why we got away with this
sloppiness for so long. Repair missing grab of AccessExclusiveLock in
DROP TABLE, ALTER/RENAME TABLE, etc, as noted by Hiroshi.
Recommend 'make clean all' after pulling this update; I modified the
Relation struct layout slightly.
Will post further discussion to pghackers list shortly.
Also, move responsibility for calling vc_abort into main xact.c list of
things-to-call-at-abort. What in the world was it doing down inside of
TransactionIdAbort()?
care of equal-key cases, eliminating bt_firsteq(). The linear search
formerly done by bt_firsteq() took a lot of time in the case where many
equal keys appear on the same page.
indexes.
1. Index Scan using plural indexids never scan backward
as to the order of indexids.
2. The cursor using Index scan is not usable after moving
past the end.
This patch solves above bugs.
Moreover the change of _bt_first() would be useful to extend
ORDER BY patch by Jan Wieck for all descending order cases.
Hiroshi Inoue
2. Much faster btree tuples deletion in the case when first on page
index tuple is deleted (no movement to the left page(s)).
3. Remember blkno of new root page in BTPageOpaque of
left/right siblings when root page is splitted.
I would like some feedback on what the hash function for the int8 hash
function
in the ./backend/access/hash/hashfunc.c should return.
Also, could someone (maybe Tomas Lockhart?) look-over the patch and make
sure
the system table entries are correct? I've tried to research them as
much as I
could, but some of them are still not clear to me.
Thanks,
-Ryan
Ok. I made patches replacing all of "#if FALSE" or "#if 0" to "#ifdef
NOT_USED" for current. I have tested these patches in that the
postgres binaries are identical.
> tprintf.patch
>
> tprintf.patch
>
> adds functions and macros which implement a conditional trace package
> with the ability to change flags and numeric options of running
> backends at runtime.
> Options/flags can be specified in the command line and/or read from
> the file pg_options in the data directory.
no longer returns buffer pointer, can be gotten from scan;
descriptor; bootstrap can create multi-key indexes;
pg_procname index now is multi-key index; oidint2, oidint4, oidname
are gone (must be removed from regression tests); use System Cache
rather than sequential scan in many places; heap_modifytuple no
longer takes buffer parameter; remove unused buffer parameter in
a few other functions; oid8 is not index-able; remove some use of
single-character variable names; cleanup Buffer variables usage
and scan descriptor looping; cleaned up allocation and freeing of
tuples; 18k lines of diff;
1. Removes the unnecessary "#define AbcRegProcedure 123"'s from
pg_proc.h.
2. Changes those #defines to use the names already defined in
fmgr.h.
3. Forces the make of fmgr.h in backend/Makefile instead of having
it
made as a dependency in access/common/Makefile *hack*hack*hack*
4. Rearranged the #includes to a less helter-skelter arrangement,
also
changing <file.h> to "file.h" to signify a non-system header.
5. Removed "pg_proc.h" from files where its only purpose was for
the
#defines removed in item #1.
6. Added "fmgr.h" to each file changed for completeness sake.
Turns out that #6 was not necessary for some files because fmgr.h
was being included in a roundabout way SIX levels deep by the first
include.
"access/genam.h"
->"access/relscan.h"
->"utils/rel.h"
->"access/strat.h"
->"access/skey.h"
->"fmgr.h"
So adding fmgr.h really didn't add anything to the compile, hopefully
just made it clearer to the programmer.
S Darren.
Attached you'll find a (big) patch that fixes make dep and make
depend in all Makefiles where I found it to be appropriate.
It also removes the dependency in Makefile.global for NAMEDATALEN
and OIDNAMELEN by making backend/catalog/genbki.sh and bin/initdb/initdb.sh
a little smarter.
This no longer requires initdb.sh that is turned into initdb with
a sed script when installing Postgres, hence initdb.sh should be
renamed to initdb (after the patch has been applied :-) )
This patch is against the 6.3 sources, as it took a while to
complete.
Please review and apply,
Cheers,
Jeroen van Vianen
1. Remove the char2, char4, char8 and char16 types from postgresql
2. Change references of char16 to name in the regression tests.
3. Rename the char16.sql regression test to name.sql. 4. Modify
the regression test scripts and outputs to match up.
Might require new regression.{SYSTEM} files...
Darren King
#define TAPETEMP "pg_btsortXXXXXX"
to:
#define TAPETEMP "pg_btsortXXXXXXX"
For some reason, under FreeBSD, it appears that the mktemp() value needs the
extra 'X' to improve/ensure uniqueness
Patch by: wieck@sapserv.debis.de (Jan Wieck)
One of the design rules of PostgreSQL is extensibility. And
to follow this rule means (at least for me) that there should
not only be a builtin PL. Instead I would prefer a defined
interface for PL implemetations.
==========================================
What follows is a set of diffs that cleans up the usage of BLCKSZ.
As a side effect, the person compiling the code can change the
value of BLCKSZ _at_their_own_risk_. By that, I mean that I've
tried it here at 4096 and 16384 with no ill-effects. A value
of 4096 _shouldn't_ affect much as far as the kernel/file system
goes, but making it bigger than 8192 can have severe consequences
if you don't know what you're doing. 16394 worked for me, _BUT_
when I went to 32768 and did an initdb, the SCSI driver broke and
the partition that I was running under went to hell in a hand
basket. Had to reboot and do a good bit of fsck'ing to fix things up.
The patch can be safely applied though. Just leave BLCKSZ = 8192
and everything is as before. It basically only cleans up all of the
references to BLCKSZ in the code.
If this patch is applied, a comment in the config.h file though above
the BLCKSZ define with warning about monkeying around with it would
be a good idea.
Darren darrenk@insightdist.com
(Also cleans up some of the #includes in files referencing BLCKSZ.)
==========================================
Essentially, this cleans things up so that if PORTNAME isn't defined (I'm
working on getting rid of it for FreeBSD, at least, to see if its possible)
none of the PORTNAME related stuff gets passed around.
Had a little bit of -I related redundancy as well
Subject: [PATCHES] Re: [PORTS] AIX 6.1 fixes...
Here are the patches for the two things that wouldn't make it thru the AIX
compiler. The geo_ops.c change is harmless I believe. The nbtcompare.c patch
fixes me, but I don't know about any other ports. Maybe wait on that one
until Vadim decides what to do about the unsigned vs signed chars varlena
issue.
when btree used in innerscan with run-time key which value
passed by pointer.
Fix: keys ordering stuff moved to _bt_first().
Pointed by Thomas Lockhart.
index tuple (logical position within A LEVEL). bti_oid & bti_dummy
taken off from BTItemData.
2. Fix for multi-column indices (nbtsearch.c):
_bt_binsrch() - for searches on internal pages having keysize <
number of attrs we point at the last item < the scankey, not at the
first item = the scankey;
_bt_moveright() - if keysize < number of attrs we compare scankey with
_last_ item on current page to decide should we move right or
not.
Actually required by multi-column indices support.
We still don't use btree for 'A is (not) null', but
now btree keep items with NULL attrs using single rule
for placing/finding items on pages:
NULLs greater NOT_NULLs and NULL = NULL.
+ Bulkload code (nbtsort.c) support for multi-column indices
building and NULLs.
+ Fix for btendscan()->pfree(scanopaque) from Chris Dunlop.
Patches from: aoki@CS.Berkeley.EDU (Paul M. Aoki)
i gave jolly my btree bulkload code a long, long time ago but never
gave him a bunch of my bugfixes. here's a diff against the 6.0
baseline.
for some reason, this code has slowed down somewhat relative to the
insertion-build code on very small tables. don't know why -- it used
to be within about 10%. anyway, here are some (highly unscientific!)
timings on a dec 3000/300 for synthetic tables with 10k, 100k and
1000k tuples (basically, 1mb, 10mb and 100mb heaps). 'c' means
clustered (pre-sorted) inputs and 'u' means unclustered (randomly
ordered) inputs. the 10k table basically fits in the buffer pool, but
the 100k and 1000k tables don't. as you can see, insertion build is
fine if you've sorted your heaps on your index key or if your heap
fits in core, but is absolutely horrible on unordered data (yes,
that's 7.5 hours to index 100mb of data...) because of the zillions of
random i/os.
if it doesn't work for you for whatever reason, you can always turn it
back off by flipping the FastBuild flag in nbtree.c. i don't have
time to maintain it.
good luck!
baseline code:
time psql -c 'create index c10 on k10 using btree (c int4_ops)' bttest
real 8.6
time psql -c 'create index u10 on k10 using btree (b int4_ops)' bttest
real 9.1
time psql -c 'create index c100 on k100 using btree (c int4_ops)' bttest
real 59.2
time psql -c 'create index u100 on k100 using btree (b int4_ops)' bttest
real 652.4
time psql -c 'create index c1000 on k1000 using btree (c int4_ops)' bttest
real 636.1
time psql -c 'create index u1000 on k1000 using btree (b int4_ops)' bttest
real 26772.9
bulkloading code:
time psql -c 'create index c10 on k10 using btree (c int4_ops)' bttest
real 11.3
time psql -c 'create index u10 on k10 using btree (b int4_ops)' bttest
real 10.4
time psql -c 'create index c100 on k100 using btree (c int4_ops)' bttest
real 59.5
time psql -c 'create index u100 on k100 using btree (b int4_ops)' bttest
real 63.5
time psql -c 'create index c1000 on k1000 using btree (c int4_ops)' bttest
real 636.9
time psql -c 'create index u1000 on k1000 using btree (b int4_ops)' bttest
real 701.0