Commit Graph

211 Commits

Author SHA1 Message Date
Simon Riggs 606c0123d6 Reduce btree scan overhead for < and > strategies
For <, <=, > and >= strategies, mark the first scan key
as already matched if scanning in an appropriate direction.
If the index tuple contains no nulls we can skip the first
re-check for each tuple.

Author: Rajeev Rastogi
Reviewer: Haribabu Kommi
Rework of the code and comments by Simon Riggs
2014-11-18 10:24:55 +00:00
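
The trick above amounts to caching a per-key "already matched" bit set at initial positioning time. A minimal standalone sketch of the idea, with invented names rather than the actual nbtree scan-key flags:

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustration only: a scan key with an "already matched" flag. */
    typedef struct ScanKeyDemo
    {
        int     bound;              /* e.g. WHERE col > bound */
        bool    already_matched;    /* set when initial positioning already
                                     * guaranteed this key holds */
    } ScanKeyDemo;

    /* Per-tuple qual check, skipping the recheck when it is provably true. */
    static bool
    key_matches(const ScanKeyDemo *key, int value, bool has_nulls)
    {
        if (key->already_matched && !has_nulls)
            return true;            /* skip the comparison entirely */
        return value > key->bound;  /* the ordinary recheck */
    }

    int
    main(void)
    {
        ScanKeyDemo key = { .bound = 10, .already_matched = true };
        int         values[] = { 11, 15, 42 };

        for (int i = 0; i < 3; i++)
            printf("%d matches: %s\n", values[i],
                   key_matches(&key, values[i], false) ? "yes" : "no");
        return 0;
    }
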
Heikki Linnakangas 2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge; this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower-level stuff for managing WAL buffers and such is in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Andres Freund 728f152e07 Add rmgr callback to name xlog record types for display purposes.
This is primarily useful for the upcoming pg_xlogdump --stats feature,
but it also allows removing some duplicated code in the rmgr_desc
routines.

Due to the separation and harmonization, the output of displayed
records changes somewhat. But since this isn't end-user oriented
content, that's OK.

It's potentially desirable to further change pg_xlogdump's display of
records. It previously wasn't possible to show the record type
separately from the description, forcing it to be in the last
column. But that's better done in a separate commit.

Author: Abhijit Menon-Sen, slightly editorialized by me
Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas
Discussion: 20140604104716.GA3989@toroid.org
2014-09-19 16:20:29 +02:00
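
A self-contained sketch of the general shape of such a callback table, with invented names rather than the real rmgr API: each resource manager maps a record's info bits to a short type name, so the display code no longer needs per-rmgr special cases.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical miniature of an rmgr table entry: besides redo and desc
     * callbacks, each resource manager exposes a callback that maps a
     * record's info bits to a short type name for display. */
    typedef struct DemoRmgr
    {
        const char *rm_name;
        const char *(*rm_identify) (uint8_t info);
    } DemoRmgr;

    static const char *
    btree_identify(uint8_t info)
    {
        switch (info & 0xF0)
        {
            case 0x00: return "INSERT_LEAF";
            case 0x10: return "INSERT_UPPER";
            default:   return NULL;     /* unknown record type */
        }
    }

    static const DemoRmgr demo_rmgrs[] = {
        {"Btree", btree_identify},
    };

    int
    main(void)
    {
        uint8_t     info = 0x10;
        const char *name = demo_rmgrs[0].rm_identify(info);

        printf("%s/%s\n", demo_rmgrs[0].rm_name, name ? name : "UNKNOWN");
        return 0;
    }
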
Robert Haas 9f03ca9151 Avoid copying index tuples when building an index.
The previous code, perhaps out of concern for avoiding memory leaks, formed
the tuple in one memory context and then copied it to another memory
context.  However, this doesn't appear to be necessary, since
index_form_tuple and the functions it calls take precautions against
leaking memory.  In my testing, building the tuple directly inside the
sort context shaves several percent off the index build time.
Rearrange things so we do that.

Patch by me.  Review by Amit Kapila, Tom Lane, Andres Freund.
2014-07-01 10:34:42 -04:00
Heikki Linnakangas 0ef0b6784c Change the signature of rm_desc so that it's passed a XLogRecord.
Just feels more natural, and is more consistent with rm_redo.
2014-06-14 10:46:48 +03:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not affect backpatching.
2014-05-06 12:12:18 -04:00
Heikki Linnakangas 4fafc4ecd9 Cleanup of new b-tree page deletion code.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
2014-04-23 10:19:54 +03:00
Heikki Linnakangas 40dae7ec53 Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
2014-03-18 20:50:44 +02:00
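
A toy model of the flag protocol described above (structure and names invented for illustration; the real code operates on buffer pages and WAL): the split sets the flag, inserting the downlink clears it, and any later insertion that finds the flag set finishes the split first.

    #include <stdbool.h>
    #include <stdio.h>

    /* The left half of a split carries a flag until the downlink for the
     * new right half reaches the parent. */
    typedef struct PageDemo
    {
        int     blkno;
        bool    incomplete_split;   /* set by split, cleared by downlink insert */
    } PageDemo;

    static void
    split_page(PageDemo *left)
    {
        left->incomplete_split = true;      /* step 1: child split done */
        printf("split block %d, downlink still missing\n", left->blkno);
    }

    static void
    insert_downlink(PageDemo *left)
    {
        left->incomplete_split = false;     /* step 2: parent has the downlink */
        printf("downlink for block %d inserted, flag cleared\n", left->blkno);
    }

    /* Any later insertion that lands on the page first finishes the split. */
    static void
    insert_tuple(PageDemo *page)
    {
        if (page->incomplete_split)
            insert_downlink(page);          /* finish the interrupted split */
        printf("tuple inserted on block %d\n", page->blkno);
    }

    int
    main(void)
    {
        PageDemo left = { .blkno = 7, .incomplete_split = false };

        split_page(&left);
        /* pretend we ran out of disk space here, so no downlink was inserted */
        insert_tuple(&left);                /* a later insert repairs the split */
        return 0;
    }
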
Magnus Hagander 02703ff227 Fix small typo in comment
Michael Paquier
2014-03-17 09:09:21 +01:00
Heikki Linnakangas efada2b8e9 Fix race condition in B-tree page deletion.
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in the debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. Instead
of first checking that the topmost page in the chain of to-be-deleted pages is
not the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
2014-03-14 16:07:19 +02:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Tom Lane 991f3e5ab3 Provide database object names as separate fields in error messages.
This patch addresses the problem that applications currently have to
extract object names from possibly-localized textual error messages,
if they want to know for example which index caused a UNIQUE_VIOLATION
failure.  It adds new error message fields to the wire protocol, which
can carry the name of a table, table column, data type, or constraint
associated with the error.  (Since the protocol spec has always instructed
clients to ignore unrecognized field types, this should not create any
compatibility problem.)

Support for providing these new fields has been added to just a limited set
of error reports (mainly, those in the "integrity constraint violation"
SQLSTATE class), but we will doubtless add them to more calls in future.

Pavel Stehule, reviewed and extensively revised by Peter Geoghegan, with
additional hacking by Tom Lane.
2013-01-29 17:08:26 -05:00
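
The compatibility claim rests on the client-side rule of ignoring unrecognized field types. A hypothetical sketch of that reader, with example field codes rather than the actual protocol letters:

    #include <stdio.h>

    /* Read code/value pairs, use the ones you know, silently skip the rest.
     * Field codes below are examples, not a protocol specification. */
    typedef struct ErrorField
    {
        char        code;
        const char *value;
    } ErrorField;

    static void
    print_error(const ErrorField *fields, int nfields)
    {
        for (int i = 0; i < nfields; i++)
        {
            switch (fields[i].code)
            {
                case 'M': printf("message:    %s\n", fields[i].value); break;
                case 't': printf("table:      %s\n", fields[i].value); break;
                case 'n': printf("constraint: %s\n", fields[i].value); break;
                default:  /* unrecognized field type: ignore it */
                    break;
            }
        }
    }

    int
    main(void)
    {
        ErrorField fields[] = {
            { 'M', "duplicate key value violates unique constraint \"foo_pkey\"" },
            { 't', "foo" },
            { 'n', "foo_pkey" },
            { 'Z', "field added by some future server version" },
        };

        print_error(fields, 4);
        return 0;
    }
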
Tom Lane 31f38f28b0 Redesign the planner's handling of index-descent cost estimation.
Historically we've used a couple of very ad-hoc fudge factors to try to
get the right results when indexes of different sizes would satisfy a
query with the same number of index leaf tuples being visited.  In
commit 21a39de580 I tweaked one of these
fudge factors, with results that proved disastrous for larger indexes.
Commit bf01e34b55 fudged it some more,
but still with not a lot of principle behind it.

What seems like a better way to address these issues is to explicitly model
index-descent costs, since that's what's really at stake when considering
different indexes with similar leaf-page-level costs.  We tried that once
long ago, and found that charging random_page_cost per page descended
through was way too much, because upper btree levels tend to stay in cache
in real-world workloads.  However, there's still CPU costs to think about,
and the previous fudge factors can be seen as a crude attempt to account
for those costs.  So this patch replaces those fudge factors with explicit
charges for the number of tuple comparisons needed to descend the index
tree, plus a small charge per page touched in the descent.  The cost
multipliers are chosen so that the resulting charges are in the vicinity of
the historical (pre-9.2) fudge factors for indexes of up to about a million
tuples, while not ballooning unreasonably beyond that, as the old fudge
factor did (even more so in 9.2).

To make this work accurately for btree indexes, add some code that allows
extraction of the known root-page height from a btree.  There's no
equivalent number readily available for other index types, but we can use
the log of the number of index pages as an approximate substitute.

This seems like too much of a behavioral change to risk back-patching,
but it should improve matters going forward.  In 9.2 I'll just revert
the fudge-factor change.
2013-01-11 12:56:58 -05:00
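
A worked sketch of the shape of the new charge (the multipliers here are assumptions for illustration, not the committed planner constants): roughly log2(ntuples) comparator calls to descend the tree, plus a small amount per page touched on the descent path.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative cost model: comparisons to descend ~ log2(ntuples),
     * plus a small per-page charge for the pages on the way down.
     * The multipliers are assumptions for this example only. */
    static double
    descent_cost(double index_tuples, int tree_height, double cpu_operator_cost)
    {
        double comparisons = ceil(log2(index_tuples > 1 ? index_tuples : 2));
        double per_page    = 50.0 * cpu_operator_cost;      /* assumed multiplier */

        return comparisons * cpu_operator_cost + (tree_height + 1) * per_page;
    }

    int
    main(void)
    {
        double cpu_operator_cost = 0.0025;   /* the usual default GUC value */

        printf("1e3 tuples, height 1: %.4f\n", descent_cost(1e3, 1, cpu_operator_cost));
        printf("1e6 tuples, height 2: %.4f\n", descent_cost(1e6, 2, cpu_operator_cost));
        printf("1e9 tuples, height 3: %.4f\n", descent_cost(1e9, 3, cpu_operator_cost));
        return 0;
    }
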
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Tom Lane 70bc583319 Fix btmarkpos/btrestrpos to handle array keys.
This fixes another error in commit 9e8da0f757.
I neglected to make the mark/restore functionality save and restore the
current set of array key values, which led to strange behavior if an
IndexScan with ScalarArrayOpExpr quals was used as the inner side of a
mergejoin.  Per bug #7570 from Melese Tesfaye.
2012-09-27 17:01:02 -04:00
Bruce Momjian 927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Tom Lane 9789c99d01 Cosmetic cleanup for commit a760893dbd.
Mostly, fixing overlooked comments.
2012-02-21 14:14:16 -05:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Tom Lane 3695a55513 Replace simple constant pg_am.amcanreturn with an AM support function.
The need for this was debated when we put in the index-only-scan feature,
but at the time we had no near-term expectation of having AMs that could
support such scans for only some indexes; so we kept it simple.  However,
the SP-GiST AM forces the issue, so let's fix it.

This patch only installs the new API; no behavior actually changes.
2011-12-18 15:50:37 -05:00
Tom Lane c6e3ac11b6 Create a "sort support" interface API for faster sorting.
This patch creates an API whereby a btree index opclass can optionally
provide non-SQL-callable support functions for sorting.  In the initial
patch, we only use this to provide a directly-callable comparator function,
which can be invoked with a bit less overhead than the traditional
SQL-callable comparator.  While that should be of value in itself, the real
reason for doing this is to provide a datatype-extensible framework for
more aggressive optimizations, as in Peter Geoghegan's recent work.

Robert Haas and Tom Lane
2011-12-07 00:19:39 -05:00
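
A standalone illustration of the point about call overhead, with invented type names: a sort-support-style struct simply carries a plain C comparator that the sort machinery can invoke directly, instead of dispatching through a generic SQL-callable function interface.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustration only: a struct that carries a directly-callable comparator. */
    typedef struct SortSupportDemo
    {
        int (*comparator) (const void *a, const void *b);
    } SortSupportDemo;

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    int
    main(void)
    {
        int             keys[] = { 42, 7, 19, 3 };
        SortSupportDemo ssup = { cmp_int };

        /* The sort machinery calls the comparator directly, no trampoline. */
        qsort(keys, 4, sizeof(int), ssup.comparator);

        for (int i = 0; i < 4; i++)
            printf("%d ", keys[i]);
        printf("\n");
        return 0;
    }
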
Tom Lane 9e8da0f757 Teach btree to handle ScalarArrayOpExpr quals natively.
This allows "indexedcol op ANY(ARRAY[...])" conditions to be used in plain
indexscans, and particularly in index-only scans.
2011-10-16 15:39:24 -04:00
Tom Lane cbfa92c23c Improve index-only scans to avoid repeated access to the index page.
We copy all the matched tuples off the page during _bt_readpage, instead of
expensively re-locking the page during each subsequent tuple fetch.  This
costs a bit more local storage, but not more than 2*BLCKSZ worth, and the
reduction in LWLock traffic is certainly worth that.  What's more, this
lets us get rid of the API wart in the original patch that said an index AM
could randomly decline to supply an index tuple despite having asserted
pg_am.amcanreturn.  That will be important for future improvements in the
index-only-scan feature, since the executor will now be able to rely on
having the index data available.
2011-10-09 00:21:08 -04:00
Bruce Momjian 4bd7333b14 Allow more include files to be compiled on their own by adding missing
include dependencies.

Modify pgcompinclude to skip a common fcinfo error.
2011-08-27 11:05:33 -04:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Robert Haas 53dbc27c62 Support unlogged tables.
The contents of an unlogged table are not WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery.  Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
2010-12-29 06:48:53 -05:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Bruce Momjian 239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Simon Riggs a760893dbd Derive latestRemovedXid for btree deletes by reading heap pages. The
WAL record for btree delete contains a list of tids, even when backup
blocks are present. We follow the tids to their heap tuples, taking
care to follow LP_REDIRECT tuples. We ignore LP_DEAD tuples on the
understanding that they will always have xmin/xmax earlier than any
LP_NORMAL tuples referred to by killed index tuples. Iff all tuples
are LP_DEAD we return InvalidTransactionId. The heap relfilenode is
added to the WAL record, requiring API changes to pass down the heap
Relation. XLOG_PAGE_MAGIC updated.
2010-03-28 09:27:02 +00:00
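
A hedged, self-contained model of the rule described above (simplified structures, no transaction-id wraparound handling): follow redirects, ignore dead items, take the newest xmax among normal tuples, and return an invalid id if everything was already dead.

    #include <stdint.h>
    #include <stdio.h>

    #define INVALID_XID 0

    /* Simplified line-pointer states, for illustration only. */
    typedef enum { LP_NORMAL_D, LP_REDIRECT_D, LP_DEAD_D } LpState;

    typedef struct ItemDemo
    {
        LpState  state;
        int      redirect_to;   /* valid when state == LP_REDIRECT_D */
        uint32_t xmax;          /* valid when state == LP_NORMAL_D */
    } ItemDemo;

    /* Newest xmax among the referenced items, following redirects and
     * ignoring dead items; INVALID_XID if all of them were dead. */
    static uint32_t
    latest_removed_xid(const ItemDemo *items, const int *tids, int ntids)
    {
        uint32_t latest = INVALID_XID;

        for (int i = 0; i < ntids; i++)
        {
            const ItemDemo *it = &items[tids[i]];

            while (it->state == LP_REDIRECT_D)
                it = &items[it->redirect_to];
            if (it->state == LP_DEAD_D)
                continue;               /* already dead: nothing newer here */
            if (it->xmax > latest)
                latest = it->xmax;
        }
        return latest;
    }

    int
    main(void)
    {
        ItemDemo page[] = {
            { LP_REDIRECT_D, 1, 0 },
            { LP_NORMAL_D,   0, 900 },
            { LP_DEAD_D,     0, 0 },
            { LP_NORMAL_D,   0, 875 },
        };
        int tids[] = { 0, 2, 3 };

        printf("latestRemovedXid = %u\n", latest_removed_xid(page, tids, 3));
        return 0;
    }
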
Simon Riggs bf6285b3a7 Further corrections of mismatching struct and btree SizeOf macros.
In this case, the correction is to remove now-unused fields from the struct.
Since these were unused and full of garbage anyway, no version change.
2010-03-20 07:49:48 +00:00
Tom Lane 865b29540e Fix oversight in btpo.xact patch; it was in fact installing garbage
in the xact field on replay, due to not writing out all the data in
the wal log struct.
2010-03-19 20:51:30 +00:00
Simon Riggs 5c73ae17d1 Reset btpo.xact following recovery of btree delete page. Add btpo_xact
field into WAL record and reset it from there, rather than using
FrozenTransactionId which can lead to some corner case bugs.

Problem report and suggested route to a fix from Heikki, details by me.
2010-03-19 10:41:22 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Simon Riggs fafa374f2d Introduce WAL records to log reuse of btree pages, allowing conflict
resolution during Hot Standby. Page reuse interlock requested by Tom.
Analysis and patch by me.
2010-02-13 00:59:58 +00:00
Tom Lane 0a469c8769 Remove old-style VACUUM FULL (which was known for a little while as
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.

Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).

We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples.  This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
2010-02-08 04:33:55 +00:00
Bruce Momjian 0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Simon Riggs efc16ea520 Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 01:32:45 +00:00
Tom Lane 25d9bf2e3e Support deferrable uniqueness constraints.
The current implementation fires an AFTER ROW trigger for each tuple that
looks like it might be non-unique according to the index contents at the
time of insertion.  This works well as long as there aren't many conflicts,
but won't scale to massive unique-key reassignments.  Improving that case
is a TODO item.

Dean Rasheed
2009-07-29 20:56:21 +00:00
Bruce Momjian d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Bruce Momjian 511db38ace Update copyright for 2009. 2009-01-01 17:24:05 +00:00
Heikki Linnakangas 4942ea2870 The flag to mark dead tuples is nowadays called LP_DEAD, not LP_DELETE.
Simon Riggs.
2008-12-30 16:24:37 +00:00
Tom Lane 9d035f4254 Clean up the use of some page-header-access macros: principally, use
SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that
makes the code clearer, and avoid casting between Page and PageHeader where
possible.  Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.

I did not apply the parts of the proposed patch that would have resulted in
slightly changing the on-disk format of hash indexes; it seems to me that's
not a win as long as there's any chance of having in-place upgrade for 8.4.
2008-07-13 20:45:47 +00:00
Alvaro Herrera a3540b0f65 Improve our #include situation by moving pointer types away from the
corresponding struct definitions.  This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
2008-06-19 00:46:06 +00:00
Alvaro Herrera e4ca6cac43 Change xlog.h to xlogdefs.h in bufpage.h, and fix fallout. 2008-06-06 22:35:22 +00:00
Tom Lane d1cbd26ded Repair two places where SIGTERM exit could leave shared memory state
corrupted.  (Neither is very important if SIGTERM is used to shut down the
whole database cluster together, but there's a problem if someone tries to
SIGTERM individual backends.)  To do this, introduce new infrastructure
macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care
of transiently pushing an on_shmem_exit cleanup hook.  Also use this method
for createdb cleanup --- that wasn't a shared-memory-corruption problem,
but SIGTERM abort of createdb could leave orphaned files lying around.

Backpatch as far as 8.2.  The shmem corruption cases don't exist in 8.1,
and the createdb usage doesn't seem important enough to risk backpatching
further.
2008-04-16 23:59:40 +00:00
Tom Lane 4e82a95476 Replace "amgetmulti" AM functions with "amgetbitmap", in which the whole
indexscan always occurs in one call, and the results are returned in a
TIDBitmap instead of a limited-size array of TIDs.  This should improve
speed a little by reducing AM entry/exit overhead, and it is necessary
infrastructure if we are ever to support bitmap indexes.

In an only slightly related change, add support for TIDBitmaps to preserve
(somewhat lossily) the knowledge that particular TIDs reported by an index
need to have their quals rechecked when the heap is visited.  This facility
is not really used yet; we'll need to extend the forced-recheck feature to
plain indexscans before it's useful, and that hasn't been coded yet.
The intent is to use it to clean up 8.3's horrid @@@ kluge for text search
with weighted queries.  There might be other uses in future, but that one
alone is sufficient reason.

Heikki Linnakangas, with some adjustments by me.
2008-04-10 22:25:26 +00:00
Bruce Momjian 9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Tom Lane 93190c3098 Repair still another bug in the btree page split WAL reduction patch:
it failed for splits of non-leaf pages because in such pages the first
data key on a page is suppressed, and so we can't just copy the first
key from the right page to reconstitute the left page's high key.
Problem found by Koichi Suzuki, patch by Heikki.
2007-11-16 19:53:50 +00:00
Bruce Momjian fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane 226a100568 Code review for btree page split WAL reduction patch. Make it actually work
(original code *always* created a full-page image for the left page, thus
leaving the intended savings unrealized), avoid risk of not having enough room
on the page during xlog restore, squeeze out another couple bytes in the xlog
record, clean up neglected comments.
2007-04-11 20:47:38 +00:00
Tom Lane 56218fbc48 Minor tweaking of index special-space definitions so that the various
index types can be reliably distinguished by examining the special space
on an index page.  Per my earlier proposal, plus the realization that
there's no need for btree's vacuum cycle ID to cycle through every possible
16-bit value.  Restricting its range a little costs nearly nothing and
eliminates the possibility of collisions.
Memo to self: remember to make bitmap indexes play along with this scheme,
assuming that patch ever gets accepted.
2007-04-09 22:04:08 +00:00
Bruce Momjian b79575ce45 Reduce WAL activity for page splits:
> Currently, an index split writes all the data on the split page to
> WAL. That's a lot of WAL traffic. The tuples that are copied to the
> right page need to be WAL logged, but the tuples that stay on the
> original page don't.

Heikki Linnakangas
2007-02-08 05:05:53 +00:00
Tom Lane 23c4978e6c Rename MaxTupleSize to MaxHeapTupleSize to clarify that it's not meant to
describe the maximum size of index tuples (which is typically AM-dependent
anyway); and consequently remove the bogus deduction for "special space"
that was built into it.

Adjust TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE to avoid wasting two
bytes per toast chunk, and to ensure that the calculation correctly tracks any
future changes in page header size.  The computation had been inaccurate in a
way that didn't cause any harm except space wastage, but future changes could
have broken it more drastically.

Fix the calculation of BTMaxItemSize, which was formerly computed as 1 byte
more than it could safely be.  This didn't cause any harm in practice because
it's only compared against maxalign'd lengths, but future changes in the size
of page headers or btree special space could have exposed the problem.

initdb forced because of change in TOAST_MAX_CHUNK_SIZE, which alters the
storage of toast tables.
2007-02-05 04:22:18 +00:00
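
A worked, standalone version of the kind of page-size arithmetic involved (the sizes and the requirement that three items fit per page are illustrative assumptions, not the shipped macros):

    #include <stddef.h>
    #include <stdio.h>

    /* Take the page size, subtract the page header, the special space, and
     * one line pointer per item, and require that three maximally-sized
     * items still fit.  All constants here are assumptions for the example. */
    #define BLCKSZ_DEMO        8192
    #define PAGE_HEADER_DEMO   24      /* assumed page header size */
    #define LINE_POINTER_DEMO  4       /* assumed line pointer size */
    #define SPECIAL_SPACE_DEMO 16      /* assumed btree special space */
    #define MAXALIGN_DOWN_DEMO(x) ((x) & ~((size_t) 7))

    int
    main(void)
    {
        size_t usable = BLCKSZ_DEMO
            - PAGE_HEADER_DEMO
            - SPECIAL_SPACE_DEMO
            - 3 * LINE_POINTER_DEMO;    /* three line pointers, one per item */
        size_t max_item = MAXALIGN_DOWN_DEMO(usable / 3);

        printf("max index item size with these assumptions: %zu bytes\n", max_item);
        return 0;
    }
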
Neil Conway 2b7334d487 Refactor the index AM API slightly: move currentItemData and
currentMarkData from IndexScanDesc to the opaque structs for the
AMs that need this information (currently gist and hash).

Patch from Heikki Linnakangas, fixes by Neil Conway.
2007-01-20 18:43:35 +00:00
Tom Lane 4431758229 Support ORDER BY ... NULLS FIRST/LAST, and add ASC/DESC/NULLS FIRST/NULLS LAST
per-column options for btree indexes.  The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options.  The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.

Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass.  This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
2007-01-09 02:14:16 +00:00
Bruce Momjian 29dccf5fe0 Update CVS HEAD for 2007 copyright. Back branches are typically not
back-stamped for this.
2007-01-05 22:20:05 +00:00
Tom Lane 70ce5c9082 Fix "failed to re-find parent key" btree VACUUM failure by revising page
deletion code to avoid the case where an upper-level btree page remains "half
dead" for a significant period of time, and to block insertions into a key
range that is in process of being re-assigned to the right sibling of the
deleted page's parent.  This prevents the scenario reported by Ed L. wherein
index keys could become out-of-order in the grandparent index level.

Since this is a moderately invasive fix, I'm applying it only to HEAD.
The bug exists back to 7.4, but the back branches will get a different patch.
2006-11-01 19:43:17 +00:00
Bruce Momjian f99a569a2e pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
Tom Lane 08ae5edc5c Optimize the case where a btree indexscan has current and mark positions
on the same index page; we can avoid data copying as well as buffer refcount
manipulations in this common case.  Makes for a small but noticeable
improvement in mergejoin speed.

Heikki Linnakangas
2006-08-24 01:18:34 +00:00
Tom Lane e002836913 Make recovery from WAL be restartable, by executing a checkpoint-like
operation every so often.  This improves the usefulness of PITR log
shipping for hot standby: formerly, if the standby server crashed, it
was necessary to restart it from the last base backup and replay all
the WAL since then.  Now it will only need to reread about the same
amount of WAL as the master server would.  The behavior might also
come in handy during a long PITR replay sequence.  Simon Riggs,
with some editorialization by Tom Lane.
2006-08-07 16:57:57 +00:00
Tom Lane e6284649b9 Modify btree to delete known-dead index entries without an actual VACUUM.
When we are about to split an index page to do an insertion, first look
to see if any entries marked LP_DELETE exist on the page, and if so remove
them to try to make enough space for the desired insert.  This should reduce
index bloat in heavily-updated tables, although of course you still need
VACUUM eventually to clean up the heap.

Junji Teramoto
2006-07-25 19:13:00 +00:00
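
A minimal model of the idea (flags and structures simplified to a toy array): before splitting, compact away entries already marked dead and split only if that still leaves no room.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy page: a fixed-size array of entries, some flagged as known dead. */
    typedef struct EntryDemo
    {
        int  key;
        bool dead;      /* marked by earlier scans that saw the row was gone */
    } EntryDemo;

    /* Compact away dead entries; return the new entry count. */
    static int
    vacuum_page(EntryDemo *entries, int n)
    {
        int kept = 0;

        for (int i = 0; i < n; i++)
            if (!entries[i].dead)
                entries[kept++] = entries[i];
        return kept;
    }

    int
    main(void)
    {
        EntryDemo page[] = { {1, false}, {2, true}, {3, false}, {4, true} };
        int       n = 4, capacity = 4;

        /* About to insert and the page is full: reclaim dead entries first,
         * and only split if that still leaves no room. */
        if (n == capacity)
            n = vacuum_page(page, n);
        printf(n < capacity ? "insert fits, no split needed\n"
                            : "still full, split the page\n");
        return 0;
    }
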
Tom Lane d29b66882a Tweak fillfactor code as per my recent proposal. Fix nbtsort.c so that
it can handle small fillfactors for ordinary-sized index entries without
failing on large ones; fix nbtinsert.c to distinguish leaf and nonleaf
pages; change the minimum fillfactor to 10% for all index types.
2006-07-11 21:05:57 +00:00
Tom Lane b7b78d24f7 Code review for FILLFACTOR patch. Change WITH grammar as per earlier
discussion (including making def_arg allow reserved words), add missed
opt_definition for UNIQUE case.  Put the reloptions support code in a less
random place (I chose to make a new file access/common/reloptions.c).
Eliminate header inclusion creep.  Make the index options functions safely
user-callable (seems like client apps might like to be able to test validity
of options before trying to make an index).  Reduce overhead for normal case
with no options by allowing rd_options to be NULL.  Fix some unmaintainably
klugy code, including getting rid of Natts_pg_class_fixed at long last.
Some stylistic cleanup too, and pay attention to keeping comments in sync
with code.

Documentation still needs work, though I did fix the omissions in
catalogs.sgml and indexam.sgml.
2006-07-03 22:45:41 +00:00
Bruce Momjian 277807bd9e Add FILLFACTOR to CREATE INDEX.
ITAGAKI Takahiro
2006-07-02 02:23:23 +00:00
Tom Lane 5749f6ef0c Rewrite btree vacuuming to fold the former bulkdelete and cleanup operations
into a single mostly-physical-order scan of the index.  This requires some
ticklish interlocking considerations, but should create no material
performance impact on normal index operations (at least given the
already-committed changes to make scans work a page at a time).  VACUUM
itself should get significantly faster in any index that's degenerated to a
very nonlinear page order.  Also, we save one pass over the index entirely,
except in the case where there were no deletions to do and so only one pass
happened anyway.

Original patch by Heikki Linnakangas, rework by Tom Lane.
2006-05-08 00:00:17 +00:00
Tom Lane 09cb5c0e7d Rewrite btree index scans to work a page at a time in all cases (both
btgettuple and btgetmulti).  This eliminates the problem of "re-finding" the
exact stopping point, since the stopping point is effectively always a page
boundary, and index items are never moved across pre-existing page boundaries.
A small penalty is that the keys_are_unique optimization is effectively
disabled (and, therefore, is removed in this patch), causing us to apply
_bt_checkkeys() to at least one more tuple than necessary when looking up a
unique key.  However, the advantages for non-unique cases seem great enough to
accept this tradeoff.  Aside from simplifying and (sometimes) speeding up the
indexscan code, this will allow us to reimplement btbulkdelete as a largely
sequential scan instead of index-order traversal, thereby significantly
reducing the cost of VACUUM.  Those changes will come in a separate patch.

Original patch by Heikki Linnakangas, rework by Tom Lane.
2006-05-07 01:21:30 +00:00
Tom Lane 49a7610c36 Fix an ancient oversight in btree xlog replay. When trying to determine if an
upper-level insertion completes a previously-seen split, we cannot simply grab
the downlink block number out of the buffer, because the buffer could contain
a later state of the page --- or perhaps the page doesn't even exist at all
any more, due to relation truncation.  These possibilities have been masked up
to now because the use of full_page_writes effectively ensured that no xlog
replay routine ever actually saw a page state newer than its own change.
Since we're deprecating full_page_writes in 8.1.*, there's no need to fix this
in existing release branches, but we need a fix in HEAD if we want to have any
hope of re-allowing full_page_writes.  Accordingly, adjust the contents of
btree WAL records so that we can always get the downlink block number from the
WAL record rather than having to depend on buffer contents.  Per report from
Kevin Grittner and Peter Brant.

Improve a few comments in related code while at it.
2006-04-13 03:53:05 +00:00
Tom Lane 89bda95d82 Remove the 'slow' path for btree index build, which built the btree
incrementally by successive inserts rather than by sorting the data.
We were only using the slow path during bootstrap, apparently because
when first written it failed during bootstrap --- but it works fine now
AFAICT.  Removing it saves a hundred or so lines of code and produces
noticeably (~10%) smaller initial states of the system catalog indexes.
While that won't make much difference for heavily-modified catalogs,
for the more static ones there may be a useful long-term performance
improvement.
2006-04-01 03:03:37 +00:00
Tom Lane a8b8f4db23 Clean up WAL/buffer interactions as per my recent proposal. Get rid of the
misleadingly-named WriteBuffer routine, and instead require routines that
change buffer pages to call MarkBufferDirty (which does exactly what it says).
We also require that they do so before calling XLogInsert; this takes care of
the synchronization requirement documented in SyncOneBuffer.  Note that
because bufmgr takes the buffer content lock (in shared mode) while writing
out any buffer, it doesn't matter whether MarkBufferDirty is executed before
the buffer content change is complete, so long as the content change is
completed before releasing exclusive lock on the buffer.  So it's OK to set
the dirtybit before we fill in the LSN.
This eliminates the former kluge of needing to set the dirtybit in LockBuffer.
Aside from making the code more transparent, we can also add some new
debugging assertions, in particular that the caller of MarkBufferDirty must
hold the buffer content lock, not merely a pin.
2006-03-31 23:32:07 +00:00
Tom Lane 0a20207060 Arrange to emit a description of the current XLOG record as error context
when an error occurs during xlog replay.  Also, replace the former risky
'write into a fixed-size buffer with no overflow detection' API for XLOG
record description routines; use an expansible StringInfo instead.  (The
latter accounts for most of the patch bulk.)

Qingqing Zhou
2006-03-24 04:32:13 +00:00
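
A self-contained miniature of the safer pattern (not the real StringInfo API): grow the buffer as needed instead of formatting into a fixed-size array with no overflow detection.

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A tiny expansible string buffer, in the spirit of the change above;
     * the real StringInfo API differs in names and details. */
    typedef struct StrBufDemo
    {
        char   *data;
        size_t  len;
        size_t  cap;
    } StrBufDemo;

    static void
    strbuf_init(StrBufDemo *b)
    {
        b->cap = 64;
        b->len = 0;
        b->data = malloc(b->cap);
        b->data[0] = '\0';
    }

    static void
    strbuf_appendf(StrBufDemo *b, const char *fmt, ...)
    {
        va_list ap;
        int     need;

        va_start(ap, fmt);
        need = vsnprintf(NULL, 0, fmt, ap);     /* measure first */
        va_end(ap);

        if (b->len + need + 1 > b->cap)
        {
            while (b->len + need + 1 > b->cap)
                b->cap *= 2;                    /* grow, never overflow */
            b->data = realloc(b->data, b->cap);
        }

        va_start(ap, fmt);
        vsnprintf(b->data + b->len, b->cap - b->len, fmt, ap);
        va_end(ap);
        b->len += need;
    }

    int
    main(void)
    {
        StrBufDemo buf;

        strbuf_init(&buf);
        strbuf_appendf(&buf, "insert_leaf: rel %u/%u/%u; ", 1663u, 16384u, 16385u);
        strbuf_appendf(&buf, "blk %u, off %d", 42u, 7);
        printf("%s\n", buf.data);
        free(buf.data);
        return 0;
    }
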
Bruce Momjian f2f5b05655 Update copyright for 2006. Update scripts. 2006-03-05 15:59:11 +00:00
Tom Lane c389760c32 Remove the no-longer-useful BTItem/BTItemData level of structure, and
just refer to btree index entries as plain IndexTuples, which is what
they have been for a very long time.  This is mostly just an exercise
in removing extraneous notation, but it does save a palloc/pfree cycle
per index insertion.
2006-01-25 23:04:21 +00:00
Tom Lane 7ccaf13a06 Instead of using a numberOfRequiredKeys count to distinguish required
and non-required keys in a btree index scan, mark the required scankeys
with private flag bits SK_BT_REQFWD and/or SK_BT_REQBKWD.  This seems
at least marginally clearer to me, and it eliminates a wired-into-the-
data-structure assumption that required keys are consecutive.  Even though
that assumption will remain true for the foreseeable future, having it
in there makes the code seem more complex than necessary.
2006-01-23 22:31:41 +00:00
Tom Lane cefcbbf1fd Push the responsibility for handling ignore_killed_tuples down into
_bt_checkkeys(), instead of checking it in the top-level nbtree.c routines
as formerly.  This saves a little bit of loop overhead, but more importantly
it lets us skip performing the index key comparisons for dead tuples.
2005-12-07 19:37:53 +00:00
Tom Lane 766dc45d9f Add defenses to btree and hash index AMs to do simple sanity checks
on every index page they read; in particular to catch the case of an
all-zero page, which PageHeaderIsValid allows to pass.  It turns out
hash already had this idea, but it was just Assert()ing things rather
than doing a straight error check, and the Asserts were partially
redundant with PageHeaderIsValid anyway.  Per recent failure example
from Jim Nasby.  (gist still needs the same treatment.)
2005-11-06 19:29:01 +00:00
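
A standalone version of the simplest such check (a sketch only; the real code also validates the btree special space): reject a page of all zero bytes, which a header-only validity test would otherwise let through.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE_DEMO 8192

    /* Return true if the page is entirely zero bytes -- a state that a
     * header-only validity check can mistake for a valid empty page. */
    static bool
    page_is_all_zero(const unsigned char *page)
    {
        for (int i = 0; i < PAGE_SIZE_DEMO; i++)
            if (page[i] != 0)
                return false;
        return true;
    }

    int
    main(void)
    {
        unsigned char page[PAGE_SIZE_DEMO];

        memset(page, 0, sizeof(page));
        if (page_is_all_zero(page))
            fprintf(stderr, "ERROR: index contains unexpected zero page\n");
        return 0;
    }
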
Bruce Momjian 1dc3498251 Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
Tom Lane 4c8495a1f2 Remove the mostly-stubbed-out-anyway support routines for WAL UNDO.
That code is never going to be used in the foreseeable future, and
where it's more than a stub it's making the redo routines harder to
read.
2005-06-06 17:01:25 +00:00
Tom Lane bf3dbb5881 First steps towards index scans with heap access decoupled from index
access: define new index access method functions 'amgetmulti' that can
fetch multiple TIDs per call.  (The functions exist but are totally
untested as yet.)  Since I was modifying pg_am anyway, remove the
no-longer-needed 'rel' parameter from amcostestimate functions, and
also remove the vestigial amowner column that was creating useless
work for Alvaro's shared-object-dependencies project.
Initdb forced due to changes in pg_am.
2005-03-27 23:53:05 +00:00
Tom Lane ee4ddac137 Convert index-related tuple handling routines from char 'n'/' ' to bool
convention for isnull flags.  Also, remove the useless InsertIndexResult
return struct from index AM aminsert calls --- there is no reason for
the caller to know where in the index the tuple was inserted, and we
were wasting a palloc cycle per insert to deliver this uninteresting
value (plus nontrivial complexity in some AMs).
I forced initdb because of the change in the signature of the aminsert
routines, even though nothing really looks at those pg_proc entries...
2005-03-21 01:24:04 +00:00
PostgreSQL Daemon 2ff501590b Tag appropriate files for rc3
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
2004-12-31 22:04:05 +00:00
Bruce Momjian b6b71b85bc Pgindent run for 8.0. 2004-08-29 05:07:03 +00:00
Bruce Momjian da9a8649d8 Update copyright to 2004. 2004-08-29 04:13:13 +00:00
Tom Lane fe548629c5 Invent ResourceOwner mechanism as per my recent proposal, and use it to
keep track of portal-related resources separately from transaction-related
resources.  This allows cursors to work in a somewhat sane fashion with
nested transactions.  For now, cursor behavior is non-subtransactional,
that is a cursor's state does not roll back if you abort a subtransaction
that fetched from the cursor.  We might want to change that later.
2004-07-17 03:32:14 +00:00
Tom Lane 94d4d240bb Rename XLOG_BTREE_NEWPAGE xlog record type into XLOG_HEAP_NEWPAGE, and
shift support code into heapam.c accordingly.  This is in service of
soon-to-be-committed ALTER TABLE SET TABLESPACE code that will want to
use this same record type for both heaps and indexes.

Theoretically I should have forced initdb for this, but in practice there
is no change in xlog contents because CVS tip will never really emit this
record type anyhow...
2004-07-11 18:01:45 +00:00
Tom Lane 2095206de1 Adjust btree index build to not use shared buffers, thereby avoiding the
locking conflict against concurrent CHECKPOINT that was discussed a few
weeks ago.  Also, if not using WAL archiving (which is always true ATM
but won't be if PITR makes it into this release), there's no need to
WAL-log the index build process; it's sufficient to force-fsync the
completed index before commit.  This seems to gain about a factor of 2
in my tests, which is consistent with writing half as much data.  I did
not try it with WAL on a separate drive though --- probably the gain would
be a lot less in that scenario.
2004-06-02 17:28:18 +00:00
Tom Lane 37fa3b6c89 Tweak indexscan and seqscan code to arrange that steps from one page to
the next are handled by ReleaseAndReadBuffer rather than separate
ReleaseBuffer and ReadBuffer calls.  This cuts the number of acquisitions
of the BufMgrLock by a factor of 2 (possibly more, if an indexscan happens
to pull successive rows from the same heap page).  Unfortunately this
doesn't seem enough to get us out of the recently discussed context-switch
storm problem, but it's surely worth doing anyway.
2004-04-21 18:24:26 +00:00
Tom Lane 391c3811a2 Rename SortMem and VacuumMem to work_mem and maintenance_work_mem.
Make btree index creation and initial validation of foreign-key constraints
use maintenance_work_mem rather than work_mem as their memory limit.
Add some code to guc.c to allow these variables to be referenced by their
old names in SHOW and SET commands, for backwards compatibility.
2004-02-03 17:34:04 +00:00
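
A hedged sketch of the backwards-compatibility mapping (the real guc.c machinery is more involved): resolve the retired names onto the new ones before looking the setting up.

    #include <stdio.h>
    #include <string.h>

    /* Old-name -> new-name aliases, so SHOW/SET with the old names keep working. */
    typedef struct GucAliasDemo
    {
        const char *old_name;
        const char *new_name;
    } GucAliasDemo;

    static const GucAliasDemo aliases[] = {
        { "sort_mem",   "work_mem" },
        { "vacuum_mem", "maintenance_work_mem" },
    };

    static const char *
    resolve_guc_name(const char *name)
    {
        for (size_t i = 0; i < sizeof(aliases) / sizeof(aliases[0]); i++)
            if (strcmp(name, aliases[i].old_name) == 0)
                return aliases[i].new_name;
        return name;    /* not an alias: use it as-is */
    }

    int
    main(void)
    {
        printf("SHOW sort_mem -> %s\n", resolve_guc_name("sort_mem"));
        printf("SHOW work_mem -> %s\n", resolve_guc_name("work_mem"));
        return 0;
    }
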
Tom Lane 569659ae16 Improve btree's initial-positioning-strategy code so that we never need
to step more than one entry after descending the search tree to arrive at
the correct place to start the scan.  This can improve the behavior
substantially when there are many entries equal to the chosen boundary
value.  Per suggestion from Dmitry Tkach, 14-Jul-03.
2003-12-21 01:23:06 +00:00
PostgreSQL Daemon 55b113257c make sure the $Id tags are converted to $PostgreSQL as well ... 2003-11-29 22:41:33 +00:00
Tom Lane fa5c8a055a Cross-data-type comparisons are now indexable by btrees, pursuant to my
pghackers proposal of 8-Nov.  All the existing cross-type comparison
operators (int2/int4/int8 and float4/float8) have appropriate support.
The original proposal of storing the right-hand-side datatype as part of
the primary key for pg_amop and pg_amproc got modified a bit in the event;
it is easier to store zero as the 'default' case and only store a nonzero
when the operator is actually cross-type.  Along the way, remove the
long-since-defunct bigbox_ops operator class.
2003-11-12 21:15:59 +00:00
Tom Lane c1d62bfd00 Add operator strategy and comparison-value datatype fields to ScanKey.
Remove the 'strategy map' code, which was a large amount of mechanism
that no longer had any use except reverse-mapping from procedure OID to
strategy number.  Passing the strategy number to the index AM in the
first place is simpler and faster.
This is a preliminary step in planned support for cross-datatype index
operations.  I'm committing it now since the ScanKeyEntryInitialize()
API change touches quite a lot of files, and I want to commit those
changes before the tree drifts under me.
2003-11-09 21:30:38 +00:00
Tom Lane e33f205a94 Adjust btree index build procedure so that the btree metapage looks
invalid (has the wrong magic number) until the build is entirely
complete.  This turns out to cost no additional writes in the normal
case, since we were rewriting the metapage at the end of the process
anyway.  In normal scenarios there's no real gain in security, because
a failed index build would roll back the transaction leaving an unused
index file, but for rebuilding shared system indexes this seems to add
some useful protection.
2003-09-29 23:40:26 +00:00
Bruce Momjian 46785776c4 Another pgindent run with updated typedefs. 2003-08-08 21:42:59 +00:00
Bruce Momjian f3c3deb7d0 Update copyrights to 2003. 2003-08-04 02:40:20 +00:00
Bruce Momjian 089003fb46 pgindent run. 2003-08-04 00:43:34 +00:00
Tom Lane 3bbd6af37c Adjust btbulkdelete logic so that only one WAL record is issued while
deleting multiple index entries on a single index page.  This makes for
a very substantial reduction in the amount of WAL traffic during a
large delete operation.
2003-02-23 22:43:09 +00:00
Tom Lane 88dc31e3f2 First cut at recycling space in btree indexes. Still some rough edges
to fix, but it seems to basically work...
2003-02-23 06:17:13 +00:00
Tom Lane 799bc58dc7 More infrastructure for btree compaction project. Tree-traversal code
now knows what to do upon hitting a dead page (in theory anyway, it's
untested...).  Add a post-VACUUM-cleanup entry point for index AMs, to
provide a place for dead-page scavenging to happen.
Also, fix oversight that broke btpo_prev links in temporary indexes.
initdb forced due to additions in pg_am.
2003-02-22 00:45:05 +00:00
Tom Lane 70508ba7ae Make btree index structure adjustments and WAL logging changes needed to
support btree compaction, as per proposal of a few days ago.  btree index
pages no longer store parent links, instead they have a level indicator
(counting up from zero for leaf pages).  The FixBTree recovery logic is
removed, and replaced by code that detects missing parent-level insertions
during WAL replay.  Also, generate appropriate WAL entries when updating
btree metapage and when building a btree index from scratch.  I believe
btree indexes are now completely WAL-legal for the first time.
initdb forced due to index and WAL changes.
2003-02-21 00:06:22 +00:00
Bruce Momjian 33f1687879 There already was a macro PageGetItemId; this is now used in (almost)
all places where pd_linp is accessed.  Also introduce new macros
SizeOfPageHeaderData and BTMaxItemSize. This is just source code
cosmetics; no behaviour changed.

Manfred Koizar
2002-07-02 05:48:44 +00:00
Bruce Momjian d84fe82230 Update copyright to 2002. 2002-06-20 20:29:54 +00:00