in snapshots, per my proposal of a few days ago. Also, tweak heapam.c
routines (heap_insert, heap_update, heap_delete, heap_mark4update) to
be passed the command ID to use, instead of doing GetCurrentCommandID.
For catalog updates they'll still get passed current command ID, but
for updates generated from the main executor they'll get passed the
command ID saved in the snapshot the query is using. This should fix
some corner cases associated with functions and triggers that advance
current command ID while an outer query is still in progress.
yesterday's proposal to pghackers. Also remove unnecessary parameters
to heap_beginscan, heap_rescan. I modified pg_proc.h to reflect the
new numbers of parameters for the AM interface routines, but did not
force an initdb because nothing actually looks at those fields.
GUC support. It's now possible to set datestyle, timezone, and
client_encoding from postgresql.conf and per-database or per-user
settings. Also, implement rollback of SET commands that occur in a
transaction that later fails. Create a SET LOCAL var = value syntax
that sets the variable only for the duration of the current transaction.
All per previous discussions in pghackers.
path. The default behavior if no per-user schemas are created is that
all users share a 'public' namespace, thus providing behavior backwards
compatible with 7.2 and earlier releases. Probably the semantics and
default setting will need to be fine-tuned, but this is a start.
in schemas other than the system namespace; however, there's no search
path yet, and not all operations work yet on tables outside the system
namespace.
per recent pghackers discussion: force a new WAL record at first nextval
after a checkpoint, and ensure that xlog is flushed to disk if a nextval
record is the only thing emitted by a transaction.
PostgreSQL. This hash function replaces the one used by hash indexes and
the catalog cache. Hash joins use a different, relatively poor-quality
hash function, but I'll fix that later.
As suggested by Tom Lane, this patch also changes the size of the fixed
hash table used by the catalog cache to be a power-of-2 (instead of a
prime: I chose 256 instead of 257). This allows the catcache to lookup
hash buckets using a simple bitmask. This should improve the performance
of the catalog cache slightly, since the previous method (modulo a
prime) was slow.
In my tests, this improves the performance of hash indexes by between 4%
and 8%; the performance when using btree indexes or seqscans is
basically unchanged.
Neil Conway <neilconway@rogers.com>
(current as of a few hours ago.)
This patch:
1. Adds PG_GETARG_xxx_P_SLICE() macros and associated support routines.
2. Adds routines in src/backend/access/tuptoaster.c for fetching only
necessary chunks of a toasted value. (Modelled on latest changes to
assume chunks are returned in order).
3. Amends text_substr and bytea_substr to use new methods. It now
handles multibyte cases -and should still lead to a performance
improvement in the multibyte case where the substring is near the
beginning of the string.
4. Added new command: ALTER TABLE tabname ALTER COLUMN colname SET
STORAGE {PLAIN | EXTERNAL | EXTENDED | MAIN} to parser and documented in
alter-table.sgml. (NB I used ColId as the item type for the storage
mode string, rather than a new production - I hope this makes sense!).
All this does is sets attstorage for the specified column.
4. AlterTableAlterColumnStatistics is now AlterTableAlterColumnFlags and
handles both statistics and storage (it uses the subtype code to
distinguish). The previous version of my patch also re-arranged other
code in backend/commands/command.c but I have dropped that from this
patch.(I plan to return to it separately).
5. Documented new macros (and also the PG_GETARG_xxx_P_COPY macros) in
xfunc.sgml. ref/alter_table.sgml also contains documentation for ALTER
COLUMN SET STORAGE.
John Gray
are now both invoked once per received SQL command (raw parsetree) from
pg_exec_query_string. BeginCommand is actually just an empty routine
at the moment --- all its former operations have been pushed into tuple
receiver setup routines in printtup.c. This makes for a clean distinction
between BeginCommand/EndCommand (once per command) and the tuple receiver
setup/teardown routines (once per ExecutorRun call), whereas the old code
was quite ad hoc. Along the way, clean up the calling conventions for
ExecutorRun a little bit.
hashname() and reduce the penalty incured when NAMEDATALEN is increased.
I posted this to -hackers a couple days ago, and there haven't been any
major complaints. It passes the regression tests. See -hackers for more
discussion, as well as the suggestion from Tom Lane on which this patch
is based.
Unless anyone sees any problems, please apply for 7.3.
Cheers,
Neil Conway
Improve 'pg_internal.init' relcache entry preload mechanism so that it is
safe to use for all system catalogs, and arrange to preload a realistic
set of system-catalog entries instead of only the three nailed-in-cache
indexes that were formerly loaded this way. Fix mechanism for deleting
out-of-date pg_internal.init files: this must be synchronized with transaction
commit, not just done at random times within transactions. Drive it off
relcache invalidation mechanism so that no special-case tests are needed.
Cache additional information in relcache entries for indexes (their pg_index
tuples and index-operator OIDs) to eliminate repeated lookups. Also cache
index opclass info at the per-opclass level to avoid repeated lookups during
relcache load.
Generalize 'systable scan' utilities originally developed by Hiroshi,
move them into genam.c, use in a number of places where there was formerly
ugly code for choosing either heap or index scan. In particular this allows
simplification of the logic that prevents infinite recursion between syscache
and relcache during startup: we can easily switch to heapscans in relcache.c
when and where needed to avoid recursion, so IndexScanOK becomes simpler and
does not need any expensive initialization.
Eliminate useless opening of a heapscan data structure while doing an indexscan
(this saves an mdnblocks call and thus at least one kernel call).
recreated since the start of our transaction, our first reference to it
errored out because we'd try to reuse our old relcache entry for it.
Do this by accepting SI inval messages just before relcache search in
heap_openr, so that dead relcache entries will be flushed before we
search. Also, break heap_open/openr into two pairs of routines,
relation_open(r) and heap_open(r). The relation_open routines make
no tests on relkind and so can be used to open anything that has a
pg_class entry. The heap_open routines are wrappers that add a relkind
test to preserve their established behavior. Use the relation_open
routines in several places that had various kluge solutions for opening
rels that might be either heap or index rels.
Also, remove the old 'heap stats' code that's been superseded by Jan's
stats collector, and clean up some inconsistencies in error reporting
between the different types of ALTER TABLE.
Modified the parser and the SET handlers to use full Node structures
rather than simply a character string argument.
Implement INTERVAL() YEAR TO MONTH (etc) syntax per SQL99.
Does not yet accept the goofy string format that goes along with, but
this should be fairly straight forward to fix now as a bug or later
as a feature.
Implement precision for the INTERVAL() type.
Use the typmod mechanism for both of INTERVAL features.
Fix the INTERVAL syntax in the parser:
opt_interval was in the wrong place.
INTERVAL is now a reserved word, otherwise we get reduce/reduce errors.
Implement an explicit date_part() function for TIMETZ.
Should fix coersion problem with INTERVAL reported by Peter E.
Fix up some error messages for date/time types.
Use all caps for type names within message.
Fix recently introduced side-effect bug disabling 'epoch' as a recognized
field for date_part() etc. Reported by Peter E. (??)
Bump catalog version number.
Rename "microseconds" current transaction time field
from ...Msec to ...Usec. Duh!
date/time regression tests updated for reference platform, but a few
changes will be necessary for others.
lookup info in the relcache for index access method support functions.
This makes a huge difference for dynamically loaded support functions,
and should save a few cycles even for built-in ones. Also tweak dfmgr.c
so that load_external_function is called only once, not twice, when
doing fmgr_info for a dynamically loaded function. All per performance
gripe from Teodor Sigaev, 5-Oct-01.
existing lock manager and spinlocks: it understands exclusive vs shared
lock but has few other fancy features. Replace most uses of spinlocks
with lightweight locks. All remaining uses of spinlocks have very short
lock hold times (a few dozen instructions), so tweak spinlock backoff
code to work efficiently given this assumption. All per my proposal on
pghackers 26-Sep-01.
Define a new function, GetCurrentTransactionStartTimeUsec() to get the time
to this precision.
Allow now() and timestamp 'now' to use this higher precision result so
we now have fractional seconds in this "constant".
Add timestamp without time zone type.
Move previous timestamp type to timestamp with time zone.
Accept another ISO variant for date/time values: yyyy-mm-ddThh:mm:ss
(note the "T" separating the day from hours information).
Remove 'current' from date/time types; convert to 'now' in input.
Separate time and timetz regression tests.
Separate timestamp and timestamptz regression test.
buffer manager with 'pg_clog', a specialized access method modeled
on pg_xlog. This simplifies startup (don't need to play games to
open pg_log; among other things, OverrideTransactionSystem goes away),
should improve performance a little, and opens the door to recycling
commit log space by removing no-longer-needed segments of the commit
log. Actual recycling is not there yet, but I felt I should commit
this part separately since it'd still be useful if we chose not to
do transaction ID wraparound.
pgsql-hackers. pg_opclass now has a row for each opclass supported by each
index AM, not a row for each opclass name. This allows pg_opclass to show
directly whether an AM supports an opclass, and furthermore makes it possible
to store additional information about an opclass that might be AM-dependent.
pg_opclass and pg_amop now store "lossy" and "haskeytype" information that we
previously expected the user to remember to provide in CREATE INDEX commands.
Lossiness is no longer an index-level property, but is associated with the
use of a particular operator in a particular index opclass.
Along the way, IndexSupportInitialize now uses the syscaches to retrieve
pg_amop and pg_amproc entries. I find this reduces backend launch time by
about ten percent, at the cost of a couple more special cases in catcache.c's
IndexScanOK.
Initial work by Oleg Bartunov and Teodor Sigaev, further hacking by Tom Lane.
initdb forced.
default, but OIDS are removed from many system catalogs that don't need them.
Some interesting side effects: TOAST pointers are 20 bytes not 32 now;
pg_description has a three-column key instead of one.
Bugs fixed in passing: BINARY cursors work again; pg_class.relhaspkey
has some usefulness; pg_dump dumps comments on indexes, rules, and
triggers in a valid order.
initdb forced.
(as proposed in http://fts.postgresql.org/db/mw/msg.html?mid=1028327)
2. support for 'pass-by-value' arguments - to test this
we used special opclass for int4 with values in range [0-2^15]
More testing will be done after resolving problem with
index_formtuple and implementation of B-tree using GiST
3. small patch to contrib modules (seg,cube,rtree_gist,intarray) -
mark functions as 'isstrict' where needed.
Oleg Bartunov
rather than deleting them only to have to create more. Steady state
is 2*CHECKPOINT_SEGMENTS + WAL_FILES + 1 segment files, which will
simply be renamed rather than constantly deleted and recreated.
To make this safe, added current XLOG file/offset number to page
header of XLOG pages, so that an un-overwritten page from an old
incarnation of a logfile can be reliably told from a valid page.
This change means that if you try to restart postmaster in a CVS-tip
database after installing the change, you'll get a complaint about
bad XLOG page magic number. If you don't want to initdb, run
contrib/pg_resetxlog (and be sure you shut down the old postmaster
cleanly).
per previous discussion on pghackers. Most of the duplicate code in
different AMs' ambuild routines has been moved out to a common routine
in index.c; this means that all index types now do the right things about
inserting recently-dead tuples, etc. (I also removed support for EXTEND
INDEX in the ambuild routines, since that's about to go away anyway, and
it cluttered the code a lot.) The retail indextuple deletion routines have
been replaced by a "bulk delete" routine in which the indexscan is inside
the access method. I haven't pushed this change as far as it should go yet,
but it should allow considerable simplification of the internal bookkeeping
for deletions. Also, add flag columns to pg_am to eliminate various
hardcoded tests on AM OIDs, and remove unused pg_am columns.
Fix rtree and gist index types to not attempt to store NULLs; before this,
gist usually crashed, while rtree managed not to crash but computed wacko
bounding boxes for NULL entries (which might have had something to do with
the performance problems we've heard about occasionally).
Add AtEOXact routines to hash, rtree, and gist, all of which have static
state that needs to be reset after an error. We discovered this need long
ago for btree, but missed the other guys.
Oh, one more thing: concurrent VACUUM is now the default.
validity checking rules for VACUUM. Make some other rearrangements of the
VACUUM code to allow more code to be shared between full and lazy VACUUM.
Minor code cleanups and added comments for TransactionId manipulations.
stub) into the rest of the system. Adopt a cleaner approach to preventing
deadlock in concurrent heap_updates: allow RelationGetBufferForTuple to
select any page of the rel, and put the onus on it to lock both buffers
in a consistent order. Remove no-longer-needed isExtend hack from
API of ReleaseAndReadBuffer.
pg_database now has unique indexes on oid and on datname.
pg_shadow now has unique indexes on usename and on usesysid.
pg_am now has unique index on oid.
pg_opclass now has unique index on oid.
pg_amproc now has unique index on amid+amopclaid+amprocnum.
Remove pg_rewrite's unnecessary index on oid, delete unused RULEOID syscache.
Remove index on pg_listener and associated syscache for performance reasons
(caching rows that are certain to change before you need 'em again is
rather pointless).
Change pg_attrdef's nonunique index on adrelid into a unique index on
adrelid+adnum.
Fix various incorrect settings of pg_class.relisshared, make that the
primary reference point for whether a relation is shared or not.
IsSharedSystemRelationName() is now only consulted to initialize relisshared
during initial creation of tables and indexes. In theory we might now
support shared user relations, though it's not clear how one would get
entries for them into pg_class &etc of multiple databases.
Fix recently reported bug that pg_attribute rows created for an index all have
the same OID. (Proof that non-unique OID doesn't matter unless it's
actually used to do lookups ;-))
There's no need to treat pg_trigger, pg_attrdef, pg_relcheck as bootstrap
relations. Convert them into plain system catalogs without hardwired
entries in pg_class and friends.
Unify global.bki and template1.bki into a single init script postgres.bki,
since the alleged distinction between them was misleading and pointless.
Not to mention that it didn't work for setting up indexes on shared
system relations.
Rationalize locking of pg_shadow, pg_group, pg_attrdef (no need to use
AccessExclusiveLock where ExclusiveLock or even RowExclusiveLock will do).
Also, hold locks until transaction commit where necessary.
appropriate pin-count manipulation, and instead use ReleaseAndReadBuffer.
Make use of the fact that the passed-in buffer (if there is one) must
be pinned to avoid grabbing the bufmgr spinlock when we are able to
return this same buffer. Eliminate unnecessary 'previous tuple' and
'next tuple' fields of HeapScanDesc and IndexScanDesc, thereby removing
a whole lot of bookkeeping from heap_getnext() and related routines.
report on old-style functions invoked by RI triggers. We had a number of
other places that were being sloppy about which memory context FmgrInfo
subsidiary data will be allocated in. Turns out none of them actually
cause a problem in 7.1, but this is for arcane reasons such as the fact
that old-style triggers aren't supported anyway. To avoid getting burnt
later, I've restructured the trigger support so that we don't keep trigger
FmgrInfo structs in relcache memory. Some other related cleanups too:
it's not really necessary to call fmgr_info at all while setting up
the index support info in relcache entries, because those ScanKeyEntry
structs are never used to invoke the functions. This should speed up
relcache initialization a tiny bit.
Python) to support shared extension modules, I have learned that Guido
prefers the style of the attached patch to solve the above problem.
I feel that this solution is particularly appropriate in this case
because the following:
PglargeType
PgType
PgQueryType
are already being handled in the way that I am proposing for PgSourceType.
Jason Tishler
PageGetFreeSpace() was being called while not holding the buffer lock, which
not only could yield a garbage answer, but even if it's the right answer there
might be less space available after we reacquire the buffer lock.
Also repair potential deadlock introduced by my recent performance improvement
in RelationGetBufferForTuple(): it was possible for two heap_updates to try to
lock two buffers in opposite orders. The fix creates a global rule that
buffers of a single heap relation should be locked in decreasing block number
order. Currently, this only applies to heap_update; VACUUM can get away with
ignoring the rule since it holds exclusive lock on the whole relation anyway.
However, if we try to implement a VACUUM that can run in parallel with other
transactions, VACUUM will also have to obey the lock order rule.
a separate statement (though it can still be invoked as part of VACUUM, too).
pg_statistic redesigned to be more flexible about what statistics are
stored. ANALYZE now collects a list of several of the most common values,
not just one, plus a histogram (not just the min and max values). Random
sampling is used to make the process reasonably fast even on very large
tables. The number of values and histogram bins collected is now
user-settable via an ALTER TABLE command.
There is more still to do; the new stats are not being used everywhere
they could be in the planner. But the remaining changes for this project
should be localized, and the behavior is already better than before.
A not-very-related change is that sorting now makes use of btree comparison
routines if it can find one, rather than invoking '<' twice.
O_SYNC, or O_DSYNC (as available on a given platform). Add GUC parameter
to control sync method.
Also, add defense to XLogWrite to prevent it from going nuts if passed
a target write position that's past the end of the buffers so far filled
by XLogInsert.
detect case that next page in log came from an older run than the prior
page. This avoids the necessity to re-zero the log after recovery from
a crash, which is good because we need not risk destroying valuable log
information.
This forces another initdb since yesterday :-(. Need to get that log
reset utility done...
* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record
is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control
itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).
* Suppress successive checkpoint records when nothing has been entered
in the WAL log since the last one. This is not so much to avoid I/O
as to make it actually useful to keep track of the last two
checkpoints. If the things are right next to each other then there's
not a lot of redundancy gained...
* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
on alternate bytes. Polynomial borrowed from ECMA DLT1 standard.
* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.
* Change XID allocation to work more like OID allocation. (This is of
dubious necessity, but I think it's a good idea anyway.)
* Fix a number of minor bugs, such as off-by-one logic for XLOG file
wraparound at the 4 gig mark.
* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.
* Checkpoint will now occur every CHECKPOINT_SEGMENTS log segments or
every CHECKPOINT_TIMEOUT seconds, whichever comes first. It is also
possible to force a checkpoint by sending SIGUSR1 to the postmaster
(undocumented feature...)
* Defend against kill -9 postmaster by storing shmem block's key and ID
in postmaster.pid lockfile, and checking at startup to ensure that no
processes are still connected to old shmem block (if it still exists).
* Switch backends to accept SIGQUIT rather than SIGUSR1 for emergency
stop, for symmetry with postmaster and xlog utilities. Clean up signal
handling in bootstrap.c so that xlog utilities launched by postmaster
will react to signals better.
* Standalone bootstrap now grabs lockfile in target directory, as added
insurance against running it in parallel with live postmaster.
only if at least N other backends currently have open transactions. This
is not a great deal of intelligence about whether a delay might be
profitable ... but it beats no intelligence at all. Note that the default
COMMIT_DELAY is still zero --- this new code does nothing unless that
setting is changed.
Also, mark ENABLEFSYNC as a system-wide setting. It's no longer safe to
allow that to be set per-backend, since we may be relying on some other
backend's fsync to have synced the WAL log.
compressed storage works perfectly well. Might as well have a coherent
strategy for applying it, rather than the haphazard store-what-you-get
approach that was in the code before. The strategy I've set up here is
to attempt compression of any compressible index value exceeding
BLCKSZ/16, or about 500 bytes by default.
bothering to check the return value --- which meant that in case the
update or delete failed because of a concurrent update, you'd not find
out about it, except by observing later that the transaction produced
the wrong outcome. There are now subroutines simple_heap_update and
simple_heap_delete that should be used anyplace that you're not prepared
to do the full nine yards of coping with concurrent updates. In
practice, that seems to mean absolutely everywhere but the executor,
because *noplace* else was checking.
are treated more like 'cancel' interrupts: the signal handler sets a
flag that is examined at well-defined spots, rather than trying to cope
with an interrupt that might happen anywhere. See pghackers discussion
of 1/12/01.
are now critical sections, so as to ensure die() won't interrupt us while
we are munging shared-memory data structures. Avoid insecure intermediate
states in some code that proc_exit will call, like palloc/pfree. Rename
START/END_CRIT_CODE to START/END_CRIT_SECTION, since that seems to be
what people tend to call them anyway, and make them be called with () like
a function call, in hopes of not confusing pg_indent.
I doubt that this is sufficient to make SIGTERM safe anywhere; there's
just too much code that could get invoked during proc_exit().
1. Distinguish cases where a Datum representing a tuple datatype is an OID
from cases where it is a pointer to TupleTableSlot, and make sure we use
the right typlen in each case.
2. Make fetchatt() and related code support 8-byte by-value datatypes on
machines where Datum is 8 bytes. Centralize knowledge of the available
by-value datatype sizes in two macros in tupmacs.h, so that this will be
easier if we ever have to do it again.
to ensure that we have released buffer refcounts and so forth, rather than
putting ad-hoc operations before (some of the calls to) proc_exit. Add
commentary to discourage future hackers from repeating that mistake.
both MULTIBYTE and TOAST prevent char(n) from being truly fixed-size.
Simplify and speed up fastgetattr() and index_getattr() macros by
eliminating special cases for attnum=1. It's just as fast to handle
the first attribute by presetting its attcacheoff to zero; so do that
instead when loading the tupledesc in relcache.c.
re-adopt these settings at every postmaster or standalone-backend startup.
This should fix problems with indexes becoming corrupt due to failure to
provide consistent locale environment for postmaster at all times. Also,
refuse to start up a non-locale-enabled compilation in a database originally
initdb'd with a non-C locale. Suppress LIKE index optimization if locale
is not "C" or "POSIX" (are there any other locales where it's safe?).
Issue NOTICE during initdb if selected locale disables LIKE optimization.
as MaxHeapAttributeNumber. Increase MaxAttrSize to something more
reasonable (given what it's used for, namely checking char(n) declarations,
I didn't make it the full 1G that it could theoretically be --- 10Mb
seemed a more reasonable number). Improve calculation of MaxTupleSize.
trying to toast tuples inserted into toast tables! Fix is two-pronged:
first, ensure all columns of a toast table are marked attstorage='p',
and second, alter the target chunk size so that it's less than the
threshold for trying to toast a tuple. (Code tried to do that but the
expression was wrong.) A few cosmetic cleanups in tuptoaster too.
NOTE: initdb forced due to change in toaster chunk-size.
These two routines will now ALWAYS elog() on failure, whether you ask for
a lock or not. If you really want to get a NULL return on failure, call
the new routines heap_open_nofail()/heap_openr_nofail(). By my count there
are only about three places that actually want that behavior. There were
rather more than three places that were missing the check they needed to
make under the old convention :-(.
actually, but who could understand it with no comments? Fix bug
while at it: _bt_orderkeys would try to invoke comparisons on
NULL inputs, given the right sort of redundant quals.
duplicate keys by letting search go to the left rather than right when an
equal key is seen at an upper tree level. Fix poor choice of page split
point (leading to insertion failures) that was forced by chaining logic.
Don't store leftmost key in non-leaf pages, since it's not necessary.
Don't create root page until something is first stored in the index, so an
unused index is now 8K not 16K. (Doesn't seem to be as easy to get rid of
the metadata page, unfortunately.) Massive cleanup of unreadable code,
fix poor, obsolete, and just plain wrong documentation and comments.
See src/backend/access/nbtree/README for the gory details.
pass-by-ref data types --- eg, an index on lower(textfield) --- no longer
leak memory during index creation or update. Clean up a lot of redundant
code ... did you know that copy, vacuum, truncate, reindex, extend index,
and bootstrap each basically duplicated the main executor's logic for
extracting information about an index and preparing index entries?
Functional indexes should be a little faster now too, due to removal
of repeated function lookups.
CREATE INDEX 'opt_type' clause is deimplemented by these changes,
but I haven't removed it from the parser yet (need to merge with
Thomas' latest change set first).
Don't go through pg_exec_query_dest(), but directly to the execution
routines. Also, extend parameter lists so that there's no need to
change the global setting of allowSystemTableMods, a hack that was
certain to cause trouble in the event of any error.
Don't use DISABLE_COMPLEX_MACRO on Solaris. Don't define the
replacement function in the header file. Use -KPIC, not -K PIC.
Use CC to link C++ libraries, not ld/ar.
Eliminate file not found warnings in tcl build code.
entries now for int8 and network hash indexes. int24_ops and int42_ops
are gone. pg_opclass no longer contains multiple entries claiming to be
the default opclass for the same datatype. opr_sanity regress test
extended to catch errors like these in the future.
materialized tupleset is small enough) instead of a temporary relation.
This was something I was thinking of doing anyway for performance, and Jan
says he needs it for TOAST because he doesn't want to cope with toasting
noname relations. With this change, the 'noname table' support in heap.c
is dead code, and I have accordingly removed it. Also clean up 'noname'
plan handling in planner --- nonames are either sort or materialize plans,
and it seems less confusing to handle them separately under those names.
inputs have been converted to newstyle. This should go a long way towards
fixing our portability problems with platforms where char and short
parameters are passed differently from int-width parameters. Still
more to do for the Alpha port however.
key call sites are changed, but most called functions are still oldstyle.
An exception is that the PL managers are updated (so, for example, NULL
handling now behaves as expected in plperl and plpgsql functions).
NOTE initdb is forced due to added column in pg_proc.
project I am working on (Recall - a distributed, fault-tolerant,
replicated, storage framework @ http://www.fault-tolerant.org).
Recall is written in C++. I need to include the postgres headers and
there are some problems when including the headers w/C++.
Attached is a patch generated from postgres/src that fixes my problems.
I was hoping to get this into the main source. It's very small (2k) and
3 files are changed: backend/utils/fmgr/fmgr.c,
backend/utils/Gen_fmgrtab.sh.in, and include/access/tupdesc.h.
In C++, you get a multiply defined symbol because the variable
(FmgrInfo *fmgr_pl_finfo) is defined in the header (the patch moves it
to the .c file). The other problem in tupdesc.h is the use of typeid
is a problem in c++ (I renamed it to oidtypeid).
Thanks,
Neal Norwitz
running gcc and HP's cc with warnings cranked way up. Signed vs unsigned
comparisons, routines declared static and then defined not-static,
that kind of thing. Tedious, but perhaps useful...
appropriate btree three-way comparison routine. Not clear why the
three-way comparison routines were being used in some paths and not
others in btree --- incomplete changes by someone long ago, maybe?
Anyway, this makes for a nice speedup in CREATE INDEX.
syscache and relcache flushes). Relcache entry rebuild now preserves
original tupledesc, rewrite rules, and triggers if possible, so that pointers
to these things remain valid --- if these things change while relcache entry
has positive refcount, we elog(ERROR) to avoid later crash. Arrange for
xact-local rels to be rebuilt when an SI inval message is seen for them,
so that they are updated by CommandCounterIncrement the same as regular rels.
(This is useful because of Hiroshi's recent changes to process our own SI
messages at CommandCounterIncrement time.) This allows simplification of
some routines that previously hacked around the lack of an automatic update.
catcache now keeps its own copy of tupledesc for its relation, rather than
depending on the relcache's copy; this avoids needing to reinitialize catcache
during a cache flush, which saves some cycles and eliminates nasty circularity
problems that occur if a cache flush happens while trying to initialize a
catcache.
Eliminate a number of permanent memory leaks that used to happen during
catcache or relcache flush; not least of which was that catcache never
freed any cached tuples! (Rule parsetree storage is still leaked, however;
will fix that separately.)
Nothing done yet about code that uses tuples retrieved by SearchSysCache
for longer than is safe.
pghackers discussion of 5-Jan-2000. The amopselect and amopnpages
estimators are gone, and in their place is a per-AM amcostestimate
procedure (linked to from pg_am, not pg_amop).
relcache entry no longer leaks a small amount of memory. index_endscan
now releases all the memory acquired by index_beginscan, so callers of it
should NOT pfree the scan descriptor anymore.
a generalized module 'tuplesort.c' that can sort either HeapTuples or
IndexTuples, and is not tied to execution of a Sort node. Clean up
memory leakages in sorting, and replace nbtsort.c's private implementation
of mergesorting with calls to tuplesort.c.
expressions in CREATE TABLE. There is no longer an emasculated expression
syntax for these things; it's full a_expr for constraints, and b_expr
for defaults (unfortunately the fact that NOT NULL is a part of the
column constraint syntax causes a shift/reduce conflict if you try a_expr.
Oh well --- at least parenthesized boolean expressions work now). Also,
stored expression for a column default is not pre-coerced to the column
type; we rely on transformInsertStatement to do that when the default is
actually used. This means "f1 datetime default 'now'" behaves the way
people usually expect it to.
BTW, all the support code is now there to implement ALTER TABLE ADD
CONSTRAINT and ALTER TABLE ADD COLUMN with a default value. I didn't
actually teach ALTER TABLE to call it, but it wouldn't be much work.
additional argument specifying the kind of lock to acquire/release (or
'NoLock' to do no lock processing). Ensure that all relations are locked
with some appropriate lock level before being examined --- this ensures
that relevant shared-inval messages have been processed and should prevent
problems caused by concurrent VACUUM. Fix several bugs having to do with
mismatched increment/decrement of relation ref count and mismatched
heap_open/close (which amounts to the same thing). A bogus ref count on
a relation doesn't matter much *unless* a SI Inval message happens to
arrive at the wrong time, which is probably why we got away with this
sloppiness for so long. Repair missing grab of AccessExclusiveLock in
DROP TABLE, ALTER/RENAME TABLE, etc, as noted by Hiroshi.
Recommend 'make clean all' after pulling this update; I modified the
Relation struct layout slightly.
Will post further discussion to pghackers list shortly.
Also, move responsibility for calling vc_abort into main xact.c list of
things-to-call-at-abort. What in the world was it doing down inside of
TransactionIdAbort()?
and possibly for other cases too:
DO NOT cache status of transaction in unknown state
(i.e. non-committed and non-aborted ones)
Example:
T1 reads row updated/inserted by running T2 and cache T2 status.
T2 commits.
Now T1 reads a row updated by T2 and with HEAP_XMAX_COMMITTED
in t_infomask (so cached T2 status is not changed).
Now T1 EvalPlanQual gets updated row version without HEAP_XMIN_COMMITTED
-> TransactionIdDidCommit(t_xmin) and TransactionIdDidAbort(t_xmin)
return FALSE and T2 decides that t_xmin is not committed and gets
ERROR above.
It's too late to find more smart way to handle such cases and so
I just changed xact status caching and got rid TransactionIdFlushCache()
from code.
Changed: transam.c, xact.c, lmgr.c and transam.h - last three
just because of TransactionIdFlushCache() is removed.
2. heapam.c:
T1 marked a row for update. T2 waits for T1 commit/abort.
T1 commits. T3 updates the row before T2 locks row page.
Now T2 sees that new row t_xmax is different from xact id (T1)
T2 was waiting for. Old code did Assert here. New one goes to
HeapTupleSatisfiesUpdate. Obvious changes too.
3. Added Assert to vacuum.c
4. bufmgr.c: break
Assert(buf->r_locks == 0 && !buf->ri_lock)
into two Asserts.
BT_READ/BT_WRITE are BUFFER_LOCK_SHARE/BUFFER_LOCK_EXCLUSIVE now.
Also get rid of #define BT_VERSION_1 - we use version 1 as default
for near two years now.
2. Much faster btree tuples deletion in the case when first on page
index tuple is deleted (no movement to the left page(s)).
3. Remember blkno of new root page in BTPageOpaque of
left/right siblings when root page is splitted.
so that fetching an attribute value needs only one SearchSysCacheTuple call
instead of two redundant searches. This speeds up a large SELECT by about
ten percent, and probably will help GROUP BY and SELECT DISTINCT too.
no longer returns buffer pointer, can be gotten from scan;
descriptor; bootstrap can create multi-key indexes;
pg_procname index now is multi-key index; oidint2, oidint4, oidname
are gone (must be removed from regression tests); use System Cache
rather than sequential scan in many places; heap_modifytuple no
longer takes buffer parameter; remove unused buffer parameter in
a few other functions; oid8 is not index-able; remove some use of
single-character variable names; cleanup Buffer variables usage
and scan descriptor looping; cleaned up allocation and freeing of
tuples; 18k lines of diff;
1. Remove the char2, char4, char8 and char16 types from postgresql
2. Change references of char16 to name in the regression tests.
3. Rename the char16.sql regression test to name.sql. 4. Modify
the regression test scripts and outputs to match up.
Might require new regression.{SYSTEM} files...
Darren King
The following patches will allow postgreSQL 6.3 to compile and run on a
UNIXWARE 2.1.2 system with the native C compiler with the following library
change:
The alloca function must be copied from the libucb.a archive and added
to the libgen.a archive.
Also, the GNU flex program is needed to successfully build postgreSQL.
Patch by: wieck@sapserv.debis.de (Jan Wieck)
One of the design rules of PostgreSQL is extensibility. And
to follow this rule means (at least for me) that there should
not only be a builtin PL. Instead I would prefer a defined
interface for PL implemetations.
/*
** You can have as many strategies as you please in GiSTs, as
** long as your consistent method can handle them
*/
#define GISTNStrategies 100
^^^
- too big number:
strat.h->StrategyEvaluationData->StrategyExpression expression[12]
^^
- so 12 is real max # of strategies, or StrategyEvaluationIsValid
crashes backend (called if CASSER defined).
Reply-To: hackers@hub.org, Dan McGuirk <mcguirk@indirect.com>
To: hackers@hub.org
Subject: [HACKERS] tmin writeback optimization
I was doing some profiling of the backend, and noticed that during a certain
benchmark I was running somewhere between 30% and 75% of the backend's CPU
time was being spent in calls to TransactionIdDidCommit() from
HeapTupleSatisfiesNow() or HeapTupleSatisfiesItself() to determine that
changed rows' transactions had in fact been committed even though the rows'
tmin values had not yet been set.
When a query looks at a given row, it needs to figure out whether the
transaction that changed the row has been committed and hence it should pay
attention to the row, or whether on the other hand the transaction is still
in progress or has been aborted and hence the row should be ignored. If
a tmin value is set, it is known definitively that the row's transaction
has been committed. However, if tmin is not set, the transaction
referred to in xmin must be looked up in pg_log, and this is what the
backend was spending a lot of time doing during my benchmark.
So, implementing a method suggested by Vadim, I created the following
patch that, the first time a query finds a committed row whose tmin value
is not set, sets it, and marks the buffer where the row is stored as
dirty. (It works for tmax, too.) This doesn't result in the boost in
real time performance I was hoping for, however it does decrease backend
CPU usage by up to two-thirds in certain situations, so it could be
rather beneficial in high-concurrency settings.
Patches from: aoki@CS.Berkeley.EDU (Paul M. Aoki)
i gave jolly my btree bulkload code a long, long time ago but never
gave him a bunch of my bugfixes. here's a diff against the 6.0
baseline.
for some reason, this code has slowed down somewhat relative to the
insertion-build code on very small tables. don't know why -- it used
to be within about 10%. anyway, here are some (highly unscientific!)
timings on a dec 3000/300 for synthetic tables with 10k, 100k and
1000k tuples (basically, 1mb, 10mb and 100mb heaps). 'c' means
clustered (pre-sorted) inputs and 'u' means unclustered (randomly
ordered) inputs. the 10k table basically fits in the buffer pool, but
the 100k and 1000k tables don't. as you can see, insertion build is
fine if you've sorted your heaps on your index key or if your heap
fits in core, but is absolutely horrible on unordered data (yes,
that's 7.5 hours to index 100mb of data...) because of the zillions of
random i/os.
if it doesn't work for you for whatever reason, you can always turn it
back off by flipping the FastBuild flag in nbtree.c. i don't have
time to maintain it.
good luck!
baseline code:
time psql -c 'create index c10 on k10 using btree (c int4_ops)' bttest
real 8.6
time psql -c 'create index u10 on k10 using btree (b int4_ops)' bttest
real 9.1
time psql -c 'create index c100 on k100 using btree (c int4_ops)' bttest
real 59.2
time psql -c 'create index u100 on k100 using btree (b int4_ops)' bttest
real 652.4
time psql -c 'create index c1000 on k1000 using btree (c int4_ops)' bttest
real 636.1
time psql -c 'create index u1000 on k1000 using btree (b int4_ops)' bttest
real 26772.9
bulkloading code:
time psql -c 'create index c10 on k10 using btree (c int4_ops)' bttest
real 11.3
time psql -c 'create index u10 on k10 using btree (b int4_ops)' bttest
real 10.4
time psql -c 'create index c100 on k100 using btree (c int4_ops)' bttest
real 59.5
time psql -c 'create index u100 on k100 using btree (b int4_ops)' bttest
real 63.5
time psql -c 'create index c1000 on k1000 using btree (c int4_ops)' bttest
real 636.9
time psql -c 'create index u1000 on k1000 using btree (b int4_ops)' bttest
real 701.0
Changes:
* Unique index capability works using the syntax 'create unique
index'.
* Duplicate OID's in the system tables are removed. I put
little scripts called 'duplicate_oids' and 'find_oid' in
include/catalog that help to find and remove duplicate OID's.
I also moved 'unused_oids' from backend/catalog to
include/catalog, since it has to be in the same directory
as the include files in order to work.
* The backend tries converting the name of a function or aggregate
to all lowercase if the original name given doesn't work (mostly
for compatibility with ODBC).
* You can 'SELECT NULL' to your heart's content.
* I put my _bt_updateitem fix in instead, which uses
_bt_insertonpg so that even if the new key is so big that
the page has to be split, everything still works.
* All literal references to system catalog OID's have been
replaced with references to define'd constants from the catalog
header files.
* I added a couple of node copy functions. I think this was a
preliminary attempt to get rules to work.
Note. all include files that have been hit so far have had extraneous
include files cleaned out and are reduced to...the lowest common
"include file", based on 'cc -Wall -I. test.c', where test.c is:
#include "postgres.h"
#include "<top of branches>" (ie. top of branches this time was utils/fcache2.h)
|
|Here's a patch for Version 2 only. It just adds an Assert to catch some
|inconsistencies in the catalog classes.
|
|--
|Bryan Henderson Phone 408-227-6803
|San Jose, California
|