Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
mkdir_p and check_data_dir will be useful in CREATE TABLESPACE, since we
have agreed that that command should handle subdirectory creation just like
initdb creates the PGDATA directory. Push them into src/port/ so that they
are available to both initdb and the backend. Rename to pg_mkdir_p and
pg_check_dir, just to be on the safe side. Add FreeBSD's copyright notice
to pgmkdirp.c, since that's where the code came from originally (this
really should have been in initdb.c). Very marginal code/comment cleanup.
Purely cosmetic patch to make our coding standards more consistent ---
we were doing symbolic some places and octal other places. This patch
fixes all C-coded uses of mkdir, chmod, and umask. There might be some
other calls I missed. Inconsistency noted while researching tablespace
directory permissions issue.
In addition, add support for a "payload" string to be passed along with
each notify event.
This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage. There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.
Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
These files have apparently been edited over the years by a dozen people
with as many different editor settings, which made the alignment of the
paragraphs quite inconsistent and ugly. I made a pass of M-q with Emacs
to straighten it out.
short-circuit the rather expensive identify_system_timezone() procedure,
which we have no real need for during initdb since nothing done here depends
on the timezone setting. Since we launch quite a few standalone backends
during the initdb sequence, this adds up to a significant savings, and seems
worth doing to save developer time even though it will hardly matter to end
users. Per my report today on pgsql-hackers.
Per discussion, this should result in defaulting to SQL_ASCII encoding.
The original coding could not support that because it conflated selection
of SQL_ASCII encoding with not being able to determine the encoding.
Adjust pg_get_encoding_from_locale()'s API to distinguish these cases,
and fix callers appropriately. Only initdb actually changes behavior,
since the other callers were perfectly content to consider these cases
equivalent.
Per bug #5178 from Boh Yap. Not going to bother back-patching, since
no one has complained before and there's an easy workaround (namely,
specify the encoding you want).
flat password file, because it never will anymore. We had managed to
miss this during the recent flat-file-ectomy because it only happens if
--pwfile or --pwprompt is specified to initdb. Apparently, few hackers
use those. Reported by Erik Rijkers.
(could happen if either postgresql.conf or postmaster.opts is empty).
It's been broken since the C version was written for 8.0, so patch
all the way back.
initdb's copy of the function is broken in the same way, but it's
less important there since the input files should never be empty.
Patch that in HEAD only, and also fix some cosmetic differences that
crept into that copy of the function.
Per report from Corry Haines and Jeff Davis.
Recent commits have removed the various uses it was supporting. It was a
performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems
it slowed down user creation after a billion users.
cosmetic --- I'm wondering if geteuid could have side effects on errno,
thus possibly resulting in a misleading error message after failure of
getpwuid.
are using our own ports of getopt or getopt_long, those will define
the variable for themselves; and if not, we don't need these, because
we never touch the variable anyway.
free space information is stored in a dedicated FSM relation fork, with each
relation (except for hash indexes; they don't use FSM).
This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
trace of them from the backend, initdb, and documentation.
Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
introduce a new variant of the get_raw_page(regclass, int4, int4) function in
contrib/pageinspect that let's you to return pages from any relation fork, and
a new fsm_page_contents() function to inspect the new FSM pages.
ctype are now more like encoding, stored in new datcollate and datctype
columns in pg_database.
This is a stripped-down version of Radek Strnad's patch, with further
changes by me.
This allows the use of a ramdrive (either through mount or symlink) for
the temporary file that's written every half second, which should
reduce I/O.
On server shutdown/startup, the file is written to the old location in
the global directory, to preserve data across restarts.
Bump catversion since the $PGDATA directory layout changed.
the postgres.bki file during build, because we want that file to be entirely
platform- and configuration-independent; else it can't safely be put into
/usr/share on multiarch machines. We can do the substitution during initdb,
instead. FLOAT4PASSBYVAL and FLOAT8PASSBYVAL are new breakage as of 8.4,
while the NAMEDATALEN hazard has been there all along but I guess no one
tripped over it. Noticed while trying to build "universal" OS X binaries.
doesn't work, and the real reason why not is it's unclear where the path
is relative to (initdb's CWD, or the data directory?). We could make an
arbitrary decision, but it seems best to make the user be unambiguous.
Per gripe from Devrim.
by explicitly adding back the user to the DACL of the new process.
This fixes the failure case when executing as the Administrator
user, which had no permissions left at all after we dropped the
Administrators group.
Dave Page with some modifications from me
only on the 'language' part of the locale name, ignoring the country code.
We may need to be smarter later when there are more built-in configurations,
but for now this is good enough and avoids having to bloat the table.
renumbering of encoding IDs done between 8.2 and 8.3 turns out to break 8.2
initdb and psql if they are run with an 8.3beta1 libpq.so. For the moment
we can rearrange the order of enum pg_enc to keep the same number for
everything except PG_JOHAB, which isn't a problem since there are no direct
references to it in the 8.2 programs anyway. (This does force initdb
unfortunately.)
Going forward, we want to fix things so that encoding IDs can be changed
without an ABI break, and this commit includes the changes needed to allow
libpq's encoding IDs to be treated as fully independent of the backend's.
The main issue is that libpq clients should not include pg_wchar.h or
otherwise assume they know the specific values of libpq's encoding IDs,
since they might encounter version skew between pg_wchar.h and the libpq.so
they are using. To fix, have libpq officially export functions needed for
encoding name<=>ID conversion and validity checking; it was doing this
anyway unofficially.
It's still the case that we can't renumber backend encoding IDs until the
next bump in libpq's major version number, since doing so will break the
8.2-era client programs. However the code is now prepared to avoid this
type of problem in future.
Note that initdb is no longer a libpq client: we just pull in the two
source files we need directly. The patch also fixes a few places that
were being sloppy about checking for an unrecognized encoding name.
databases with encodings that are incompatible with the server's LC_CTYPE
locale, when we can determine that (which we can on most modern platforms,
I believe). C/POSIX locale is compatible with all encodings, of course,
so there is still some usefulness to CREATE DATABASE's ENCODING option,
but this will insulate us against all sorts of recurring complaints
caused by mismatched settings.
I moved initdb's existing LC_CTYPE-to-encoding mapping knowledge into
a new src/port/ file so it could be shared by CREATE DATABASE.