If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.
This version has very strict behaviour; more relaxed options
may be added at a later date.
Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.
Without this patch, when wal_receiver_status_interval=0, indicating that no
status messages should be sent, Hot Standby feedback messages are instead sent
extremely frequently.
Fujii Masao, with documentation changes by me.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.
Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.
Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
Waiting for relation locks can lead to starvation - it pins down an
autovacuum worker for as long as the lock is held. But if we're doing
an anti-wraparound vacuum, then we still wait; maintenance can no longer
be put off.
To assist with troubleshooting, if log_autovacuum_min_duration >= 0,
we log whenever an autovacuum or autoanalyze is skipped for this reason.
Per a gripe by Josh Berkus, and ensuing discussion.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
Prior to 9.0, restartpoints never created, deleted, or recycled WAL
files, but now they can. This code makes log_checkpoints treat
checkpoints and restartpoints symmetrically. It also adjusts up
the documentation of the parameter to mention restartpoints.
Fujii Masao. Docs by me, as suggested by Itagaki Takahiro.
This tool makes it possible to do the pg_start_backup/
copy files/pg_stop_backup step in a single command.
There are still some steps to be done before this is a
complete backup solution, such as the ability to stream
the required WAL logs, but it's still usable, and
could do with some buildfarm coverage.
In passing, make the checkpoint request optionally
fast instead of hardcoding it.
Magnus Hagander, reviewed by Fujii Masao and Dimitri Fontaine
If wal_buffers is initially set to -1 (which is now the default), it's
replaced by 1/32nd of shared_buffers, with a minimum of 8 (the old default)
and a maximum of the XLOG segment size. The allowed range for manual
settings is still from 4 up to whatever will fit in shared memory.
Greg Smith, with implementation correction by me.
Recent versions of the Linux system header files cause xlogdefs.h to
believe that open_datasync should be the default sync method, whereas
formerly fdatasync was the default on Linux. open_datasync is a bad
choice, first because it doesn't actually outperform fdatasync (in fact
the reverse), and second because we try to use O_DIRECT with it, causing
failures on certain filesystems (e.g., ext4 with data=journal option).
This part of the patch is largely per a proposal from Marti Raudsepp.
More extensive changes are likely to follow in HEAD, but this is as much
change as we want to back-patch.
Also clean up confusing code and incorrect documentation surrounding the
fsync_writethrough option. Those changes shouldn't result in any actual
behavioral change, but I chose to back-patch them anyway to keep the
branches looking similar in this area.
In 9.0 and HEAD, also do some copy-editing on the WAL Reliability
documentation section.
Back-patch to all supported branches, since any of them might get used
on modern Linux versions.
First, avoid scanning the whole ProcArray once we know there
are at least commit_siblings active; second, skip the check
altogether if commit_siblings = 0.
Greg Smith
In particular, we are now more explicit about the fact that you may need
wal_sync_method=fsync_writethrough for crash-safety on some platforms,
including MaxOS X. There's also now an explicit caution against assuming
that the default setting of wal_sync_method is either crash-safe or best
for performance.
which is perhaps not a terribly good spot for it but there doesn't seem to be
a better place. Also add a source-code comment pointing out a couple reasons
for having a separate lock file. Per suggestion from Greg Smith.
Per gripe from Fujii Masao, though this is not exactly his proposed patch.
Categorize as DEVELOPER_OPTIONS and set context PGC_SIGHUP, as per Fujii,
but set the default to LOG because higher values aren't really sensible
(see the code for trace_recovery()). Fix the documentation to agree with
the code and to try to explain what the variable actually does. Get rid
of no-op calls trace_recovery(LOG), which accomplish nothing except to
demonstrate that this option confuses even its author.
Block elements with verbatim formatting (literallayout, programlisting,
screen, synopsis) should be aligned at column 0 independent of the surrounding
SGML, because whitespace is significant, and indenting them creates erratic
whitespace in the output. The CSS stylesheets already take care of indenting
the output.
Assorted markup improvements to go along with it.
I've added a quote_all_identifiers GUC which affects the behavior
of the backend, and a --quote-all-identifiers argument to pg_dump
and pg_dumpall which sets the GUC and also affects the quoting done
internally by those applications.
Design by Tom Lane; review by Alex Hunsaker; in response to bug #5488
filed by Hartmut Goebel.
Normally, we automatically restart after a backend crash, but in some
cases when PostgreSQL is invoked by clusterware it may be desirable to
suppress this behavior, so we provide an option which does this.
Since no existing GUC group quite fits, create a new group called
"error handling options" for this and the previously undocumented GUC
exit_on_error, which is now documented.
Review by Fujii Masao.
log files created by the syslogger process.
In passing, make unix_file_permissions display its value in octal, same
as log_file_mode now does.
Martin Pihlak
Per extensive discussion on pgsql-hackers. We are deliberately not
back-patching this even though the behavior of 8.3 and 8.4 is
unquestionably broken, for fear of breaking existing users of this
parameter. This incompatibility should be release-noted.
to have different values in different processes of the primary server.
Also put it into the "Streaming Replication" GUC category; it doesn't belong
in "Standby Servers" because you use it on the master not the standby.
In passing also correct guc.c's idea of wal_keep_segments' category.
max_standby_streaming_delay, and revise the implementation to avoid assuming
that timestamps found in WAL records can meaningfully be compared to clock
time on the standby server. Instead, the delay limits are compared to the
elapsed time since we last obtained a new WAL segment from archive or since
we were last "caught up" to WAL data arriving via streaming replication.
This avoids problems with clock skew between primary and standby, as well
as other corner cases that the original coding would misbehave in, such
as the primary server having significant idle time between transactions.
Per my complaint some time ago and considerable ensuing discussion.
Do some desultory editing on the hot standby documentation, too.
description for vacuum_defer_cleanup_age to the correct category.
Sections in postgresql.conf are also sorted in the same order with docs.
Per gripe by Fujii Masao, suggestion by Heikki Linnakangas, and patch by me.
MaxStandbyDelay. Use the GUC units mechanism for the value, and choose more
appropriate timestamp functions for performing tests with it. Make the
ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have
behavior similar to ps_activity code elsewhere, notably not updating the
display when update_process_title is off and not truncating the display
contents at an arbitrarily-chosen length. Improve the docs to be explicit
about what MaxStandbyDelay actually measures, viz the difference between
primary and standby servers' clocks, and the possible hazards if their clocks
aren't in sync.
confusion with streaming-replication settings. Also, change its default
value to "off", because of concern about executing new and poorly-tested
code during ordinary non-replicating operation. Per discussion.
In passing do some minor editing of related documentation.
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
Normal superuser processes are allowed to connect even when the database
system is shutting down, or when fewer than superuser_reserved_connection
slots remain. This is intended to make sure an administrator can log in
and troubleshoot, so don't extend these same courtesies to users connecting
for replication.
Also, make the name of the GUC and the name of the backing variable match.
Alnong the way, clean up a couple of slight typographical errors in the
related docs.
The logic for determining whether to materialize has been significantly
overhauled for 9.0. In case there should be any doubt about whether
materialization is a win in any particular case, this should provide a
convenient way of seeing what happens without it; but even with enable_material
turned off, we still materialize in cases where it is required for
correctness.
Thanks to Tom Lane for the review.
doesn't take into account how far the WAL senders are. This way a hung
WAL sender doesn't prevent old WAL segments from being recycled/removed
in the primary, ultimately causing the disk to fill up. Instead add
standby_keep_segments setting to control how many old WAL segments are
kept in the primary. This also makes it more reliable to use streaming
replication without WAL archiving, assuming that you set
standby_keep_segments high enough.
The endterm attribute is mainly useful when the toolchain does not support
automatic link target text generation for a particular situation. In the
past, this was required by the man page tools for all reference page links,
but that is no longer the case, and it now actually gets in the way of
proper automatic link text generation. The only remaining use cases are
currently xrefs to refsects.
how often we do SSL session key renegotiation. Can be set to
0 to disable renegotiation completely, which is required if
a broken SSL library is used (broken patches to CVE-2009-3555
a known cause) or when using a client library that can't do
renegotiation.
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.
default of "plpgsql". This is more reasonable than it was when the DO patch
was written, because we have since decided that plpgsql should be installed
by default. Per discussion, having a parameter for this doesn't seem useful
enough to justify the risk of application breakage if the value is changed
unexpectedly.
woken by alarm we send SIGUSR1 to all backends requesting that they
check to see if they are blocking Startup process. If so, they throw
ERROR/FATAL as for other conflict resolutions. Deadlock stop gap
removed. max_standby_delay = -1 option removed to prevent deadlock.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
Fujii Masao, with additional hacking by me
This patch only supports seq_page_cost and random_page_cost as parameters,
but it provides the infrastructure to scalably support many more.
In particular, we may want to add support for effective_io_concurrency,
but I'm leaving that as future work for now.
Thanks to Tom Lane for design help and Alvaro Herrera for the review.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
Rewrite the documentation in more idiomatic English, and in the process make
it somewhat more succinct. Move the discussion of specific large object
privileges out of the "server-side functions" section, where it certainly
doesn't belong, and into "implementation features". That might not be
exactly right either, but it doesn't seem worth creating a new section for
this amount of information. Fix a few spelling and layout problems, too.
pg_ctl gets a new mode that runs initdb. Adjust the documentation a bit to
not assume that initdb is the only way to run database cluster initialization.
But don't replace initdb as the canonical way.
Author: Zdenek Kotala <Zdenek.Kotala@Sun.COM>
default be "throw error on conflict", as per discussions. The GUC variable
is plpgsql.variable_conflict, with values "error", "use_variable",
"use_column". The behavior can also be specified per-function by inserting
one of
#variable_conflict error
#variable_conflict use_variable
#variable_conflict use_column
at the start of the function body.
The 8.5 release notes will need to mention using "use_variable" to retain
backward-compatible behavior, although we should encourage people to migrate
to the much less mistake-prone "error" setting.
Update the plpgsql documentation to match this and other recent changes.
style by default. Per discussion, there seems to be hardly anything that
really relies on being able to change the regex flavor, so the ability to
select it via embedded options ought to be enough for any stragglers.
Also, if we didn't remove the GUC, we'd really be morally obligated to
mark the regex functions non-immutable, which'd possibly create performance
issues.
Per recent discussion, add_missing_from has been deprecated for long enough to
consider removing, and it's getting in the way of planned parser refactoring.
The system now always behaves as though add_missing_from were OFF.
to create a function for it.
Procedural languages now have an additional entry point, namely a function
to execute an inline code block. This seemed a better design than trying
to hide the transient-ness of the code from the PL. As of this patch, only
plpgsql has an inline handler, but probably people will soon write handlers
for the other standard PLs.
In passing, remove the long-dead LANCOMPILER option of CREATE LANGUAGE.
Petr Jelinek
use that value when the backend is new enough to allow it. This responds
to bug report from Keh-Cheng Chu pointing out that although 2 extra digits
should be sufficient to dump and restore float8 exactly, it is possible to
need 3 extra digits for float4 values.
build actually attempts to advertise itself via Bonjour. Formerly it always
did so, which meant that packagers had to decide for their users whether
this behavior was wanted or not. The default is "off" to be on the safe
side, though this represents a change in the default behavior of a
Bonjour-enabled build. Per discussion.
Instead of sending stdout/stderr to /dev/null after forking away from the
terminal, send them to postmaster.log within the data directory. Since
this opens the door to indefinite logfile bloat, recommend even more
strongly that log output be redirected when using silent_mode.
Move the postmaster's initial calls of load_hba() and load_ident() down
to after we have started the log collector, if we are going to. This
is so that errors reported by them will appear in the "usual" place.
Reclassify silent_mode as a LOGGING_WHERE, not LOGGING_WHEN, parameter,
since it's got absolutely nothing to do with the latter category.
In passing, fix some obsolete references to -S ... this option hasn't
had that switch letter for a long time.
Back-patch to 8.4, since as of 8.4 load_hba() and load_ident() are more
picky (and thus more likely to fail) than they used to be. This entire
change was driven by a complaint about those errors disappearing into
the bit bucket.
Both hex format and the traditional "escape" format are automatically
handled on input. The output format is selected by the new GUC variable
bytea_output.
As committed, bytea_output defaults to HEX, which is an *incompatible
change*. We will keep it this way for awhile for testing purposes, but
should consider whether to switch to the more backwards-compatible
default of ESCAPE before 8.5 is released.
Peter Eisentraut
random number seed each time. This is how it used to work years ago, but
we got rid of the seed reset because it was resetting the main random()
sequence and thus having undesirable effects on the rest of the system.
To fix, establish a private random number state for each execution of
geqo(), and initialize the state using the new GUC variable geqo_seed.
People who want to experiment with different random searches can do so
by changing geqo_seed, but you'll always get the same plan for the same
value of geqo_seed (if holding all other planner inputs constant, of course).
The new state is kept in PlannerInfo by adding a "void *" field reserved
for use by join_search hooks. Most of the rather bulky code changes in
this commit are just arranging to pass PlannerInfo around to all the GEQO
functions (many of which formerly didn't receive it).
Andres Freund, with some editorialization by Tom
instead just pointing out that a larger value may trigger use of GEQO.
Per Robert Haas.
In passing, do a bit of wordsmithing on the Genetic Query Optimizer section.
documentation warnings against setting it nonzero unless active use of
prepared transactions is intended and a suitable transaction manager has been
installed. This should help to prevent the type of scenario we've seen
several times now where a prepared transaction is forgotten and eventually
causes severe maintenance problems (or even anti-wraparound shutdown).
The only real reason we had the default be nonzero in the first place was to
support regression testing of the feature. To still be able to do that,
tweak pg_regress to force a nonzero value during "make check". Since we
cannot force a nonzero value in "make installcheck", add a variant regression
test "expected" file that shows the results that will be obtained when
max_prepared_transactions is zero.
Also, extend the HINT messages for transaction wraparound warnings to mention
the possibility that old prepared transactions are causing the problem.
All per today's discussion.
to 100ms (from 1000). This still seems to be comfortably larger than the
useful range of the parameter, and it should help discourage people from
picking uselessly large values. Tweak the documentation to recommend small
values, too. Per discussion of a couple weeks ago.
per-table overrides of parameters.
This removes a whole class of problems related to misusing the catalog,
and perhaps more importantly, gives us pg_dump support for the parameters.
Based on a patch by Euler Taveira de Oliveira, heavily reworked by me.
GUC variable effective_io_concurrency controls how many concurrent block
prefetch requests will be issued.
(The best way to handle this for plain index scans is still under debate,
so that part is not applied yet --- tgl)
Greg Stark
the default. This setting enables constraint exclusion checks only for
appendrel members (ie, inheritance children and UNION ALL arms), which are
the cases in which constraint exclusion is most likely to be useful. Avoiding
the overhead for simple queries that are unlikely to benefit should bring
the cost down to the point where this is a reasonable default setting.
Per today's discussion.
specifically, we can input either the "format with designators" or the
"alternative format", and we can output the former when IntervalStyle is set
to iso_8601.
Ron Mayer
from DateStyle, and create a new interval style that produces output matching
the SQL standard (at least for interval values that fall within the standard's
restrictions). IntervalStyle is also used to resolve the conflict between the
standard and traditional Postgres rules for interpreting negative interval
input.
Ron Mayer
upon requests from backends, rather than on a fixed 500msec cycle. (There's
still throttling logic to ensure it writes no more often than once per
500msec, though.) This should result in a significant reduction in stats file
write traffic in typical scenarios where the stats are demanded only
infrequently.
This approach also means that the former difficulty with changing
stats_temp_directory on-the-fly has gone away, so remove the caution about
that as well as the thrashing we did to minimize the trouble window.
In passing, also fix pgstat_report_stat() so that we will send a stats
message if we have function call stats but not table stats to report;
this fixes a bug in the recent patch to support function-call stats.
Martin Pihlak
free space information is stored in a dedicated FSM relation fork, with each
relation (except for hash indexes; they don't use FSM).
This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
trace of them from the backend, initdb, and documentation.
Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
introduce a new variant of the get_raw_page(regclass, int4, int4) function in
contrib/pageinspect that let's you to return pages from any relation fork, and
a new fsm_page_contents() function to inspect the new FSM pages.
debug_print_plan to appear at LOG message level, not DEBUG1 as historically.
Make debug_pretty_print default to on. Also, cause plans generated via
EXPLAIN to be subject to debug_print_plan. This is all to make
debug_print_plan a reasonably comfortable substitute for the former behavior
of EXPLAIN VERBOSE.
variable stats_temp_directory, instead of requiring the admin to
mount/symlink the pg_stat_tmp directory manually.
For now the config variable is PGC_POSTMASTER. Room for further improvment
that would allow it to be changed on-the-fly.
wal_segment_size to make those configuration parameters available to clients,
in the same way that block_size was previously exposed. Bernd Helmle, with
comments from Abhijit Menon-Sen and some further tweaking by me.
As the buffer could now be a lot larger than before, and copying it could
thus be a lot more expensive than before, use strcpy instead of memcpy to
copy the query string, as was already suggested in comments. Also, only copy
the PgBackendStatus struct and string if the slot is in use.
Patch by Thomas Lee, with some changes by me.
functions.
Note that because this patch changes FmgrInfo, any external C functions
you might be testing with 8.4 will need to be recompiled.
Patch by Martin Pihlak, some editorialization by me (principally, removing
tracking of getrusage() numbers)
it vary with BLCKSZ as before. This agrees with what the documentation says,
and avoids a regression test problem when BLCKSZ is larger than default.
Per recent discussion.
of each plan node, instead of its former behavior of dumping the internal
representation of the plan tree. The latter display is still available for
those who really want it (see debug_print_plan), but uses for it are certainly
few and and far between. Per discussion.
This patch also removes the explain_pretty_print GUC, which is obsoleted
by the change.
This requires a working 64-bit integer type. If such a type cannot
be found, "--disable-integer-datetimes" can be used to switch
back to the previous floating point-based datetime implementation.
variables to it. More need to be converted, but I wanted to get this in
before it conflicts with too much...
Other than just centralising the text-to-int conversion for parameters,
this allows the pg_settings view to contain a list of available options
and allows an error hint to show what values are allowed.
With the addition of multiple autovacuum workers, our choices were to delete
the check, document the interaction with autovacuum_max_workers, or complicate
the check to try to hide that interaction. Since this restriction has never
been adequate to ensure backends can't run out of pinnable buffers, it doesn't
really have enough excuse to live to justify the second or third choices.
Per discussion of a complaint from Andreas Kling (see also bug #3888).
This commit also removes several documentation references to this restriction,
but I'm not sure I got them all.
with the logged event. CSV logs are now a first-class citizen along plain
text logs in that they carry much of the same information.
Per complaint from depesz on bug #3799.
to validate the realm of the connecting user. By default
it's empty meaning no verification, which is the way
Kerberos authentication has traditionally worked in
PostgreSQL.
the same transaction can be identified even when no regular XID was assigned.
This seems essential after addition of the lazy-XID patch. Also some
minor code cleanup in write_csvlog().
- create a separate archive_mode GUC, on which archive_command is dependent
- %r option in recovery.conf sends last restartpoint to recovery command
- %r used in pg_standby, updated README
- minor other code cleanup in pg_standby
- doc on Warm Standby now mentions pg_standby and %r
- log_restartpoints recovery option emits LOG message at each restartpoint
- end of recovery now displays last transaction end time, as requested
by Warren Little; also shown at each restartpoint
- restart archiver if needed to carry away WAL files at shutdown
Simon Riggs
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers. The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.
Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
* stats_start_collector goes away; we always start the collector process,
unless prevented by a problem with setting up the stats UDP socket.
* stats_reset_on_server_start goes away; it seems useless in view of the
availability of pg_stat_reset().
* stats_block_level and stats_row_level are merged into a single variable
"track_counts", which controls all reports sent to the collector process.
* stats_command_string is renamed to track_activities.
* log_autovacuum is renamed to log_autovacuum_min_duration to better reflect
its meaning.
The log_autovacuum change is not a compatibility issue since it didn't exist
before 8.3 anyway. The other changes need to be release-noted.
rows will normally never obtain an XID at all. We already did things this way
for subtransactions, but this patch extends the concept to top-level
transactions. In applications where there are lots of short read-only
transactions, this should improve performance noticeably; not so much from
removal of the actual XID-assignments, as from reduction of overhead that's
driven by the rate of XID consumption. We add a concept of a "virtual
transaction ID" so that active transactions can be uniquely identified even
if they don't have a regular XID. This is a much lighter-weight concept:
uniqueness of VXIDs is only guaranteed over the short term, and no on-disk
record is made about them.
Florian Pflug, with some editorialization by Tom.
displayed in the postmaster log. This avoids Windows-specific problems with
localized time zone names that are in the wrong encoding, and generally seems
like a good idea to forestall other potential platform-dependent issues.
To preserve the existing behavior that all backends will log in the same time
zone, create a new GUC variable log_timezone that can only be changed on a
system-wide basis, and reference log-related calculations to that zone instead
of the TimeZone variable.
This fixes the issue reported by Hiroshi Saito that timestamps printed by
xlog.c startup could be improperly localized on Windows. We still need a
simpler patch for that problem in the back branches, however.
so that we will be able to create a cookie for all processes for CSVlogs.
It is set wherever MyProcPid is set. Take the opportunity to remove the now
unnecessary session-only restriction on the %s and %c escapes in log_line_prefix.
before reporting a transaction committed. Data consistency is still
guaranteed (unlike setting fsync = off), but a crash may lose the effects
of the last few transactions. Patch by Simon, some editorialization by Tom.
and fsync WAL at convenient intervals. For the moment it just tries to
offload this work from backends, but soon it will be responsible for
guaranteeing a maximum delay before asynchronously-committed transactions
will be flushed to disk.
This is a portion of Simon Riggs' async-commit patch, committed to CVS
separately because a background WAL writer seems like it might be a good idea
independently of the async-commit feature. I rebased walwriter.c on
bgwriter.c because it seemed like a more appropriate way of handling signals;
while the startup/shutdown logic in postmaster.c is more like autovac because
we want walwriter to quit before we start the shutdown checkpoint.
I/O utilization, per discussion.
While at it, lower the autovacuum vacuum and analyze threshold values to 50
tuples. It is a bit higher (i.e. more conservative) than what I originally
proposed but much better than the old values for small tables.
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.
Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.
Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
provide visual separation from the rest of the log line; I've been
noticing lately that quite a few newbies fail to figure this out for
themselves. Also a little editorial cleanup of the log_line_prefix
description.
within a signal handler (this might be safe given the relatively narrow code
range in which the interrupt is enabled, but it seems awfully risky); do issue
more informative log messages that tell what is being waited for and the exact
length of the wait; minor other code cleanup. Greg Stark and Tom Lane
for each temp file, rather than once per sort or hashjoin; this allows
spreading the data of a large sort or join across multiple tablespaces.
(I remain dubious that this will make any difference in practice, but certain
people insisted.) Arrange to cache the results of parsing the GUC variable
instead of recomputing from scratch on every demand, and push usage of the
cache down to the bottommost fd.c level.
tablespace(s) in which to store temp tables and temporary files. This is a
list to allow spreading the load across multiple tablespaces (a random list
element is chosen each time a temp object is to be created). Temp files are
not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace
directories.
Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.
"autovacuum = off", the system may still periodically start autovacuum
processes to prevent XID wraparound. Patch from David Fetter, with
editorializing.
This is needed to allow a security-definer function to set a truly secure
value of search_path. Without it, a malicious user can use temporary objects
to execute code with the privileges of the security-definer function. Even
pushing the temp schema to the back of the search path is not quite good
enough, because a function or operator at the back of the path might still
capture control from one nearer the front due to having a more exact datatype
match. Hence, disable searching the temp schema altogether for functions and
operators.
Security: CVE-2007-2138
processes to be running simultaneously. Also, now autovacuum processes do not
count towards the max_connections limit; they are counted separately from
regular processes, and are limited by the new GUC variable
autovacuum_max_workers.
The launcher now has intelligence to launch workers on each database every
autovacuum_naptime seconds, limited only on the max amount of worker slots
available.
Also, the global worker I/O utilization is limited by the vacuum cost-based
delay feature. Workers are "balanced" so that the total I/O consumption does
not exceed the established limit. This part of the patch was contributed by
ITAGAKI Takahiro.
Per discussion.
rules to be defined with different, per session controllable, behaviors
for replication purposes.
This will allow replication systems like Slony-I and, as has been stated
on pgsql-hackers, other products to control the firing mechanism of
triggers and rewrite rules without modifying the system catalog directly.
The firing mechanisms are controlled by a new superuser-only GUC
variable, session_replication_role, together with a change to
pg_trigger.tgenabled and a new column pg_rewrite.ev_enabled. Both
columns are a single char data type now (tgenabled was a bool before).
The possible values in these attributes are:
'O' - Trigger/Rule fires when session_replication_role is "origin"
(default) or "local". This is the default behavior.
'D' - Trigger/Rule is disabled and fires never
'A' - Trigger/Rule fires always regardless of the setting of
session_replication_role
'R' - Trigger/Rule fires when session_replication_role is "replica"
The GUC variable can only be changed as long as the system does not have
any cached query plans. This will prevent changing the session role and
accidentally executing stored procedures or functions that have plans
cached that expand to the wrong query set due to differences in the rule
firing semantics.
The SQL syntax for changing a triggers/rules firing semantics is
ALTER TABLE <tabname> <when> TRIGGER|RULE <name>;
<when> ::= ENABLE | ENABLE ALWAYS | ENABLE REPLICA | DISABLE
psql's \d command as well as pg_dump are extended in a backward
compatible fashion.
Jan
log_min_messages does; and arrange to suppress the duplicative output
that would otherwise result from log_statement and log_duration messages.
Bruce Momjian and Tom Lane.
o read global SSL configuration file
o add GUC "ssl_ciphers" to control allowed ciphers
o add libpq environment variable PGSSLKEY to control SSL hardware keys
Victor B. Wagner
Standard English uses "may", "can", and "might" in different ways:
may - permission, "You may borrow my rake."
can - ability, "I can lift that log."
might - possibility, "It might rain today."
Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice. Similarly, "It may crash" is better stated, "It might crash".
Also update two error messages mentioned in the documenation to match.
- Add new SQL command SET XML OPTION (also available via regular GUC) to
control the DOCUMENT vs. CONTENT option in implicit parsing and
serialization operations.
- Subtle corrections in the handling of the standalone property in
xmlroot().
- Allow xmlroot() to work on content fragments.
- Subtle corrections in the handling of the version property in
xmlconcat().
- Code refactoring for producing XML declarations.
match the postgresql.conf file. Also add units to descriptions that
lacked them. Wording improvements. Mention pg_settings.unit as the way
to find the default units for setting.
Backpatch to 8.2.X.
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
command strings inserts relative not absolute path of file to process.
This is a side-effect of 2005-07-04 change that makes the server use
relative paths in general. Noted by Bernd Helmle.
max_stack_depth is not set to an unsafe value.
This commit also provides configure-time checking for <sys/resource.h>,
and cleans up some perhaps-unportable code associated with use of that
include file and getrlimit().
than being equivalent to setting log_min_duration_statement to zero, this
option now forces logging of all query durations, but doesn't force logging
of query text. Also, add duration logging coverage for fastpath function
calls.
proposal. Parameter logging works even for binary-format parameters, and
logging overhead is avoided when disabled.
log_statement = all output for the src/test/examples/testlibpq3.c example
now looks like
LOG: statement: execute <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: statement: execute <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
and log_min_duration_statement = 0 results in
LOG: duration: 2.431 ms parse <unnamed>: SELECT * FROM test1 WHERE t = $1
LOG: duration: 2.335 ms bind <unnamed> to <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: duration: 0.394 ms execute <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: duration: 1.251 ms parse <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
LOG: duration: 0.566 ms bind <unnamed> to <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
LOG: duration: 0.173 ms execute <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
(This example demonstrates the folly of ignoring parse/bind steps for duration
logging purposes, BTW.)
Along the way, create a less ad-hoc mechanism for determining which commands
are logged by log_statement = mod and log_statement = ddl. The former coding
was actually missing quite a few things that look like ddl to me, and it
did not handle EXECUTE or extended query protocol correctly at all.
This commit does not do anything about the question of whether log_duration
should be removed or made less redundant with log_min_duration_statement.
"server_version" but uses the handy PG_VERSION_NUM which allows apps to
do things like if ($version >= 80200) without having to parse apart the
value of server_version themselves.
Greg Sabino Mullane greg@turnstep.com
optionally bind. I re-added the "statement:" label so people will
understand why the line is being printed (it is log_*statement
behavior).
Use single quotes for bind values, instead of double quotes, and double
literal single quotes in bind values (and document that). I also made
use of the DETAIL line to have much cleaner output.
than N seconds apart. This allows a simple, if not very high performance,
means of guaranteeing that a PITR archive is no more than N seconds behind
real time. Also make pg_current_xlog_location return the WAL Write pointer,
add pg_current_xlog_insert_location to return the Insert pointer, and fix
pg_xlogfile_name_offset to return its results as a two-element record instead
of a smashed-together string, as per recent discussion.
Simon Riggs
such as debugging and performance measurement. This consists of two features:
a table of "rendezvous variables" that allows separately-loaded shared
libraries to communicate, and a new GUC setting "local_preload_libraries"
that allows libraries to be loaded into specific sessions without explicit
cooperation from the client application. To make local_preload_libraries
as flexible as possible, we do not restrict its use to superusers; instead,
it is restricted to load only libraries stored in $libdir/plugins/. The
existing LOAD command has also been modified to allow non-superusers to
LOAD libraries stored in this directory.
This patch also renames the existing GUC variable preload_libraries to
shared_preload_libraries (after a suggestion by Simon Riggs) and does some
code refactoring in dfmgr.c to improve clarity.
Korry Douglas, with a little help from Tom Lane.
loaded libraries: call functions _PG_init() and _PG_fini() if the library
defines such symbols. Hence we no longer need to specify an initialization
function in preload_libraries: we can assume that the library used the
_PG_init() convention, instead. This removes one source of pilot error
in use of preloaded libraries. Original patch by Ralf Engelschall,
preload_libraries changes by me.
o print user name for all
o print portal name if defined for all
o print query for all
o reduce log_statement header to single keyword
o print bind parameters as DETAIL if text mode
configuration files that can be altered by a DBA. The australian_timezones
GUC setting disappears, replaced by a timezone_abbreviations setting (set this
to 'Australia' to get the effect of australian_timezones). The list of zone
names defined by default has undergone a bit of cleanup, too. Documentation
still needs some work --- in particular, should we fix Table B-4, or just get
rid of it? Joachim Wieland, with some editorializing by moi.
current commands; instead, store current-status information in shared
memory. This substantially reduces the overhead of stats_command_string
and also ensures that pg_stat_activity is fully up to date at all times.
Per my recent proposal.
This shouldn't affect simple indexscans much, while for bitmap scans that
are touching a lot of index rows, this seems to bring the estimates more
in line with reality. Per recent discussion.
assumed that a sequential page fetch has cost 1.0. This patch doesn't
in itself change the system's behavior at all, but it opens the door to
people adopting other units of measurement for EXPLAIN costs. Also, if
we ever decide it's worth inventing per-tablespace access cost settings,
this change provides a workable intellectual framework for that.
parser will allow "\'" to be used to represent a literal quote mark. The
"\'" representation has been deprecated for some time in favor of the
SQL-standard representation "''" (two single quote marks), but it has been
used often enough that just disallowing it immediately won't do. Hence
backslash_quote allows the settings "on", "off", and "safe_encoding",
the last meaning to allow "\'" only if client_encoding is a valid server
encoding. That is now the default, and the reason is that in encodings
such as SJIS that allow 0x5c (ASCII backslash) to be the last byte of a
multibyte character, accepting "\'" allows SQL-injection attacks as per
CVE-2006-2314 (further details will be published after release). The
"on" setting is available for backward compatibility, but it must not be
used with clients that are exposed to untrusted input.
Thanks to Akio Ishida and Yasuo Ohgaki for identifying this security issue.
throw warnings for 100%-SQL-standard constructs, clean up some minor
infelicities, try to un-break ecpg to the best of my ability. (It's not clear
how ecpg is going to find out the setting of standard_conforming_strings,
though.) I think pg_dump still needs work, too.
transaction_timestamp() (just like now()).
Also update statement_timeout() to mention it is statement arrival time
that is measured.
Catalog version updated.
compatibility for release 7.2 and earlier. I have not altered any
mentions of release 7.3 or later. The release notes were not modified,
so the changes are still documented, just not in the main docs.
8.0), and add as suggestion to use log_min_error_statement for this
purpose. I also fixed the code so the first EXECUTE has it's prepare,
rather than the last which is what was in the current code. Also remove
"protocol" prefix for SQL EXECUTE output because it is not accurate.
Backpatch to 8.1.X.
... in fact, it will be applied now in any query whatsoever. I'm still
a bit concerned about the cycles that might be expended in failed proof
attempts, but given that CE is turned off by default, it's the user's
choice whether to expend those cycles or not. (Possibly we should
change the simple bool constraint_exclusion parameter to something
more fine-grained?)
suggestion a couple days ago. Fix some cases in which the documentation
neglected to mention any restriction on when a parameter can be set.
Try to be consistent about calling parameters parameters; use the term
option only for command-line switches.
be consistent about whether it's called a daemon or a subprocess, and
don't describe the autovacuum setting in exactly the same way as the
stats_start_collector setting, because that leaves people thinking (if
they aren't paying close attention) that autovacuum can't be changed
on the fly.