The code for matching clients to pg_hba.conf lines that specify host names
(instead of IP address ranges) failed to complain if reverse DNS lookup
failed; instead it silently didn't match, so that you might end up getting
a surprising "no pg_hba.conf entry for ..." error, as seen in bug #9518
from Mike Blackwell. Since we don't want to make this a fatal error in
situations where pg_hba.conf contains a mixture of host names and IP
addresses (clients matching one of the numeric entries should not have to
have rDNS data), remember the lookup failure and mention it as DETAIL if
we get to "no pg_hba.conf entry". Apply the same approach to forward-DNS
lookup failures, too, rather than treating them as immediate hard errors.
Along the way, fix a couple of bugs that prevented us from detecting an
rDNS lookup error reliably, and make sure that we make only one rDNS lookup
attempt; formerly, if the lookup attempt failed, the code would try again
for each host name entry in pg_hba.conf. Since more or less the whole
point of this design is to ensure there's only one lookup attempt not one
per entry, the latter point represents a performance bug that seems
sufficient justification for back-patching.
Also, adjust src/port/getaddrinfo.c so that it plays as well as it can
with this code. Which is not all that well, since it does not have actual
support for rDNS lookup, but at least it should return the expected (and
required by spec) error codes so that the main code correctly perceives the
lack of functionality as a lookup failure. It's unlikely that PG is still
being used in production on any machines that require our getaddrinfo.c,
so I'm not excited about working harder than this.
To keep the code in the various branches similar, this includes
back-patching commits c424d0d105 and
1997f34db4 into 9.2 and earlier.
Back-patch to 9.1 where the facility for hostnames in pg_hba.conf was
introduced.
Initialization of this field was not being done according to the
st_changecount protocol (it has to be done within the changecount increment
range, not outside). And the test to see if the value should be reported
as null was wrong. Noted while perusing uses of Port.remote_hostname.
This was wrong from the introduction of this code (commit 4a25bc145),
so back-patch to 9.1.
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated and it needs to be marked dirty. The codepath for an insertion got
this right, but the case where the internal node is split because of
inserting the new downlink missed that.
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated.
Otherwise, the compiler might decide to move modifications to data
within this structure outside the enclosing SpinLockAcquire /
SpinLockRelease pair, leading to shared memory corruption.
This may or may not explain a recent lmgr-related buildfarm failure
on prairiedog, but it needs to be fixed either way.
Previously, such buffers weren't counted, with the possible result that
EXPLAIN (BUFFERS) and pg_stat_statements would understate the true
number of blocks dirtied by an SQL statement.
Back-patch to 9.2, where this counter was introduced.
Amit Kapila
Inserting (in retail) into the new 9.4 format GIN posting tree created much
larger WAL records than in 9.3. The previous strategy to WAL logging was
basically to log the whole page on each change, with the exception of
completely unmodified segments up to the first modified one. That was not
too bad when appending to the end of the page, as only the last segment had
to be WAL-logged, but per Fujii Masao's testing, even that produced 2x the
WAL volume that 9.3 did.
The new strategy is to keep track of changes to the posting lists in a more
fine-grained fashion, and also make the repacking" code smarter to avoid
decoding and re-encoding segments unnecessarily.
The original coding of EquivalenceClasses didn't foresee that appendrel
child relations might themselves be appendrels; but this is possible for
example when a UNION ALL subquery scans a table with inheritance children.
The oversight led to failure to optimize ordering-related issues very well
for the grandchild tables. After some false starts involving explicitly
flattening the appendrel representation, we found that this could be fixed
easily by removing a few implicit assumptions about appendrel parent rels
not being children themselves.
Kyotaro Horiguchi and Tom Lane, reviewed by Noah Misch
Commit 613c6d26bd sloppily replaced a
lookup of the UID obtained from getpeereid() with a lookup of the
server's own user name, thus totally destroying peer authentication.
Revert. Per report from Christoph Berg.
In passing, make sure get_user_name() zeroes *errstr on success on
Windows as well as non-Windows. I don't think any callers actually
depend on this ATM, but we should be consistent across platforms.
If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to
only pass the first XLogRecData entry to the rm_desc routine. I think the
original assumprion was that the first XLogRecData entry contains all the
necessary information for the rm_desc routine, but that's a pretty shaky
assumption. At least standby_redo didn't get the memo.
To fix, piece together all the data in a temporary buffer, and pass that to
the rm_desc routine.
It's been like this forever, but the patch didn't apply cleanly to
back-branches. Probably wouldn't be hard to fix the conflicts, but it's
not worth the trouble.
Set function parameter names and defaults. Add jsonb versions (which the
code already provided for so the actual new code is trivial). Add jsonb
regression tests and docs.
Bump catalog version (which I apparently forgot to do when jsonb was
committed).
Assert errors were thrown for functions being passed invalid encodings,
while the main code handled it just fine.
Also document that libpq's PQclientEncoding() returns -1 for an encoding
lookup failure.
Per report from Peter Geoghegan
The new format accepts exactly the same data as the json type. However, it is
stored in a format that does not require reparsing the orgiginal text in order
to process it, making it much more suitable for indexing and other operations.
Insignificant whitespace is discarded, and the order of object keys is not
preserved. Neither are duplicate object keys kept - the later value for a given
key is the only one stored.
The new type has all the functions and operators that the json type has,
with the exception of the json generation functions (to_json, json_agg etc.)
and with identical semantics. In addition, there are operator classes for
hash and btree indexing, and two classes for GIN indexing, that have no
equivalent in the json type.
This feature grew out of previous work by Oleg Bartunov and Teodor Sigaev, which
was intended to provide similar facilities to a nested hstore type, but which
in the end proved to have some significant compatibility issues.
Authors: Oleg Bartunov, Teodor Sigaev, Peter Geoghegan and Andrew Dunstan.
Review: Andres Freund
This covers all the SQL-standard trigger types supported for regular
tables; it does not cover constraint triggers. The approach for
acquiring the old row mirrors that for view INSTEAD OF triggers. For
AFTER ROW triggers, we spool the foreign tuples to a tuplestore.
This changes the FDW API contract; when deciding which columns to
populate in the slot returned from data modification callbacks, writable
FDWs will need to check for AFTER ROW triggers in addition to checking
for a RETURNING clause.
In support of the feature addition, refactor the TriggerFlags bits and
the assembly of old tuples in ModifyTable.
Ronan Dunklau, reviewed by KaiGai Kohei; some additional hacking by me.
equalTupleDescs() neglected both of these ConstrCheck fields, and
CreateTupleDescCopyConstr() neglected ccnoinherit. At this time, the
only known behavior defect resulting from these omissions is constraint
exclusion disregarding a CHECK constraint validated by an ALTER TABLE
VALIDATE CONSTRAINT statement issued earlier in the same transaction.
Back-patch to 9.2, where these fields were introduced.
Also fix the name of the dtrace probe for LWLockAcquireOrWait(). The
function was renamed from LWLockWaitUntilFree to LWLockAqcuireOrWait, but
the dtrace probe was neglected.
Pointed out by Andres Freund and the buildfarm.
Clear errno before calling readdir() and handle old MinGW errno bug
while adding full test coverage for readdir/closedir failures.
Backpatch through 8.4.
The special feature the XLogInsert slots had over regular LWLocks is the
insertingAt value that was updated atomically with releasing backends
waiting on it. Add new functions to the LWLock API to do that, and replace
the slots with LWLocks. This reduces the amount of duplicated code.
(There's still some duplication, but at least it's all in lwlock.c now.)
Reviewed by Andres Freund.
The previous method was overly complex and underly correct; in particular,
by assigning the default value with PGC_S_OVERRIDE, it prevented later
attempts to change the setting in postgresql.conf, as noted by Jeff Janes.
We should just assign the default value with source PGC_S_DYNAMIC_DEFAULT,
which will have the desired priority relative to the boot_val as well as
user-set values.
There is still a gap in this method: if there's an explicit assignment of
effective_cache_size = -1 in the postgresql.conf file, and that assignment
appears before shared_buffers is assigned, the code will substitute 4 times
the bootstrap default for shared_buffers, and that value will then persist
(since it will have source PGC_S_FILE). I don't see any very nice way
to avoid that though, and it's not a case to be expected in practice.
The existing comments in guc-file.l look forward to a redesign of the
DYNAMIC_DEFAULT mechanism; if that ever happens, we should consider this
case as one of the things we'd like to improve.
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple. For example:
LOG: process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL: Process holding the lock: 11366. Wait queue: 11367.
CONTEXT: while updating tuple (0,2) in relation "foo"
STATEMENT: UPDATE foo SET value = 3;
Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.
Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
For a regex containing backrefs, pg_regexec() might fail to free all the
sub-DFAs that were created during execution, resulting in a permanent
(session lifespan) memory leak. Problem was introduced by me in commit
587359479a. Per report from Sandro Santilli;
diagnosis by Greg Stark.
It is no longer used, none of the resource managers have multi-record
actions that would make it unsafe to perform a restartpoint.
Also don't allow rm_cleanup to write WAL records, it's also no longer
required. Move the call to rm_cleanup routines to make it more symmetric
with rm_startup.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.
Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.
I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.
Reviewed by Peter Geoghegan.
One path through the loop over indexes forgot to do index_close(). Rather
than adding a fourth call, restructure slightly so that there's only one.
In passing, get rid of an unnecessary syscache lookup: the pg_index struct
for the index is already available from its relcache entry.
Per report from YAMAMOTO Takashi, though this is a bit different from his
suggested patch. This is new code in HEAD, so no need for back-patch.
Revise the original decision to expose a uint64-based interface and
use Size everywhere possible. Avoid assuming that MAXIMUM_ALIGNOF is
8, or making any assumption about the relationship between that value
and sizeof(Size). If MAXIMUM_ALIGNOF is bigger, we'll now insert
padding after the length word; if it's smaller, we are now prepared
to read and write the length word in chunks.
Per discussion with Tom Lane.
The new function dsm_detach_all() can be used either by postmaster
children that don't wish to take any risk of accidentally corrupting
shared memory; or by forked children of regular backends with
the same need. This patch also updates the postmaster children that
already do PGSharedMemoryDetach() to do dsm_detach_all() as well.
Per discussion with Tom Lane.
The recently-fixed bug in WAL replay could result in not finding a parent
tuple for a heap-only tuple. The existing code would either Assert or
generate an invalid index entry, neither of which is desirable. Throw a
regular error instead.
On clean shutdown, walsender waits for all WAL to be replicated to a standby,
and exits. It determined whether that replication had been completed by
checking whether its sent location had been equal to a standby's flush
location. Unfortunately this condition never becomes true when the standby
such as pg_receivexlog which always returns an invalid flush location is
connecting to walsender, and then walsender waits forever.
This commit changes walsender so that it just checks a standby's write
location if a flush location is invalid.
Back-patch to 9.1 where enough infrastructure for this exists.
krb_srvname is actually not available anymore as a parameter server-side, since
with gssapi we accept all principals in our keytab. It's still used in libpq for
client side specification.
In passing remove declaration of krb_server_hostname, where all the functionality
was already removed.
Noted by Stephen Frost, though a different solution than his suggestion