InitXLogInsert() cannot be called in a critical section, because it
allocates memory. But CreateCheckPoint() did that, when called for the
end-of-recovery checkpoint by the startup process.
In the passing, fix the scratch space allocation in InitXLogInsert to go to
the right memory context. Also update the comment at InitXLOGAccess, which
hasn't been totally accurate since hot standby was introduced (in a hot
standby backend, InitXLOGAccess isn't called at backend startup).
Reported by Michael Paquier
As pointed out by Robert, we should really have named pg_rowsecurity
pg_policy, as the objects stored in that catalog are policies. This
patch fixes that and updates the column names to start with 'pol' to
match the new catalog name.
The security consideration for COPY with row level security, also
pointed out by Robert, has also been addressed by remembering and
re-checking the OID of the relation initially referenced during COPY
processing, to make sure it hasn't changed under us by the time we
finish planning out the query which has been built.
Robert and Alvaro also commented on missing OCLASS and OBJECT entries
for POLICY (formerly ROWSECURITY or POLICY, depending) in various
places. This patch fixes that too, which also happens to add the
ability to COMMENT on policies.
In passing, attempt to improve the consistency of messages, comments,
and documentation as well. This removes various incarnations of
'row-security', 'row-level security', 'Row-security', etc, in favor
of 'policy', 'row level security' or 'row_security' as appropriate.
Happy Thanksgiving!
In passing, add an Assert defending the presumption that bytes_left
is positive to start with. (I'm not exactly convinced that using an
unsigned type was such a bright thing here, but let's at least do
this much.)
These cases formerly failed with errors about "could not find array type
for data type". Now they yield arrays of the same element type and one
higher dimension.
The implementation involves creating functions with API similar to the
existing accumArrayResult() family. I (tgl) also extended the base family
by adding an initArrayResult() function, which allows callers to avoid
special-casing the zero-inputs case if they just want an empty array as
result. (Not all do, so the previous calling convention remains valid.)
This allowed simplifying some existing code in xml.c and plperl.c.
Ali Akbar, reviewed by Pavel Stehule, significantly modified by me
Per discussion with Tom and Andrew, 64bit integers are no longer a
problem for the catalogs, so go ahead and add the mapping from the C
int64 type to the int8 SQL identification to allow using them.
Patch by Adam Brightwell
The old method of appending options to the connection string didn't work if
the primary_conninfo was a postgres:// style URI, instead of a traditional
connection string. Use PQconnectdbParams instead.
Alex Shulgin
Code that check the flag no longer need #ifdef's, which is more convenient.
In particular, makes it easier to write extensions that depend on it.
In the passing, modify sslinfo's ssl_is_used function to check ssl_in_use
instead of the OpenSSL specific 'ssl' pointer. It doesn't make any
difference currently, as sslinfo is only compiled when built with OpenSSL,
but seems cleaner anyway.
This gives an overview of what Lehman & Yao's paper is all about, so that
you can understand the rest of the README without having to read the paper.
Per discussion with Peter Geoghegan and others.
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images
generated for hint bit updates, when checksums are enabled. The new record
type is replayed exactly the same as XLOG_FPI, but allows them to be tallied
separately e.g. in pg_xlogdump.
The locution "EXISTS(SELECT ... LIMIT 1)" seems to be rather common among
people who don't realize that the database already performs optimizations
equivalent to putting LIMIT 1 in the sub-select. Unfortunately, this was
actually making things worse, because it prevented us from optimizing such
EXISTS clauses into semi or anti joins. Teach simplify_EXISTS_query() to
suppress constant-positive LIMIT clauses. That fixes the semi/anti-join
case, and may help marginally even for cases that have to be left as
sub-SELECTs.
Marti Raudsepp, reviewed by David Rowley
postgres_fdw would send query conditions involving system columns to the
remote server, even though it makes no effort to ensure that system
columns other than CTID match what the remote side thinks. tableoid,
in particular, probably won't match and might have some use in queries.
Hence, prevent sending conditions that include non-CTID system columns.
Also, create_foreignscan_plan neglected to check local restriction
conditions while determining whether to set fsSystemCol for a foreign
scan plan node. This again would bollix the results for queries that
test a foreign table's tableoid.
Back-patch the first fix to 9.3 where postgres_fdw was introduced.
Back-patch the second to 9.2. The code is probably broken in 9.1 as
well, but the patch doesn't apply cleanly there; given the weak state
of support for FDWs in 9.1, it doesn't seem worth fixing.
Etsuro Fujita, reviewed by Ashutosh Bapat, and somewhat modified by me
Make it work more like FDW plans do: instead of assuming that there are
expressions in a CustomScan plan node that the core code doesn't know
about, insist that all subexpressions that need planner attention be in
a "custom_exprs" list in the Plan representation. (Of course, the
custom plugin can break the list apart again at executor initialization.)
This lets us revert the parts of the patch that exposed setrefs.c and
subselect.c processing to the outside world.
Also revert the GetSpecialCustomVar stuff in ruleutils.c; that concept
may work in future, but it's far from fully baked right now.
Instead of register_custom_path_provider and a CreateCustomScanPath
callback, let's just provide a standard function hook in set_rel_pathlist.
This is more flexible than what was previously committed, is more like the
usual conventions for planner hooks, and requires less support code in the
core. We had discussed this design (including centralizing the
set_cheapest() calls) back in March or so, so I'm not sure why it wasn't
done like this already.
There seems no prospect that any of this will ever be useful, and indeed
it's questionable whether some of it would work if it ever got called;
it's certainly not been exercised in a very long time, if ever. So let's
get rid of it, and make the comments about mark/restore in execAmi.c less
wishy-washy.
The mark/restore support for Result nodes is also currently dead code,
but that's due to planner limitations not because it's impossible that
it could be useful. So I left it in.
Get rid of the pernicious entanglement between planner and executor headers
introduced by commit 0b03e5951b.
Also, rearrange the CustomFoo struct/typedef definitions so that all the
typedef names are seen as used by the compiler. Without this pgindent
will mess things up a bit, which is not so important perhaps, but it also
removes a bizarre discrepancy between the declaration arrangement used for
CustomExecMethods and that used for CustomScanMethods and
CustomPathMethods.
Clean up the commentary around ExecSupportsMarkRestore to reflect the
rather large change in its API.
Const-ify register_custom_path_provider's argument. This necessitates
casting away const in the function, but that seems better than forcing
callers of the function to do so (or else not const-ify their method
pointer structs, which was sort of the whole point).
De-export fix_expr_common. I don't like the exporting of fix_scan_expr
or replace_nestloop_params either, but this one surely has got little
excuse.
execCurrent.c's search_plan_tree() must recognize a CustomScan on the
target relation. This would only be helpful for custom providers that
support CurrentOfExpr quals, which is probably a bit far-fetched, but
it's not impossible I think. But even without assuming that, we need
to recognize a scanned-relation match so that we will properly throw
error if the desired relation is being scanned with both a CustomScan
and a regular scan (ie, self-join).
Also recognize ForeignScanState for similar reasons. Supporting WHERE
CURRENT OF on a foreign table is probably even more far-fetched than
it is for custom scans, but I think in principle you could do it with
postgres_fdw (or another FDW that supports the ctid column). This
would be a back-patchable bug fix if existing FDWs handled CurrentOfExpr,
but I doubt any do so I won't bother back-patching.
It's a false positive - the variable is only used when 'onleft' is true,
and it is initialized in that case. But the compiler doesn't necessarily
see that.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
For <, <=, > and >= strategies, mark the first scan key
as already matched if scanning in an appropriate direction.
If index tuple contains no nulls we can skip the first
re-check for each tuple.
Author: Rajeev Rastogi
Reviewer: Haribabu Kommi
Rework of the code and comments by Simon Riggs
There was some confusion on how to record the case that the operation
unlinks the last non-leaf page in the branch being deleted.
_bt_unlink_halfdead_page set the "topdead" field in the WAL record to
the leaf page, but the redo routine assumed that it would be an invalid
block number in that case. This commit fixes _bt_unlink_halfdead_page to
do what the redo routine expected.
This code is new in 9.4, so backpatch there.
Buildfarm members with CLOBBER_CACHE_ALWAYS advised us that commit
85b506bbfc was mistaken in setting the relpersistence value of the
index directly in the relcache entry, within reindex_index. The reason
for the failure is that an invalidation message that comes after mucking
with the relcache entry directly, but before writing it to the catalogs,
would cause the entry to become rebuilt in place from catalogs with the
old contents, losing the update.
Fix by passing the correct persistence value to
RelationSetNewRelfilenode instead; this routine also writes the updated
tuple to pg_class, avoiding the problem. Suggested by Tom Lane.
When checking a table that has an inheritance tree marked,
if no child tables remain, we skip ANALYZE. This patch emits
a message to show that the action has been skipped.
Author: Etsuro Fujita
Reviewer: Furuya Osamu
This removes ATChangeIndexesPersistence() introduced by f41872d0c1
which was too ugly to live for long. Instead, the correct persistence
marking is passed all the way down to reindex_index, so that the
transient relation built to contain the index relfilenode can
get marked correctly right from the start.
Author: Fabrízio de Royes Mello
Review and editorialization by Michael Paquier
and Álvaro Herrera
Unlogged relations are only reset when performing a unclean
restart. That means they have to be synced to disk during clean
shutdowns. During normal processing that's achieved by registering a
buffer's file to be fsynced at the next checkpoint when flushed. But
ResetUnloggedRelations() doesn't go through the buffer manager, so
nothing will force reset relations to disk before the next shutdown
checkpoint.
So just make ResetUnloggedRelations() fsync the newly created main
forks to disk.
Discussion: 20140912112246.GA4984@alap3.anarazel.de
Backpatch to 9.1 where unlogged tables were introduced.
Abhijit Menon-Sen and Andres Freund
Unlogged relations are reset at the end of crash recovery as they're
only synced to disk during a proper shutdown. Unfortunately that and
later steps can fail, e.g. due to running out of space. This reset
was, up to now performed after marking the database as having finished
crash recovery successfully. As out of space errors trigger a crash
restart that could lead to the situation that not all unlogged
relations are reset.
Once that happend usage of unlogged relations could yield errors like
"could not open file "...": No such file or directory". Luckily
clusters that show the problem can be fixed by performing a immediate
shutdown, and starting the database again.
To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT)
earlier, before marking the database as having successfully recovered.
Discussion: 20140912112246.GA4984@alap3.anarazel.de
Backpatch to 9.1 where unlogged tables were introduced.
Abhijit Menon-Sen and Andres Freund
The initial patch for RLS mistakenly included headers associated with
the executor and planner bits in rewrite/rowsecurity.h. Per policy and
general good sense, executor headers should not be included in planner
headers or vice versa.
The include of execnodes.h was a mistaken holdover from previous
versions, while the include of relation.h was used for Relation's
definition, which should have been coming from utils/relcache.h. This
patch cleans these issues up, adds comments to the RowSecurityPolicy
struct and the RowSecurityConfigType enum, and changes Relation->rsdesc
to Relation->rd_rsdesc to follow Relation field naming convention.
Additionally, utils/rel.h was including rewrite/rowsecurity.h, which
wasn't a great idea since that was pulling in things not really needed
in utils/rel.h (which gets included in quite a few places). Instead,
use 'struct RowSecurityDesc' for the rd_rsdesc field and add comments
explaining why.
Lastly, add an include into access/nbtree/nbtsort.c for
utils/sortsupport.h, which was evidently missed due to the above mess.
Pointed out by Tom in 16970.1415838651@sss.pgh.pa.us; note that the
concerns regarding a similar situation in the custom-path commit still
need to be addressed.
This function has a loop which can lead to uninterruptible process
"stalls" (actually infinite loops) when some bugs are triggered. Avoid
that unpleasant situation by adding a check for interrupts in a place
that shouldn't degrade performance in the normal case.
Backpatch to 9.3. Older branches have an identical loop here, but the
aforementioned bugs are only a problem starting in 9.3 so there doesn't
seem to be any point in backpatching any further.
In some workloads BufferGetBlockNumber() shows up in profiles due to
the sheer number of calls to it (and because it causes cache
misses). The compiler can't move it out of the loop because it's a
full extern function call...
There are basically three situations in which logical decoding needs
to perform cache invalidation. During/After replaying a transaction
with catalog changes, when skipping a uninteresting transaction that
performed catalog changes and when erroring out while replaying a
transaction. Unfortunately these three cases were all done slightly
differently - partially because 8de3e410fa, which greatly simplifies
matters, got committed in the midst of the development of logical
decoding.
The actually problematic case was when logical decoding skipped
transaction commits (and thus processed invalidations). When used via
the SQL interface cache invalidation could access the catalog - bad,
because we didn't set up enough state to allow that correctly. It'd
not be hard to setup sufficient state, but the simpler solution is to
always perform cache invalidation outside a valid transaction.
Also make the different cache invalidation cases look as similar as
possible, to ease code review.
This fixes the assertion failure reported by Antonin Houska in
53EE02D9.7040702@gmail.com. The presented testcase has been expanded
into a regression test.
Backpatch to 9.4, where logical decoding was introduced.
When building the initial historic catalog snapshot there were
scenarios where snapbuild.c would use incorrect xmin/xmax values when
starting from a xl_running_xacts record. The values used were always a
bit suspect, but happened to be correct in the easy to test
cases. Notably the values used when the the initial snapshot was
computed while no other transactions were running were correct.
This is likely to be the cause of the occasional buildfarm failures on
animals markhor and tick; but it's quite possible to reproduce
problems without CLOBBER_CACHE_ALWAYS.
Backpatch to 9.4, where logical decoding was introduced.
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.
To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.
In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.
Backpatch to all supported versions; this has been racy since hot standby
was introduced.
The hope is that we can use this to produce better diagnostics in
some cases.
Peter Geoghegan, reviewed by Michael Paquier, with some further
changes by me.
This only happens if a client issues a Parse message with an empty query
string, which is a bit odd; but since it is explicitly called out as legal
by our FE/BE protocol spec, we'd probably better continue to allow it.
Fix by adding tests everywhere that the raw_parse_tree field is passed to
functions that don't or shouldn't accept NULL. Also make it clear in the
relevant comments that NULL is an expected case.
This reverts commits a73c9dbab0 and
2e9650cbcf, which fixed specific crash
symptoms by hacking things at what now seems to be the wrong end, ie the
callee functions. Making the callees allow NULL is superficially more
robust, but it's not always true that there is a defensible thing for the
callee to do in such cases. The caller has more context and is better
able to decide what the empty-query case ought to do.
Per followup discussion of bug #11335. Back-patch to 9.2. The code
before that is sufficiently different that it would require development
of a separate patch, which doesn't seem worthwhile for what is believed
to be an essentially cosmetic change.
Heikki noticed in 544E23C0.8090605@vmware.com that slot.c and
snapbuild.c were missing the FIN_CRC32 call when computing/checking
checksums of on disk files. That doesn't lower the the error detection
capabilities of the checksum, but is inconsistent with other usages.
In a followup mail Heikki also noticed that, contrary to a comment,
the 'version' and 'length' struct fields of replication slot's on disk
data where not covered by the checksum. That's not likely to lead to
actually missed corruption as those fields are cross checked with the
expected version and the actual file length. But it's wrong
nonetheless.
As fixing these issues makes existing on disk files unreadable, bump
the expected versions of on disk files for both slots and logical
decoding historic catalog snapshots. This means that loading old
files will fail with
ERROR: "replication slot file ... has unsupported version 1"
and
ERROR: "snapbuild state file ... has unsupported version 1 instead of
2" respectively. Given the low likelihood of anybody already using
these new features in a production setup that seems acceptable.
Fixing these issues made me notice that there's no regression test
covering the loading of historic snapshot from disk - so add one.
Backpatch to 9.4 where these features were introduced.
On Windows, DROP TABLESPACE has a race condition when run concurrently
with other processes having opened files in the tablespace. This led to
a rare failure on buildfarm member frogmouth. Back-patch to 9.4, where
the reconnection was introduced.
When the recursive search in dependency.c visits a column and then later
visits the whole table containing the column, it needs to propagate the
drop-context flags for the table to the existing target-object entry for
the column. Otherwise we might refuse the DROP (if not CASCADE) on the
incorrect grounds that there was no automatic drop pathway to the column.
Remarkably, this has not been reported before, though it's possible at
least when an extension creates both a datatype and a table using that
datatype.
Rather than just marking the column as allowed to be dropped, it might
seem good to skip the DROP COLUMN step altogether, since the later DROP
of the table will surely get the job done. The problem with that is that
the datatype would then be dropped before the table (since the whole
situation occurred because we visited the datatype, and then recursed to
the dependent column, before visiting the table). That seems pretty risky,
and the case is rare enough that it doesn't seem worth expending a lot of
effort or risk to make the drops happen in a safe order. So we just play
dumb and delete the column separately according to the existing drop
ordering rules.
Per report from Petr Jelinek, though this is different from his proposed
patch.
Back-patch to 9.1, where extensions were introduced. There's currently
no evidence that such cases can arise before 9.1, and in any case we would
also need to back-patch cb5c2ba2d8 to 9.0
if we wanted to back-patch this.
Previously the maximum size of GIN pending list was controlled only by
work_mem. But the reasonable value of work_mem and the reasonable size
of the list are basically not the same, so it was not appropriate to
control both of them by only one GUC, i.e., work_mem. This commit
separates new GUC, pending_list_cleanup_size, from work_mem to allow
users to control only the size of the list.
Also this commit adds pending_list_cleanup_size as new storage parameter
to allow users to specify the size of the list per index. This is useful,
for example, when users want to increase the size of the list only for
the GIN index which can be updated heavily, and decrease it otherwise.
Reviewed by Etsuro Fujita.
The code that generates the BRIN_XLOG_UPDATE removes the buffer
reference when the page that's target for the updated tuple is freshly
initialized. This is a pretty usual optimization, but was breaking the
case where the revmap buffer, which is referenced in the same WAL
record, is getting a backup block: the replay code was using backup
block index 1, which is not valid when the update target buffer gets
pruned; the revmap buffer gets assigned 0 instead. Make sure to use the
correct backup block index for revmap when replaying.
Bug reported by Fujii Masao.
At one time it wasn't terribly important what column names were associated
with the fields of a composite Datum, but since the introduction of
operations like row_to_json(), it's important that looking up the rowtype
ID embedded in the Datum returns the column names that users would expect.
That did not work terribly well before this patch: you could get the column
names of the underlying table, or column aliases from any level of the
query, depending on minor details of the plan tree. You could even get
totally empty field names, which is disastrous for cases like row_to_json().
To fix this for whole-row Vars, look to the RTE referenced by the Var, and
make sure its column aliases are applied to the rowtype associated with
the result Datums. This is a tad scary because we might have to return
a transient RECORD type even though the Var is declared as having some
named rowtype. In principle it should be all right because the record
type will still be physically compatible with the named rowtype; but
I had to weaken one Assert in ExecEvalConvertRowtype, and there might be
third-party code containing similar assumptions.
Similarly, RowExprs have to be willing to override the column names coming
from a named composite result type and produce a RECORD when the column
aliases visible at the site of the RowExpr differ from the underlying
table's column names.
In passing, revert the decision made in commit 398f70ec07 to add
an alias-list argument to ExecTypeFromExprList: better to provide that
functionality in a separate function. This also reverts most of the code
changes in d685814835, which we don't need because we're no longer
depending on the tupdesc found in the child plan node's result slot to be
blessed.
Back-patch to 9.4, but not earlier, since this solution changes the results
in some cases that users might not have realized were buggy. We'll apply a
more restricted form of this patch in older branches.
Besides a couple of typo fixes, per David Rowley, Thom Brown, and Amit
Langote, and mentions of BRIN in the general CREATE INDEX page again per
David, this includes silencing MSVC compiler warnings (thanks Microsoft)
and an additional variable initialization per Coverity scanner.
Reported by David Rowley: variadic macros are a problem. Get rid of
them using a trick suggested by Tom Lane: add extra parentheses where
needed. In the future we might decide we don't need the calls at all
and remove them, but it seems appropriate to keep them while this code
is still new.
Also from David Rowley: brininsert() was trying to use a variable before
initializing it. Fix by moving the brin_form_tuple call (which
initializes the variable) to within the locked section.
Reported by Peter Eisentraut: can't use "new" as a struct member name,
because C++ compilers will choke on it, as reported by cpluspluscheck.
This allows extension modules to define their own methods for
scanning a relation, and get the core code to use them. It's
unclear as yet how much use this capability will find, but we
won't find out if we never commit it.
KaiGai Kohei, reviewed at various times and in various levels
of detail by Shigeru Hanada, Tom Lane, Andres Freund, Álvaro
Herrera, and myself.
Now that the backup blocks are appended to the WAL record in xloginsert.c,
XLogInsert doesn't see them anymore and cannot remove them from the version
reconstructed for xlog_outdesc. This makes running with wal_debug=on more
expensive, as we now make (unnecessary) temporary copies of the backup
blocks, but it doesn't seem worth convoluting the code to keep that
optimization.
Reported by Alvaro Herrera.
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
The code that generated a record to clear the F_TUPLES_DELETED flag hasn't
existed since we got rid of old-style VACUUM FULL. I kept the code that sets
the flag, although it's not used for anything anymore, because it might
still be interesting information for debugging purposes that some tuples
have been deleted from a page.
Likewise, the code to turn the root page from non-leaf to leaf page was
removed when we got rid of old-style VACUUM FULL. Remove the code to replay
that action, too.
dict_thesaurus stored phrase IDs in uint16 fields, so it would get confused
and even crash if there were more than 64K entries in the configuration
file. It turns out to be basically free to widen the phrase IDs to uint32,
so let's just do so.
This was complained of some time ago by David Boutin (in bug #7793);
he later submitted an informal patch but it was never acted on.
We now have another complaint (bug #11901 from Luc Ouellette) so it's
time to make something happen.
This is basically Boutin's patch, but for future-proofing I also added a
defense against too many words per phrase. Note that we don't need any
explicit defense against overflow of the uint32 counters, since before that
happens we'd hit array allocation sizes that repalloc rejects.
Back-patch to all supported branches because of the crash risk.
The default JSONB GIN opclass (jsonb_ops) converts numeric data values
to strings for storage in the index. It must ensure that numeric values
that would compare equal (such as 12 and 12.00) produce identical strings,
else index searches would have behavior different from regular JSONB
comparisons. Unfortunately the function charged with doing this was
completely wrong: it could reduce distinct numeric values to the same
string, or reduce equivalent numeric values to different strings. The
former type of error would only lead to search inefficiency, but the
latter type of error would cause index entries that should be found by
a search to not be found.
Repairing this bug therefore means that it will be necessary for 9.4 beta
testers to reindex GIN jsonb_ops indexes, if they care about getting
correct results from index searches involving numeric data values within
the comparison JSONB object.
Per report from Thomas Fanghaenel.
Previously .ready file was created for the timeline history file at the end
of an archive recovery even when WAL archiving was not enabled.
This creation is unnecessary and causes .ready file to remain infinitely.
This commit changes an archive recovery so that it creates .ready file for
the timeline history file only when WAL archiving is enabled.
Backpatch to all supported versions.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.
Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.
Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
Long ago we briefly had an "autocommit" GUC that turned server-side
autocommit on and off. That behavior was removed in 7.4 after concluding
that it broke far too much client-side logic, and making clients cope with
both behaviors was impractical. But the GUC variable was left behind, so
as not to break any client code that might be trying to read its value.
Enough time has now passed that we should remove the GUC completely.
Whatever vestigial backwards-compatibility benefit it had is outweighed by
the risk of confusion for newbies who assume it ought to do something,
as per a recent complaint from Wolfgang Wilhelm.
In passing, adjust what seemed to me a rather confusing documentation
reference to libpq's autocommit behavior. libpq as such knows nothing
about autocommit, so psql is probably what was meant.
Obviously, every translation unit should not be declaring this
separately. It needs to be PGDLLIMPORT as well, to avoid breaking
third-party code that uses any of the functions that the commit
mentioned above changed to macros.
This is a followup to commit 43ac12c6e6,
which added regression tests checking that I/O functions of built-in
types are not marked volatile. Complaining in CREATE TYPE should push
developers of add-on types to fix any misdeclared functions in their
types. It's just a warning not an error, to avoid creating upgrade
problems for what might be just cosmetic mis-markings.
Aside from adding the warning code, fix a number of types that were
sloppily created in the regression tests.
The previous coding assumed that we could just let buffers for the
database's old tablespace age out of the buffer arena naturally.
The folly of that is exposed by bug #11867 from Marc Munro: the user could
later move the database back to its original tablespace, after which any
still-surviving buffers would match lookups again and appear to contain
valid data. But they'd be missing any changes applied while the database
was in the new tablespace.
This has been broken since ALTER SET TABLESPACE was introduced, so
back-patch to all supported branches.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
A background worker can use pq_redirect_to_shm_mq() to direct protocol
that would normally be sent to the frontend to a shm_mq so that another
process may read them.
The receiving process may use pq_parse_errornotice() to parse an
ErrorResponse or NoticeResponse from the background worker and, if
it wishes, ThrowErrorData() to propagate the error (with or without
further modification).
Patch by me. Review by Andres Freund.
This reassociates a dynamic shared memory handle previous passed to
dsm_pin_mapping with the current resource owner, so that it will be
cleaned up at the end of the current query.
Patch by me. Review of the function name by Andres Freund, Amit
Kapila, Jim Nasby, Petr Jelinek, and Álvaro Herrera.
As noted by Noah Misch, my initial cut at fixing bug #11638 didn't cover
all cases where ANALYZE might be invoked in an unsafe context. We need to
test the result of IsInTransactionChain not IsTransactionBlock; which is
notationally a pain because IsInTransactionChain requires an isTopLevel
flag, which would have to be passed down through several levels of callers.
I chose to pass in_outer_xact (ie, the result of IsInTransactionChain)
rather than isTopLevel per se, as that seemed marginally more apropos
for the intermediate functions to know about.
Nobody seemed concerned about this naming when it originally went in,
but there's a pending patch that implements the opposite of
dsm_keep_mapping, and the term "unkeep" was judged unpalatable.
"unpin" has existing precedent in the PostgreSQL code base, and the
English language, so use this terminology instead.
Per discussion, back-patch to 9.4.
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
Instead of initializing a new TransInvalidationInfo for every
transaction or subtransaction, we can just do it for those
transactions or subtransactions that actually need to queue
invalidation messages. That also avoids needing to free those
entries at the end of a transaction or subtransaction that does
not generate any invalidation messages, which is by far the
common case.
Patch by me. Review by Simon Riggs and Andres Freund.
Since we got rid of non-MVCC catalog scans, the fourth reason given for
using a non-transactional update in index_update_stats() is obsolete.
The other three are still good, so we're not going to change the code,
but fix the comment.
1. The comparison for matching terms used only the CRC to decide if there's
a match. Two different terms with the same CRC gave a match.
2. It assumed that if the second operand has more terms than the first, it's
never a match. That assumption is bogus, because there can be duplicate
terms in either operand.
Rewrite the implementation in a way that doesn't have those bugs.
Backpatch to all supported versions.
Since we taught btree to handle ScalarArrayOpExpr quals natively (commit
9e8da0f757), the planner has always included
ScalarArrayOpExpr quals in index conditions if possible. However, if the
qual is for a non-first index column, this could result in an inferior plan
because we can no longer take advantage of index ordering (cf. commit
807a40c551). It can be better to omit the
ScalarArrayOpExpr qual from the index condition and let it be done as a
filter, so that the output doesn't need to get sorted. Indeed, this is
true for the query introduced as a test case by the latter commit.
To fix, restructure get_index_paths and build_index_paths so that we
consider paths both with and without ScalarArrayOpExpr quals in non-first
index columns. Redesign the API of build_index_paths so that it reports
what it found, saving useless second or third calls.
Report and patch by Andrew Gierth (though rather heavily modified by me).
Back-patch to 9.2 where this code was introduced, since the issue can
result in significant performance regressions compared to plans produced
by 9.1 and earlier.
Don't crash if an ispell dictionary definition contains flags but not
any compound affixes. (This isn't a security issue since only superusers
can install affix files, but still it's a bad thing.)
Also, be more careful about detecting whether an affix-file FLAG command
is old-format (ispell) or new-format (myspell/hunspell). And change the
error message about mixed old-format and new-format commands into something
intelligible.
Per bug #11770 from Emre Hasegeli. Back-patch to all supported branches.
Testing reveals that the memory allocation we do at transaction start
has small but measurable overhead on simple transactions. To cut
down on that overhead, defer some of that work to the point when
AFTER triggers are first used, thus avoiding it altogether if they
never are.
Patch by me. Review by Andres Freund.
Previously, this was not exposed outside of miscinit.c. It is needed
for the pending pg_background patch, and will also be needed for
parallelism. Without it, there's no way for a background worker to
re-create the exact authentication environment that was present in the
process that started it, which could lead to security exposures.
Previously the archive recovery always created .ready file for
the last WAL file of the old timeline at the end of recovery even when
it's restored from the archive and has .done file. That is, there was
the case where the WAL file had both .ready and .done files.
This caused the already-archived WAL file to be archived again.
This commit prevents the archive recovery from creating .ready file
for the last WAL file if it has .done file, in order to prevent it from
being archived again.
This bug was added when cascading replication feature was introduced,
i.e., the commit 5286105800.
So, back-patch to 9.2, where cascading replication was added.
Reviewed by Michael Paquier
In a couple of code paths, pg_class_aclcheck is called in succession
with multiple different modes set. This patch combines those modes to
have a single call of this function and reduce a bit process overhead
for permission checking.
Author: Michael Paquier <michael@otacoo.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
The duplication originated in cdd46c765, where restartpoints were
introduced.
In LogCheckpointStart's case the duplication actually lead to the
compiler's format string checking not to be effective because the
format string wasn't constant.
Arguably these messages shouldn't be elog(), but ereport() style
messages. That'd even allow to translate the messages... But as
there's more mistakes of that kind in surrounding code, it seems
better to change that separately.
Commit 7dbb606938 added a new CHECKPOINT_FLUSH_ALL flag. As that
commit needed to be backpatched I didn't change the numeric values of
the existing flags as that could lead to nastly problems if any
external code issued checkpoints. That's not a concern on master, so
renumber them there.
Also add a comment about CHECKPOINT_FLUSH_ALL above
CreateCheckPoint().
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.
That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.
This was reported in bug #10675 by Maxim Boguk.
Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.
Backpatch to 9.1 where unlogged tables were introduced.
If an inline-able SQL function taking a composite argument is used in a
LATERAL subselect, and the composite argument is a lateral reference,
the planner could fail with "variable not found in subplan target list",
as seen in bug #11703 from Karl Bartel. (The outer function call used in
the bug report and in the committed regression test is not really necessary
to provoke the bug --- you can get it if you manually expand the outer
function into "LATERAL (SELECT inner_function(outer_relation))", too.)
The cause of this is that we generate the reltargetlist for the referenced
relation before doing eval_const_expressions() on the lateral sub-select's
expressions (cf find_lateral_references()), so what's scheduled to be
emitted by the referenced relation is a whole-row Var, not the simplified
single-column Var produced by optimizing the function's FieldSelect on the
whole-row Var. Then setrefs.c fails to match up that lateral reference to
what's available from the outer scan.
Preserving the FieldSelect optimization in such cases would require either
major planner restructuring (to recursively do expression simplification
on sub-selects much earlier) or some amazingly ugly kluge to change the
reltargetlist of a possibly-already-planned relation. It seems better
just to skip the optimization when the Var is from an upper query level;
the case is not so common that it's likely anyone will notice a few
wasted cycles.
AFAICT this problem only occurs for uplevel LATERAL references, so
back-patch to 9.3 where LATERAL was added.
interval precision can only be specified after the "interval" keyword if
no units are specified.
Previously we incorrectly checked the units to see if the precision was
legal, causing confusion.
Report by Alvaro Herrera
Nearly all Paths have parents, but a ResultPath representing an empty FROM
clause does not. Avoid a core dump in such cases. I believe this is only
a hazard for debugging usage, not for production, else we'd have heard
about it before. Nonetheless, back-patch to 9.1 where the troublesome code
was introduced. Noted while poking at bug #11703.
Up to now, PG has assumed that any given timezone abbreviation (such as
"EDT") represents a constant GMT offset in the usage of any particular
region; we had a way to configure what that offset was, but not for it
to be changeable over time. But, as with most things horological, this
view of the world is too simplistic: there are numerous regions that have
at one time or another switched to a different GMT offset but kept using
the same timezone abbreviation. Almost the entire Russian Federation did
that a few years ago, and later this month they're going to do it again.
And there are similar examples all over the world.
To cope with this, invent the notion of a "dynamic timezone abbreviation",
which is one that is referenced to a particular underlying timezone
(as defined in the IANA timezone database) and means whatever it currently
means in that zone. For zones that use or have used daylight-savings time,
the standard and DST abbreviations continue to have the property that you
can specify standard or DST time and get that time offset whether or not
DST was theoretically in effect at the time. However, the abbreviations
mean what they meant at the time in question (or most recently before that
time) rather than being absolutely fixed.
The standard abbreviation-list files have been changed to use this behavior
for abbreviations that have actually varied in meaning since 1970. The
old simple-numeric definitions are kept for abbreviations that have not
changed, since they are a bit faster to resolve.
While this is clearly a new feature, it seems necessary to back-patch it
into all active branches, because otherwise use of Russian zone
abbreviations is going to become even more problematic than it already was.
This change supersedes the changes in commit 513d06ded et al to modify the
fixed meanings of the Russian abbreviations; since we've not shipped that
yet, this will avoid an undesirably incompatible (not to mention incorrect)
change in behavior for timestamps between 2011 and 2014.
This patch makes some cosmetic changes in ecpglib to keep its usage of
datetime lookup tables as similar as possible to the backend code, but
doesn't do anything about the increasingly obsolete set of timezone
abbreviation definitions that are hard-wired into ecpglib. Whatever we
do about that will likely not be appropriate material for back-patching.
Also, a potential free() of a garbage pointer after an out-of-memory
failure in ecpglib has been fixed.
This patch also fixes pre-existing bugs in DetermineTimeZoneOffset() that
caused it to produce unexpected results near a timezone transition, if
both the "before" and "after" states are marked as standard time. We'd
only ever thought about or tested transitions between standard and DST
time, but that's not what's happening when a zone simply redefines their
base GMT offset.
In passing, update the SGML documentation to refer to the Olson/zoneinfo/
zic timezone database as the "IANA" database, since it's now being
maintained under the auspices of IANA.
We've gotten enough push-back on that change to make it clear that it
wasn't an especially good idea to do it like that. Revert plain EXPLAIN
to its previous behavior, but keep the extra output in EXPLAIN ANALYZE.
Per discussion.
Internally, I set this up as a separate flag ExplainState.summary that
controls printing of planning time and execution time. For now it's
just copied from the ANALYZE option, but we could consider exposing it
to users.
LWLockRelease should release all backends waiting with LWLockWaitForVar,
even when another backend has already been woken up to acquire the lock,
i.e. when releaseOK is false. LWLockWaitForVar can return as soon as the
protected value changes, even if the other backend will acquire the lock.
Fix that by resetting releaseOK to true in LWLockWaitForVar, whenever
adding itself to the wait queue.
This should fix the bug reported by MauMau, where the system occasionally
hangs when there is a lot of concurrent WAL activity and a checkpoint.
Backpatch to 9.4, where this code was added.
If we expect batching at the very beginning, we size nbuckets for
"full work_mem" (see how many tuples we can get into work_mem,
while not breaking NTUP_PER_BUCKET threshold).
If we expect to be fine without batching, we start with the 'right'
nbuckets and track the optimal nbuckets as we go (without actually
resizing the hash table). Once we hit work_mem (considering the
optimal nbuckets value), we keep the value.
At the end of the first batch, we check whether (nbuckets !=
nbuckets_optimal) and resize the hash table if needed. Also, we
keep this value for all batches (it's OK because it assumes full
work_mem, and it makes the batchno evaluation trivial). So the
resize happens only once.
There could be cases where it would improve performance to allow
the NTUP_PER_BUCKET threshold to be exceeded to keep everything in
one batch rather than spilling to a second batch, but attempts to
generate such a case have so far been unsuccessful; that issue may
be addressed with a follow-on patch after further investigation.
Tomas Vondra with minor format and comment cleanup by me
Reviewed by Robert Haas, Heikki Linnakangas, and Kevin Grittner
When determining whether one JSONB object contains another, it's okay to
make a quick exit if the first object has fewer pairs than the second:
because we de-duplicate keys within objects, it is impossible that the
first object has all the keys the second does. However, the code was
applying this rule to JSONB arrays as well, where it does *not* hold
because arrays can contain duplicate entries. The test was really in
the wrong place anyway; we should do it within JsonbDeepContains, where
it can be applied to nested objects not only top-level ones.
Report and test cases by Alexander Korotkov; fix by Peter Geoghegan and
Tom Lane.
The new header contains many prototypes for functions in ruleutils.c
that are not exposed to the SQL level.
Reviewed by Andres Freund and Michael Paquier.
shm_mq_sendv sends a message to the queue assembled from multiple
locations. This is expected to be used by forthcoming patches to
allow frontend/backend protocol messages to be sent via shm_mq, but
might be useful for other purposes as well.
shm_mq_set_handle associates a BackgroundWorkerHandle with an
already-existing shm_mq_handle. This solves a timing problem when
creating a shm_mq to communicate with a newly-launched background
worker: if you attach to the queue first, and the background worker
fails to start, you might block forever trying to do I/O on the queue;
but if you start the background worker first, but then die before
attaching to the queue, the background worrker might block forever
trying to do I/O on the queue. This lets you attach before starting
the worker (so that the worker is protected) and then associate the
BackgroundWorkerHandle later (so that you are also protected).
Patch by me, reviewed by Stephen Frost.
This clause changes the behavior of SELECT locking clauses in the
presence of locked rows: instead of causing a process to block waiting
for the locks held by other processes (or raise an error, with NOWAIT),
SKIP LOCKED makes the new reader skip over such rows. While this is not
appropriate behavior for general purposes, there are some cases in which
it is useful, such as queue-like tables.
Catalog version bumped because this patch changes the representation of
stored rules.
Reviewed by Craig Ringer (based on a previous attempt at an
implementation by Simon Riggs, who also provided input on the syntax
used in the current patch), David Rowley, and Álvaro Herrera.
Author: Thomas Munro
Teach sigusr1_handler() to use the same test for whether a worker
might need to be started as ServerLoop(). Aside from being perhaps
a bit simpler, this prevents a potentially-unbounded delay when
starting a background worker. On some platforms, select() doesn't
return when interrupted by a signal, but is instead restarted,
including a reset of the timeout to the originally-requested value.
If signals arrive often enough, but no connection requests arrive,
sigusr1_handler() will be executed repeatedly, but the body of
ServerLoop() won't be reached. This change ensures that, even in
that case, background workers will eventually get launched.
This is far from a perfect fix; really, we need select() to return
control to ServerLoop() after an interrupt, either via the self-pipe
trick or some other mechanism. But that's going to require more
work and discussion, so let's do this for now to at least mitigate
the damage.
Per investigation of test_shm_mq failures on buildfarm member anole.
Most zones in the Russian Federation are subtracting one or two hours
as of 2014-10-26. Update the meanings of the abbreviations IRKT, KRAT,
MAGT, MSK, NOVT, OMST, SAKT, VLAT, YAKT, YEKT to match.
The IANA timezone database has adopted abbreviations of the form AxST/AxDT
for all Australian time zones, reflecting what they believe to be current
majority practice Down Under. These names do not conflict with usage
elsewhere (other than ACST for Acre Summer Time, which has been in disuse
since 1994). Accordingly, adopt these names into our "Default" timezone
abbreviation set. The "Australia" abbreviation set now contains only
CST,EAST,EST,SAST,SAT,WST, all of which are thought to be mostly historical
usage. Note that SAST has also been changed to be South Africa Standard
Time in the "Default" abbreviation set.
Add zone abbreviations SRET (Asia/Srednekolymsk) and XJT (Asia/Urumqi),
and use WSST/WSDT for western Samoa.
Also a DST law change in the Turks & Caicos Islands (America/Grand_Turk),
and numerous corrections for historical time zone data.
Peter G pointed out that valgrind was, rightfully, complaining about
CreatePolicy() ending up copying beyond the end of the parsed policy
name. Name is a fixed-size type and we need to use namein (through
DirectFunctionCall1()) to flush out the entire array before we pass
it down to heap_form_tuple.
Michael Paquier pointed out that pg_dump --verbose was missing a
newline and Fabrízio de Royes Mello further pointed out that the
schema was also missing from the messages, so fix those also.
Also, based on an off-list comment from Kevin, rework the psql \d
output to facilitate copy/pasting into a new CREATE or ALTER POLICY
command.
Lastly, improve the pg_policies view and update the documentation for
it, along with a few other minor doc corrections based on an off-list
discussion with Adam Brightwell.
When there are cost-delay-related storage options set for a table,
trying to make that table participate in the autovacuum cost-limit
balancing algorithm produces undesirable results: instead of using the
configured values, the global values are always used,
as illustrated by Mark Kirkwood in
http://www.postgresql.org/message-id/52FACF15.8020507@catalyst.net.nz
Since the mechanism is already complicated, just disable it for those
cases rather than trying to make it cope. There are undesirable
side-effects from this too, namely that the total I/O impact on the
system will be higher whenever such tables are vacuumed. However, this
is seen as less harmful than slowing down vacuum, because that would
cause bloat to accumulate. Anyway, in the new system it is possible to
tweak options to get the precise behavior one wants, whereas with the
previous system one was simply hosed.
This has been broken forever, so backpatch to all supported branches.
This might affect systems where cost_limit and cost_delay have been set
for individual tables.
The page splitting code would go into infinite recursion if you try to
insert an index tuple that doesn't fit even on an empty page.
Per analysis and suggested fix by Andrew Gierth. Fixes bug #11555, reported
by Bryan Seitz (analysis happened over IRC). Backpatch to all supported
versions.
Testing by Amit Kapila, Andres Freund, and myself, with and without
other patches that also aim to improve scalability, seems to indicate
that this change is a significant win over the current value and over
smaller values such as 64. It's not clear how high we can push this
value before it starts to have negative side-effects elsewhere, but
going this far looks OK.
As of commit a87c72915 (which later got backpatched as far as 9.1),
we're explicitly supporting the notion that append relations can be
nested; this can occur when UNION ALL constructs are nested, or when
a UNION ALL contains a table with inheritance children.
Bug #11457 from Nelson Page, as well as an earlier report from Elvis
Pranskevichus, showed that there were still nasty bugs associated with such
cases: in particular the EquivalenceClass mechanism could try to generate
"join" clauses connecting an appendrel child to some grandparent appendrel,
which would result in assertion failures or bogus plans.
Upon investigation I concluded that all current callers of
find_childrel_appendrelinfo() need to be fixed to explicitly consider
multiple levels of parent appendrels. The most complex fix was in
processing of "broken" EquivalenceClasses, which are ECs for which we have
been unable to generate all the derived equality clauses we would like to
because of missing cross-type equality operators in the underlying btree
operator family. That code path is more or less entirely untested by
the regression tests to date, because no standard opfamilies have such
holes in them. So I wrote a new regression test script to try to exercise
it a bit, which turned out to be quite a worthwhile activity as it exposed
existing bugs in all supported branches.
The present patch is essentially the same as far back as 9.2, which is
where parameterized paths were introduced. In 9.0 and 9.1, we only need
to back-patch a small fragment of commit 5b7b5518d, which fixes failure to
propagate out the original WHERE clauses when a broken EC contains constant
members. (The regression test case results show that these older branches
are noticeably stupider than 9.2+ in terms of the quality of the plans
generated; but we don't really care about plan quality in such cases,
only that the plan not be outright wrong. A more invasive fix in the
older branches would not be a good idea anyway from a plan-stability
standpoint.)
I left the GUC in place for the beta period, so that people could experiment
with different values. No-one's come up with any data that a different value
would be better under some circumstances, so rather than try to document to
users what the GUC, let's just hard-code the current value, 8.
DetermineSleepTime() was previously called without blocked
signals. That's not good, because it allows signal handlers to
interrupt its workings.
DetermineSleepTime() was added in 9.3 with the addition of background
workers (da07a1e856), where it only read from
BackgroundWorkerList.
Since 9.4, where dynamic background workers were added (7f7485a0cd),
the list is also manipulated in DetermineSleepTime(). That's bad
because the list now can be persistently corrupted if modified by both
a signal handler and DetermineSleepTime().
This was discovered during the investigation of hangs on buildfarm
member anole. It's unclear whether this bug is the source of these
hangs or not, but it's worth fixing either way. I have confirmed that
it can cause crashes.
It luckily looks like this only can cause problems when bgworkers are
actively used.
Discussion: 20140929193733.GB14400@awork2.anarazel.de
Backpatch to 9.3 where background workers were introduced.
As noted in http://bugs.debian.org/763098 there is a conflict between
postgres' definition of CACHE_LINE_SIZE and the definition by various
*bsd platforms. It's debatable who has the right to define such a
name, but postgres' use was only introduced in 375d8526f2 (9.4), so
it seems like a good idea to rename it.
Discussion: 20140930195756.GC27407@msg.df7cb.de
Per complaint of Christoph Berg in the above email, although he's not
the original bug reporter.
Backpatch to 9.4 where the define was introduced.
Per discussion, revert the commit which added 'ignore_nulls' to
row_to_json. This capability would be better added as an independent
function rather than being bolted on to row_to_json. Additionally,
the implementation didn't address complex JSON objects, and so was
incomplete anyway.
Pointed out by Tom and discussed with Andrew and Robert.
The original design used an array of offsets into the variable-length
portion of a JSONB container. However, such an array is basically
uncompressible by simple compression techniques such as TOAST's LZ
compressor. That's bad enough, but because the offset array is at the
front, it tended to trigger the give-up-after-1KB heuristic in the TOAST
code, so that the entire JSONB object was stored uncompressed; which was
the root cause of bug #11109 from Larry White.
To fix without losing the ability to extract a random array element in O(1)
time, change this scheme so that most of the JEntry array elements hold
lengths rather than offsets. With data that's compressible at all, there
tend to be fewer distinct element lengths, so that there is scope for
compression of the JEntry array. Every N'th entry is still an offset.
To determine the length or offset of any specific element, we might have
to examine up to N preceding JEntrys, but that's still O(1) so far as the
total container size is concerned. Testing shows that this cost is
negligible compared to other costs of accessing a JSONB field, and that
the method does largely fix the incompressible-data problem.
While at it, rearrange the order of elements in a JSONB object so that
it's "all the keys, then all the values" not alternating keys and values.
This doesn't really make much difference right at the moment, but it will
allow providing a fast path for extracting individual object fields from
large JSONB values stored EXTERNAL (ie, uncompressed), analogously to the
existing optimization for substring extraction from large EXTERNAL text
values.
Bump catversion to denote the incompatibility in on-disk format.
We will need to fix pg_upgrade to disallow upgrading jsonb data stored
with 9.4 betas 1 and 2.
Heikki Linnakangas and Tom Lane
Andres pointed out that there was an extra ';' in equalPolicies, which
made me realize that my prior testing with CLOBBER_CACHE_ALWAYS was
insufficient (it didn't always catch the issue, just most of the time).
Thanks to that, a different issue was discovered, specifically in
equalRSDescs. This change corrects eqaulRSDescs to return 'true' once
all policies have been confirmed logically identical. After stepping
through both functions to ensure correct behavior, I ran this for
about 12 hours of CLOBBER_CACHE_ALWAYS runs of the regression tests
with no failures.
In addition, correct a few typos in the documentation which were pointed
out by Thom Brown (thanks!) and improve the policy documentation further
by adding a flushed out usage example based on a unix passwd file.
Lastly, clean up a few comments in the regression tests and pg_dump.h.
Several upcoming performance/scalability improvements require atomic
operations. This new API avoids the need to splatter compiler and
architecture dependent code over all the locations employing atomic
ops.
For several of the potential usages it'd be problematic to maintain
both, a atomics using implementation and one using spinlocks or
similar. In all likelihood one of the implementations would not get
tested regularly under concurrency. To avoid that scenario the new API
provides a automatic fallback of atomic operations to spinlocks. All
properties of atomic operations are maintained. This fallback -
obviously - isn't as fast as just using atomic ops, but it's not bad
either. For one of the future users the atomics ontop spinlocks
implementation was actually slightly faster than the old purely
spinlock using implementation. That's important because it reduces the
fear of regressing older platforms when improving the scalability for
new ones.
The API, loosely modeled after the C11 atomics support, currently
provides 'atomic flags' and 32 bit unsigned integers. If the platform
efficiently supports atomic 64 bit unsigned integers those are also
provided.
To implement atomics support for a platform/architecture/compiler for
a type of atomics 32bit compare and exchange needs to be
implemented. If available and more efficient native support for flags,
32 bit atomic addition, and corresponding 64 bit operations may also
be provided. Additional useful atomic operations are implemented
generically ontop of these.
The implementation for various versions of gcc, msvc and sun studio have
been tested. Additional existing stub implementations for
* Intel icc
* HUPX acc
* IBM xlc
are included but have never been tested. These will likely require
fixes based on buildfarm and user feedback.
As atomic operations also require barriers for some operations the
existing barrier support has been moved into the atomics code.
Author: Andres Freund with contributions from Oskari Saarenmaa
Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera
Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com,
20131015123303.GH5300@awork2.anarazel.de,
20131028205522.GI20248@awork2.anarazel.de
We removed a similar ban on this in json_object recently, but the ban in
datum_to_json was left, which generate4d sprutious errors in othee json
generators, notable json_build_object.
Along the way, add an assertion that datum_to_json is not passed a null
key. All current callers comply with this rule, but the assertion will
catch any possible future misbehaviour.
Previously, we used an lwlock that was held from the time we began
seeking a candidate buffer until the time when we found and pinned
one, which is disastrous for concurrency. Instead, use a spinlock
which is held just long enough to pop the freelist or advance the
clock sweep hand, and then released. If we need to advance the clock
sweep further, we reacquire the spinlock once per buffer.
This represents a significant increase in atomic operations around
buffer eviction, but it still wins on many workloads. On others, it
may result in no gain, or even cause a regression, unless the number
of buffer mapping locks is also increased. However, that seems like
material for a separate commit. We may also need to consider other
methods of mitigating contention on this spinlock, such as splitting
it into multiple locks or jumping the clock sweep hand more than one
buffer at a time, but those, too, seem like separate improvements.
Patch by me, inspired by a much larger patch from Amit Kapila.
Reviewed by Andres Freund.
Some compilers don't automatically search the current directory for
included files. 9cc2c182fc fixed that for builds from tarballs by
adding an include to the source directory. But that doesn't work when
the scanner is generated in the VPATH directory. Use the same search
path as the other parsers in the tree.
One compiler that definitely was affected is solaris' sun cc.
Backpatch to 9.1 which introduced using an actual parser for
replication commands.
Address a few typos in the row security update, pointed out
off-list by Adam Brightwell. Also include 'ALL' in the list
of commands supported, for completeness.
Buildfarm member tick identified an issue where the policies in the
relcache for a relation were were being replaced underneath a running
query, leading to segfaults while processing the policies to be added
to a query. Similar to how TupleDesc RuleLocks are handled, add in a
equalRSDesc() function to check if the policies have actually changed
and, if not, swap back the rsdesc field (using the original instead of
the temporairly built one; the whole structure is swapped and then
specific fields swapped back). This now passes a CLOBBER_CACHE_ALWAYS
for me and should resolve the buildfarm error.
In addition to addressing this, add a new chapter in Data Definition
under Privileges which explains row security and provides examples of
its usage, change \d to always list policies (even if row security is
disabled- but note that it is disabled, or enabled with no policies),
rework check_role_for_policy (it really didn't need the entire policy,
but it did need to be using has_privs_of_role()), and change the field
in pg_class to relrowsecurity from relhasrowsecurity, based on
Heikki's suggestion. Also from Heikki, only issue SET ROW_SECURITY in
pg_restore when talking to a 9.5+ server, list Bypass RLS in \du, and
document --enable-row-security options for pg_dump and pg_restore.
Lastly, fix a number of minor whitespace and typo issues from Heikki,
Dimitri, add a missing #include, per Peter E, fix a few minor
variable-assigned-but-not-used and resource leak issues from Coverity
and add tab completion for role attribute bypassrls as well.
This function created new Vars with varno different from varnoold, which
is a condition that should never prevail before setrefs.c does the final
variable-renumbering pass. The created Vars could not be seen as equal()
to normal Vars, which among other things broke equivalence-class processing
for them. The consequences of this were indeed visible in the regression
tests, in the form of failure to propagate constants as one would expect.
I stumbled across it while poking at bug #11457 --- after intentionally
disabling join equivalence processing, the security-barrier regression
tests started falling over with fun errors like "could not find pathkey
item to sort", because of failure to match the corrupted Vars to normal
ones.
When the number of allowed iterations is limited (either a "?" quantifier
or a bound expression), the last sub-match has to reach to the end of the
target string. The previous coding here first tried the shortest possible
match (one character, usually) and then gave up and back-tracked if that
didn't work, typically leading to failure to match overall, as shown in
bug #11478 from Christoph Berg. The minimum change to fix that would be to
not decrement k before "goto backtrack"; but that would be a pretty stupid
solution, because we'd laboriously try each possible sub-match length
before finally discovering that only ending at the end can work. Instead,
force the sub-match endpoint limit up to the end for even the first
shortest() call if we cannot have any more sub-matches after this one.
Bug introduced in my rewrite that added the iterdissect logic, commit
173e29aa5d. The shortest-first search code
was too closely modeled on the longest-first code, which hasn't got this
issue since it tries a match reaching to the end to start with anyway.
Back-patch to all affected branches.
Per discussion in bug #11350, log ALTER SYSTEM commands at the
log_statement=ddl level, rather than at the log_statement=all level.
Pointed out by Tomonari Katsumata.
Back-patch to 9.4 where ALTER SYSTEM was introduced.
While withCheckOption exprs had been handled in many cases by
happenstance, they need to be handled during set_plan_references and
more specifically down in set_plan_refs for ModifyTable plan nodes.
This is to ensure that the opfuncid's are set for operators referenced
in the withCheckOption exprs.
Identified as an issue by Thom Brown
Patch by Dean Rasheed
Back-patch to 9.4, where withCheckOption was introduced.
For the reason outlined in df4077cda2 also remove volatile qualifiers
from xlog.c. Some of these uses of volatile have been added after
noticing problems back when spinlocks didn't imply compiler
barriers. So they are a good test - in fact removing the volatiles
breaks when done without the barriers in spinlocks present.
Several uses of volatile remain where they are explicitly used to
access shared memory without locks. These locations are ok with
slightly out of date data, but removing the volatile might lead to the
variables never being reread from memory. These uses could also be
replaced by barriers, but that's a separate change of doubtful value.
Now that spinlocks (hopefully!) act as compiler barriers, as of commit
0709b7ee72, this should be safe. This
serves as a demonstration of the new coding style, and may be optimized
better on some machines as well.
There are four weaknesses in728f152e07f998d2cb4fe5f24ec8da2c3bda98f2:
* append_init() in heapdesc.c was ugly and required that rm_identify
return values are only valid till the next call. Instead just add a
couple more switch() cases for the INIT_PAGE cases. Now the returned
value will always be valid.
* a couple rm_identify() callbacks missed masking xl_info with
~XLR_INFO_MASK.
* pg_xlogdump didn't map a NULL rm_identify to UNKNOWN or a similar
string.
* append_init() was called when id=NULL - which should never actually
happen. But it's better to be careful.
Testing reveals that that doing a memcmp() before the strcoll() costs
practically nothing, at least on the systems we tested, and it speeds
up sorts containing many equal strings significatly.
Peter Geoghegan. Review by myself and Heikki Linnakangas. Comments
rewritten by me.
Building on the updatable security-barrier views work, add the
ability to define policies on tables to limit the set of rows
which are returned from a query and which are allowed to be added
to a table. Expressions defined by the policy for filtering are
added to the security barrier quals of the query, while expressions
defined to check records being added to a table are added to the
with-check options of the query.
New top-level commands are CREATE/ALTER/DROP POLICY and are
controlled by the table owner. Row Security is able to be enabled
and disabled by the owner on a per-table basis using
ALTER TABLE .. ENABLE/DISABLE ROW SECURITY.
Per discussion, ROW SECURITY is disabled on tables by default and
must be enabled for policies on the table to be used. If no
policies exist on a table with ROW SECURITY enabled, a default-deny
policy is used and no records will be visible.
By default, row security is applied at all times except for the
table owner and the superuser. A new GUC, row_security, is added
which can be set to ON, OFF, or FORCE. When set to FORCE, row
security will be applied even for the table owner and superusers.
When set to OFF, row security will be disabled when allowed and an
error will be thrown if the user does not have rights to bypass row
security.
Per discussion, pg_dump sets row_security = OFF by default to ensure
that exports and backups will have all data in the table or will
error if there are insufficient privileges to bypass row security.
A new option has been added to pg_dump, --enable-row-security, to
ask pg_dump to export with row security enabled.
A new role capability, BYPASSRLS, which can only be set by the
superuser, is added to allow other users to be able to bypass row
security using row_security = OFF.
Many thanks to the various individuals who have helped with the
design, particularly Robert Haas for his feedback.
Authors include Craig Ringer, KaiGai Kohei, Adam Brightwell, Dean
Rasheed, with additional changes and rework by me.
Reviewers have included all of the above, Greg Smith,
Jeff McCormick, and Robert Haas.
This is primarily useful for the upcoming pg_xlogdump --stats feature,
but also allows to remove some duplicated code in the rmgr_desc
routines.
Due to the separation and harmonization, the output of dipsplayed
records changes somewhat. But since this isn't enduser oriented
content that's ok.
It's potentially desirable to further change pg_xlogdump's display of
records. It previously wasn't possible to show the record type
separately from the description forcing it to be in the last
column. But that's better done in a separate commit.
Author: Abhijit Menon-Sen, slightly editorialized by me
Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas
Discussion: 20140604104716.GA3989@toroid.org
This new GUC context option allows GUC parameters to have the combined
properties of PGC_BACKEND and PGC_SUSET, ie, they don't change after
session start and non-superusers can't change them. This is a more
appropriate choice for log_connections and log_disconnections than their
previous context of PGC_BACKEND, because we don't want non-superusers
to be able to affect whether their sessions get logged.
Note: the behavior for log_connections is still a bit odd, in that when
a superuser attempts to set it from PGOPTIONS, the setting takes effect
but it's too late to enable or suppress connection startup logging.
It's debatable whether that's worth fixing, and in any case there is
a reasonable argument for PGC_SU_BACKEND to exist.
In passing, re-pgindent the files touched by this commit.
Fujii Masao, reviewed by Joe Conway and Amit Kapila
Since this makes the bucket headers use ~10x as much memory, properly
account for that memory when we figure out whether everything fits
in work_mem. This might result in some cases that previously used
only a single batch getting split into multiple batches, but it's
unclear as yet whether we need defenses against that case, and if so,
what the shape of those defenses should be.
It's worth noting that even in these edge cases, users should still be
no worse off than they would have been last week, because commit
45f6240a8f saved a big pile of memory
on exactly the same workloads.
Tomas Vondra, reviewed and somewhat revised by me.
Previously replication commands like IDENTIFY_COMMAND were not logged
even when log_statements is set to all. Some users who want to audit
all types of statements were not satisfied with this situation. To
address the problem, this commit adds new GUC log_replication_commands.
If it's enabled, all replication commands are logged in the server log.
There are many ways to allow us to enable that logging. For example,
we can extend log_statement so that replication commands are logged
when it's set to all. But per discussion in the community, we reached
the consensus to add separate GUC for that.
Reviewed by Ian Barwick, Robert Haas and Heikki Linnakangas.
The code that tried to split a page at 75/25 ratio, when appending to the
end of an index, was buggy in two ways. First, there was a silly typo that
caused it to just fill the left page as full as possible. But the logic as
it was intended wasn't correct either, and would actually have given a ratio
closer to 60/40 than 75/25.
Gaetano Mendola spotted the typo. Backpatch to 9.4, where this code was added.
The code for raising a NUMERIC value to an integer power wasn't very
careful about large powers. It got an outright wrong answer for an
exponent of INT_MIN, due to failure to consider overflow of the Abs(exp)
operation; which is fixable by using an unsigned rather than signed
exponent value after that point. Also, even though the number of
iterations of the power-computation loop is pretty limited, it's easy for
the repeated squarings to result in ridiculously enormous intermediate
values, which can take unreasonable amounts of time/memory to process,
or even overflow the internal "weight" field and so produce a wrong answer.
We can forestall misbehaviors of that sort by bailing out as soon as the
weight value exceeds what will fit in int16, since then the final answer
must overflow (if exp > 0) or underflow (if exp < 0) the packed numeric
format.
Per off-list report from Pavel Stehule. Back-patch to all supported
branches.
Provide an option to skip NULL values in a row when generating a JSON
object from that row with row_to_json. This can reduce the size of the
JSON object in cases where columns are NULL without really reducing the
information in the JSON object.
This also makes row_to_json into a single function with default values,
rather than having multiple functions. In passing, change array_to_json
to also be a single function with default values (we don't add an
'ignore_nulls' option yet- it's not clear that there is a sensible
use-case there, and it hasn't been asked for in any case).
Pavel Stehule
Instead of palloc'ing each HashJoinTuple individually, allocate 32kB chunks
and pack the tuples densely in the chunks. This avoids the AllocChunk
header overhead, and the space wasted by standard allocator's habit of
rounding sizes up to the nearest power of two.
This doesn't contain any planner changes, because the planner's estimate of
memory usage ignores the palloc overhead. Now that the overhead is smaller,
the planner's estimates are in fact more accurate.
Tomas Vondra, reviewed by Robert Haas.
The code I added in commit f343a880d5 was
careless about preserving AND/OR flatness: it could create a structure with
an OR node directly underneath another one. That breaks an assumption
that's fairly important for planning efficiency, not to mention triggering
various Asserts (as reported by Benjamin Smith). Add a trifle more logic
to handle the case properly.
Previously, they functioned as barriers against CPU reordering but not
compiler reordering, an odd API that required extensive use of volatile
everywhere that spinlocks are used. That's error-prone and has negative
implications for performance, so change it.
In theory, this makes it safe to remove many of the uses of volatile
that we currently have in our code base, but we may find that there are
some bugs in this effort when we do. In the long run, though, this
should make for much more maintainable code.
Patch by me. Review by Andres Freund.
This provides a convenient method of classifying input values into buckets
that are not necessarily equal-width. It works on any sortable data type.
The choice of function name is a bit debatable, perhaps, but showing that
there's a relationship to the SQL standard's width_bucket() function seems
more attractive than the other proposals.
Petr Jelinek, reviewed by Pavel Stehule
The xml type previously rejected "content" that is empty or consists
only of spaces. But the SQL/XML standard allows that, so change that.
The accepted values for XML "documents" are not changed.
Reviewed-by: Ali Akbar <the.apaan@gmail.com>
Now that ALTER TABLE .. ALL IN TABLESPACE has replaced the previous
ALTER TABLESPACE approach, it makes sense to move the calls down in
to ProcessUtilitySlow where the rest of ALTER TABLE is handled.
This also means that event triggers will support ALTER TABLE .. ALL
(which was the impetus for the original change, though it has other
good qualities also).
Álvaro Herrera
Back-patch to 9.4 as the original rework was.
Some Sparc CPUs can be run in various coherence models, ranging from
RMO (relaxed) over PSO (partial) to TSO (total). Solaris has always
run CPUs in TSO mode while in userland, but linux didn't use to and
the various *BSDs still don't. Unfortunately the sparc TAS/S_UNLOCK
were only correct under TSO. Fix that by adding the necessary memory
barrier instructions. On sparcv8+, which should be all relevant CPUs,
these are treated as NOPs if the current consistency model doesn't
require the barriers.
Discussion: 20140630222854.GW26930@awork2.anarazel.de
Will be backpatched to all released branches once a few buildfarm
cycles haven't shown up problems. As I've no access to sparc, this is
blindly written.
Every redo routine uses the same idiom to determine what to do to a page:
check if there's a backup block for it, and if not read, the buffer if the
block exists, and check its LSN. Refactor that into a common function,
XLogReadBufferForRedo, making all the redo routines shorter and more
readable.
This has no user-visible effect, and makes no changes to the WAL format.
Reviewed by Andres Freund, Alvaro Herrera, Michael Paquier.
This patch allows us to execute ALTER SYSTEM RESET command to
remove the configuration entry from postgresql.auto.conf.
Vik Fearing, reviewed by Amit Kapila and me.
68a2e52bba has introduced LWLockAcquireCommon() containing the
previous contents of LWLockAcquire() plus added functionality. The
latter then calls it, just like LWLockAcquireWithVar(). Because the
majority of callers don't need the added functionality, declare the
common code as inline. The compiler then can optimize away the unused
code. Doing so is also useful when looking at profiles, to
differentiate the users.
Backpatch to 9.4, the first branch to contain LWLockAcquireCommon().
Since the dawn of time (aka Postgres95) multiple pins of the same
buffer by one backend have been optimized not to modify the shared
refcount more than once. This optimization has always used a NBuffer
sized array in each backend keeping track of a backend's pins.
That array (PrivateRefCount) was one of the biggest per-backend memory
allocations, depending on the shared_buffers setting. Besides the
waste of memory it also has proven to be a performance bottleneck when
assertions are enabled as we make sure that there's no remaining pins
left at the end of transactions. Also, on servers with lots of memory
and a correspondingly high shared_buffers setting the amount of random
memory accesses can also lead to poor cpu cache efficiency.
Because of these reasons a backend's buffers pins are now kept track
of in a small statically sized array that overflows into a hash table
when necessary. Benchmarks have shown neutral to positive performance
results with considerably lower memory usage.
Patch by me, review by Robert Haas.
Discussion: 20140321182231.GA17111@alap3.anarazel.de
The list of posting lists it's dealing with can contain placeholders for
deleted posting lists. The placeholders are kept around so that they can
be WAL-logged, but we must be careful to not try to access them.
This fixes bug #11280, reported by Mårten Svantesson. Backpatch to 9.4,
where the compressed data leaf page code was added.
This reverts commit e23014f3d4.
As the side effect of the reverted commit, when the unit is
specified, the reloption was stored in the catalog with the unit.
This broke pg_dump (specifically, it prevented pg_dump from
outputting restorable backup regarding the reloption) and
turned the buildfarm red. Revert the commit until the fixed
version is ready.
This is useful to allow to set GUCs to values that include spaces;
something that wasn't previously possible. The primary case motivating
this is the desire to set default_transaction_isolation to 'repeatable
read' on a per connection basis, but other usecases like seach_path do
also exist.
This introduces a slight backward incompatibility: Previously a \ in
an option value would have been passed on literally, now it'll be
taken as an escape.
The relevant mailing list discussion starts with
20140204125823.GJ12016@awork2.anarazel.de.
This introduces an infrastructure which allows us to specify the units
like ms (milliseconds) in integer relation option, like GUC parameter.
Currently only autovacuum_vacuum_cost_delay reloption can accept
the units.
Reviewed by Michael Paquier
Previously, only a single-byte character was allowed as an
escape. This patch allows it to be a multi-byte character, though it
still must be a single character.
Reviewed by Heikki Linnakangas and Tom Lane.
If SELECT FOR UPDATE NOWAIT tries to lock a tuple that is concurrently
being updated, it might fail to honor its NOWAIT specification and block
instead of raising an error.
Fix by adding a no-wait flag to EvalPlanQualFetch which it can pass down
to heap_lock_tuple; also use it in EvalPlanQualFetch itself to avoid
blocking while waiting for a concurrent transaction.
Authors: Craig Ringer and Thomas Munro, tweaked by Álvaro
http://www.postgresql.org/message-id/51FB6703.9090801@2ndquadrant.com
Per Thomas Munro in the course of his SKIP LOCKED feature submission,
who also provided one of the isolation test specs.
Backpatch to 9.4, because that's as far back as it applies without
conflicts (although the bug goes all the way back). To that branch also
backpatch Thomas Munro's new NOWAIT test cases, committed in master by
Heikki as commit 9ee16b49f0 .
In some cases, not all Vars were being correctly marked as having been
modified for updatable security barrier views, which resulted in invalid
plans (eg: when security barrier views were created over top of
inheiritance structures).
In passing, be sure to update both varattno and varonattno, as _equalVar
won't consider the Vars identical otherwise. This isn't known to cause
any issues with updatable security barrier views, but was noticed as
missing while working on RLS and makes sense to get fixed.
Back-patch to 9.4 where updatable security barrier views were
introduced.
Use SECURITY_LOCAL_USERID_CHANGE while building temporary tables;
only escalate to SECURITY_RESTRICTED_OPERATION while potentially
running user-supplied code. The more secure mode was preventing
temp table creation. Add regression tests to cover this problem.
This fixes Bug #11208 reported by Bruno Emanuel de Andrade Silva.
Backpatch to 9.4, where the bug was introduced.
Prevent automatic oid assignment when in binary upgrade mode. Also
throw an error when contrib/pg_upgrade_support functions are called when
not in binary upgrade mode.
This prevent automatically-assigned oids from conflicting with later
pre-assigned oids coming from the old cluster. It also makes sure oids
are preserved in call important cases.
There's no point in setting up a context error callback when doing
conditional lock acquisition, because we never actually wait and so the
user wouldn't be able to see the context message anywhere. In fact,
this is more in line with what ConditionalXactLockTableWait is doing.
Backpatch to 9.4, where this was added.
The symbol was added by 71901ab6d; the original code was introduced by
6868ed749. Development of both overlapped which is why we apparently
failed to notice.
This is a (very slight) behavior change, so I'm not backpatching this to
9.4 for now, even though the symbol does exist there.
Event triggers want to know the OID of the interesting object created,
which is the main type. The array created as part of the operation is
just a subsidiary object which is not of much interest.
Other DDL commands are already returning the OID, which is required for
future additional event trigger work. This is merely making these
commands in line with the rest of utility command support.
Add a succint comment explaining why it's correct to change the
persistence in this way. Also s/loggedness/persistence/ because native
speakers didn't like the latter term.
Fabrízio and Álvaro
CheckConstraintFetch() leaked a cstring in the caller's context for each
CHECK constraint expression it copied into the relcache. Ordinarily that
isn't problematic, but it can be during CLOBBER_CACHE testing because so
many reloads can happen during a single query; so complicate the code
slightly to allow freeing the cstring after use. Per testing on buildfarm
member barnacle.
This is exactly like the leak fixed in AttrDefaultFetch() by commit
078b2ed291. (Yes, this time I did look for
other instances of the same coding pattern :-(.) Like that patch, no
back-patch, since it seems unlikely that there's any problem except under
very artificial test conditions.
BTW, it strikes me that both of these places would require further work
comparable to commit ab8c84db2f, if we ever
supported defaults or check constraints on system catalogs: they both
assume they are copying into an empty relcache data structure, and that
conceivably wouldn't be the case during recursive reloading of a system
catalog. This does not seem worth worrying about for the moment, since
there is no near-term prospect of supporting any such thing. So I'll
just note the possibility for the archives' sake.
This enables changing permanent (logged) tables to unlogged and
vice-versa.
(Docs for ALTER TABLE / SET TABLESPACE got shuffled in an order that
hopefully makes more sense than the original.)
Author: Fabrízio de Royes Mello
Reviewed by: Christoph Berg, Andres Freund, Thom Brown
Some tweaking by Álvaro Herrera
Cause the path extraction operators to return their lefthand input,
not NULL, if the path array has no elements. This seems more consistent
since the case ought to correspond to applying the simple extraction
operator (->) zero times.
Cause other corner cases in field/element/path extraction to return NULL
rather than failing. This behavior is arguably more useful than throwing
an error, since it allows an expression index using these operators to be
built even when not all values in the column are suitable for the
extraction being indexed. Moreover, we already had multiple
inconsistencies between the path extraction operators and the simple
extraction operators, as well as inconsistencies between the JSON and
JSONB code paths. Adopt a uniform rule of returning NULL rather than
throwing an error when the JSON input does not have a structure that
permits the request to be satisfied.
Back-patch to 9.4. Update the release notes to list this as a behavior
change since 9.3.
As 'ALTER TABLESPACE .. MOVE ALL' really didn't change the tablespace
but instead changed objects inside tablespaces, it made sense to
rework the syntax and supporting functions to operate under the
'ALTER (TABLE|INDEX|MATERIALIZED VIEW)' syntax and to be in
tablecmds.c.
Pointed out by Alvaro, who also suggested the new syntax.
Back-patch to 9.4.
jsonb's #> operator segfaulted (dereferencing a null pointer) if the RHS
was a zero-length array, as reported in bug #11207 from Justin Van Winkle.
json's #> operator returns NULL in such cases, so for the moment let's
make jsonb act likewise.
Also add a bunch of regression test queries memorializing the -> and #>
operators' behavior for this and other corner cases.
There is a good argument for changing some of these behaviors, as they
are not very consistent with each other, and throwing an error isn't
necessarily a desirable behavior for operators that are likely to be
used in indexes. However, everybody can agree that a core dump is the
Wrong Thing, and we need test cases even if we decide to change their
expected output later.
While the space is optional, it seems nicer to be consistent with what
you get if you do "SET search_path=...". SET always normalizes the
separator to be comma+space.
Christoph Martin
This reverts commit 083d29c65b.
The commit changed the code so that it causes an errors when
IDENTIFY_SYSTEM returns three columns. But which prevents us
from using the replication-related utilities against the server
with older version. This is not what we want. For that
compatibility, we allow the utilities to receive three columns
as the result of IDENTIFY_SYSTEM eventhough it actually returns
four columns in 9.4 or later.
Pointed out by Andres Freund.
5a991ef869 added new column into
the result of IDENTIFY_SYSTEM command. But it was not reflected into
several codes checking that result. Specifically though the number of
columns in the result was increased to 4, it was still compared with 3
in some replication codes.
Back-patch to 9.4 where the number of columns in IDENTIFY_SYSTEM
result was increased.
Report from Michael Paquier
In support of this, have the MSVC build follow GNU make in preferring
GNUmakefile over Makefile when a directory contains both.
Michael Paquier, reviewed by MauMau.
strncmp() is a specialized API unsuited for routine copying into
fixed-size buffers. On a system where the length of a single filename
can exceed MAXPGPATH, the pg_archivecleanup change prevents a simple
crash in the subsequent strlen(). Few filesystems support names that
long, and calling pg_archivecleanup with untrusted input is still not a
credible use case. Therefore, no back-patch.
David Rowley
Move the functions within the file so that public interface functions come
first, followed by internal functions. Previously, be_tls_write was first,
then internal stuff, and finally the rest of the public interface, which
clearly didn't make much sense.
Per Andres Freund's complaint.
Commit f30015b6d7 made this happen for
timestamp and timestamptz, but it seems pretty inconsistent to not
do it for simple dates as well.
(In passing, I re-pgindent'd json.c.)
PG_RETURN_BOOL() should only be used in functions following the V1 SQL
function API. This coding accidentally fails to fail since letting the
compiler coerce the Datum representation of bool back to plain bool
does give the right answer; but that doesn't make it a good idea.
Back-patch to older branches just to avoid unnecessary code divergence.
This provides a small but worthwhile speedup when sorting text, at least
in cases to which the sortsupport machinery applies.
Robert Haas and Peter Geoghegan
parseRelOptions() tended to leak memory in the caller's context. Most
of the time this doesn't really matter since the caller's context is
at most query-lifespan, and the function won't be invoked very many times.
However, when testing with CLOBBER_CACHE_RECURSIVELY, the same relcache
entry can get rebuilt a *lot* of times in one query, leading to significant
intraquery memory bloat if it has any reloptions. Noted while
investigating a related report from Tomas Vondra.
In passing, get rid of some Asserts that are redundant with the one
done by deconstruct_array().
As with other patches to avoid leaks in CLOBBER_CACHE testing, it doesn't
really seem worth back-patching this.
When replacing rd_indexlist, rd_indexattr, etc, we neglected to pfree any
old value of these fields. Under ordinary circumstances, the old value
would always be NULL, so this seemed reasonable enough. However, in cases
where we're rebuilding a system catalog's relcache entry and another cache
flush occurs on that same catalog meanwhile, it's possible for the field to
not be NULL when we return to the outer level, because we already refilled
it while recovering from the inner flush. This leads to a fairly small
session-lifespan leak in CacheMemoryContext. In real-world usage the leak
would be too small to notice; but in testing with CLOBBER_CACHE_RECURSIVELY
the leakage can add up to the point of causing OOM failures, as reported by
Tomas Vondra.
The issue has been there a long time, but it only seems worth fixing in
HEAD, like the previous fix in this area (commit 078b2ed291).
When doing logical decoding using START_LOGICAL_REPLICATION in a
walsender process the walsender sometimes was sending out keepalive
messages too frequently. Asking for feedback every time.
WalSndWaitForWal() sends out keepalive messages when it's waiting for
new WAL to be generated locally when it sees that the remote side
hasn't yet flushed WAL up to the local position. That generally is
good but causes problems if the remote side only writes but doesn't
flush changes yet. So check for both remote write and flush position.
Additionally we've asked for feedback to the keepalive message which
isn't warranted when waiting for WAL in contrast to preventing
timeouts because of wal_sender_timeout.
Complaint and patch by Steve Singer.
When both postgresql.conf and postgresql.auto.conf have their own entry of
the same parameter, PostgreSQL uses the entry in postgresql.auto.conf because
it appears last in the configuration scan. IOW, the other entries which appear
earlier are ignored. But, previously, ProcessConfigFile() detected the invalid
settings of even those unused entries and emitted the error messages
complaining about them, at postmaster startup. Complaining about the entries
to ignore is basically useless.
This problem happened because ProcessConfigFile() was called twice at
postmaster startup and the first call read only postgresql.conf. That is, the
first call could check the entry which might be ignored eventually by
the second call which read both postgresql.conf and postgresql.auto.conf.
To work around the problem, this commit changes ProcessConfigFile so that
its first call processes only data_directory and the second one does all the
entries. It's OK to process data_directory in the first call because it's
ensured that data_directory doesn't exist in postgresql.auto.conf.
Back-patch to 9.4 where postgresql.auto.conf was added.
Patch by me. Review by Amit Kapila
This refactoring is in preparation for adding support for other SSL
implementations, with no user-visible effects. There are now two #defines,
USE_OPENSSL which is defined when building with OpenSSL, and USE_SSL which
is defined when building with any SSL implementation. Currently, OpenSSL is
the only implementation so the two #defines go together, but USE_SSL is
supposed to be used for implementation-independent code.
The libpq SSL code is changed to use a custom BIO, which does all the raw
I/O, like we've been doing in the backend for a long time. That makes it
possible to use MSG_NOSIGNAL to block SIGPIPE when using SSL, which avoids
a couple of syscall for each send(). Probably doesn't make much performance
difference in practice - the SSL encryption is expensive enough to mask the
effect - but it was a natural result of this refactoring.
Based on a patch by Martijn van Oosterhout from 2006. Briefly reviewed by
Alvaro Herrera, Andreas Karlsson, Jeff Janes.
There's actually no need for any special case for unknown-type literals,
since we only need to push the value through its output function and
unknownout() works fine. The code that was here was completely bizarre
anyway, and would fail outright in cases that should work, not to mention
suffering from some copy-and-paste bugs.
Fix an obvious typo in json_build_object()'s complaint about invalid
number of arguments, and make the errhint a bit more sensible too.
Per discussion about how to word the improved hint, change the few places
in the documentation that refer to JSON object field names as "names" to
say "keys" instead, since that's what we've said in the vast majority of
places in the docs. Arguably "name" is more correct, since that's the
terminology used in RFC 7159; but we're stuck with "key" in view of the
naming of json_object_keys() so let's at least be self-consistent.
I adjusted a few code comments to match this as well, and failed to
resist the temptation to clean up some odd whitespace choices in the
same area, as well as a useless duplicate PG_ARGISNULL() check. There's
still quite a bit of code that uses the phrase "field name" in non-user-
visible ways, so I left those usages alone.
Such cases are disallowed by the SQL spec, and even if we wanted to allow
them, the semantics seem ambiguous: how should the FK columns be matched up
with the columns of a unique index? (The matching could be significant in
the presence of opclasses with different notions of equality, so this issue
isn't just academic.) However, our code did not previously reject such
cases, but instead would either fail to match to any unique index, or
generate a bizarre opclass-lookup error because of sloppy thinking in the
index-matching code.
David Rowley
Previously, TOAST tables only required in the new cluster could cause
oid conflicts if they were auto-numbered and a later conflicting oid had
to be assigned.
Backpatch through 9.3
This could be useful for datatypes like text, where we might want
to optimize for some collations but not others. However, this patch
doesn't introduce any new sortsupport functions that work this way;
it merely revises the code so that future patches may do so.
Patch by me. Review by Peter Geoghegan.
When more than one setting entries of same parameter exist in the
configuration file, PostgreSQL uses only entry appearing last in
configuration file scan. Since the other entries are not used,
ParseConfigFp() doesn't need to process them, but previously it did
that. This problematic behavior caused the configuration file scan
to detect invalid settings of unused entries (e.g., existence of
multiple entries of PGC_POSTMASTER parameter) and log the messages
complaining about them.
This commit changes the configuration file scan so that it processes
only last entry of each parameter.
Note that when multiple entries of same parameter exist both in
postgresql.conf and postgresql.auto.conf, unused entries in
postgresql.conf are still processed only at postmaster startup.
The problem has existed since old version, but a user is more likely
to encounter it since 9.4 where ALTER SYSTEM command was introduced.
So back-patch to 9.4.
Amit Kapila, slightly modified by me. Per report from Christoph Berg.
log_newpage is used by many indexams, in addition to heap, but for
historical reasons it's always been part of the heapam rmgr. Starting with
9.3, we have another WAL record type for logging an image of a page,
XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to
xlog.c, and switch to using the XLOG_FPI record type.
Bump the WAL version number because the code to replay the old HEAP_NEWPAGE
records is removed.
When autovacuum is nominally off, we will still launch autovac workers
to vacuum tables that are at risk of XID wraparound. But after we'd done
that, an autovac worker would proceed to autovacuum every table in the
targeted database, if they meet the usual thresholds for autovacuuming.
This is at best pretty unexpected; at worst it delays response to the
wraparound threat. Fix it so that if autovacuum is nominally off, we
*only* do forced vacuums and not any other work.
Per gripe from Andrey Zhidenkov. This has been like this all along,
so back-patch to all supported branches.
InitProcess() relies on IsBackgroundWorker to decide whether the PGPROC
for a new backend should be taken from ProcGlobal's freeProcs or from
bgworkerFreeProcs. In EXEC_BACKEND builds, InitProcess() is called
sooner than in non-EXEC_BACKEND builds, and IsBackgroundWorker wasn't
getting initialized soon enough.
Report by Noah Misch. Diagnosis and fix by me.
Commit 0ac5ad5134 removed an optimization in multixact.c that skipped
fetching members of MultiXactId that were older than our
OldestVisibleMXactId value. The reason this was removed is that it is
possible for multixacts that contain updates to be older than that
value. However, if the caller is certain that the multi does not
contain an update (because the infomask bits say so), it can pass this
info down to GetMultiXactIdMembers, enabling it to use the old
optimization.
Pointed out by Andres Freund in 20131121200517.GM7240@alap2.anarazel.de
Testing for abortedness of a multixact member that's being frozen is
unnecessary: we only need to know whether the transaction is still in
progress or committed to determine whether it must be kept or not. This
let us simplify the code a bit and avoid a useless TransactionIdDidAbort
test.
Suggested by Andres Freund awhile back.
There were several oversights in recovery code where COMMIT/ABORT PREPARED
records were ignored:
* pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits)
* recovery_min_apply_delay (2PC commits were applied immediately)
* recovery_target_xid (recovery would not stop if the XID used 2PC)
The first of those was reported by Sergiy Zuban in bug #11032, analyzed by
Tom Lane and Andres Freund. The bug was always there, but was masked before
commit d19bd29f07, because COMMIT PREPARED
always created an extra regular transaction that was WAL-logged.
Backpatch to all supported versions (older versions didn't have all the
features and therefore didn't have all of the above bugs).
This return code is possible wherever we pass bAlertable = TRUE; it
arises when Windows caused the current thread to run an "I/O completion
routine" or an "asynchronous procedure call". PostgreSQL does not
provoke either of those Windows facilities, hence this bug remaining
largely unnoticed, but other local code might do so. Due to a shortage
of complaints, no back-patch for now.
Per report from Shiv Shivaraju Gowda, this bug can cause
PGSemaphoreLock() to PANIC. The bug can also cause select() to report
timeout expiration too early, which might confuse pgstat_init() and
CheckRADIUSAuth().
shm_mq_send_bytes didn't invariably initialize *bytes_written before
returning, which would cause shm_mq_send to read from uninitialized
memory and add the value it found there to mqh->mqh_partial_bytes.
This could cause the next attempt to send a message via the queue to
fail an assertion (if the queue was detached) or copy data from a
garbage pointer value into the queue (if non-blocking mode was in use).
Nothing in the checkpointer calls InitXLOGAccess(), so WALInsertLocks
never got initialized there. Without EXEC_BACKEND, it works anyway
because the correct value is inherited from the postmaster, but
with EXEC_BACKEND we've got a problem. The problem appears to have
been introduced by commit 68a2e52bba.
To fix, move the relevant initialization steps from InitXLOGAccess()
to XLOGShmemInit(), making this more parallel to what we do
elsewhere.
Amit Kapila
Ephemeral slots - slots that shouldn't survive database restarts -
weren't properly cleaned up after a immediate/crash restart. They were
ignored in the sense that they weren't restored into memory and thus
didn't cause unwanted resource retention; but they prevented a new
slot with the same name from being created.
Now ephemeral slots are fully removed during startup.
Backpatch to 9.4 where replication slots where added.
The executor has thrown errors for negative OFFSET values since 8.4 (see
commit bfce56eea4), but in a moment of brain
fade I taught the planner that OFFSET with a constant negative value was a
no-op (commit 1a1832eb08). Reinstate the
former behavior by only discarding OFFSET with a value of exactly 0. In
passing, adjust a planner comment that referenced the ancient behavior.
Back-patch to 9.3 where the mistake was introduced.
get_raw_page tried to validate the supplied block number against
RelationGetNumberOfBlocks(), which of course is only right when
accessing the main fork. In most cases, the main fork is longer
than the others, so that the check was too weak (allowing a
lower-level error to be reported, but no real harm to be done).
However, very small tables could have an FSM larger than their heap,
in which case the mistake prevented access to some FSM pages.
Per report from Torsten Foertsch.
In passing, make the bad-block-number error into an ereport not elog
(since it's certainly not an internal error); and fix sloppily
maintained comment for RelationGetNumberOfBlocksInFork.
This has been wrong since we invented relation forks, so back-patch
to all supported branches.
In commit 631dc390f4, we started to handle
simple numeric timezone offsets via the zic library instead of the old
CTimeZone/HasCTZSet kluge. However, we overlooked the fact that the zic
code will reject UTC offsets exceeding a week (which seems a bit arbitrary,
but not because it's too tight ...). This led to possibly setting
session_timezone to NULL, which results in crashes in most timezone-related
operations as of 9.4, and crashes in a small number of places even before
that. So check for NULL return from pg_tzset_offset() and report an
appropriate error message. Per bug #11014 from Duncan Gillis.
Back-patch to all supported branches, like the previous patch.
(Unfortunately, as of today that no longer includes 8.4.)
In commit a61daa14d5, we fixed pg_upgrade so
that it would install sane relminmxid and datminmxid values, but that does
not cure the problem for installations that were already pg_upgraded to
9.3; they'll initially have "1" in those fields. This is not a big problem
so long as 1 is "in the past" compared to the current nextMultiXact
counter. But if an installation were more than halfway to the MXID wrap
point at the time of upgrade, 1 would appear to be "in the future" and
that would effectively disable tracking of oldest MXIDs in those
tables/databases, until such time as the counter wrapped around.
While in itself this isn't worse than the situation pre-9.3, where we did
not manage MXID wraparound risk at all, the consequences of premature
truncation of pg_multixact are worse now; so we ought to make some effort
to cope with this. We discussed advising users to fix the tracking values
manually, but that seems both very tedious and very error-prone.
Instead, this patch adopts two amelioration rules. First, a relminmxid
value that is "in the future" is allowed to be overwritten with a
full-table VACUUM's actual freeze cutoff, ignoring the normal rule that
relminmxid should never go backwards. (This essentially assumes that we
have enough defenses in place that wraparound can never occur anymore,
and thus that a value "in the future" must be corrupt.) Second, if we see
any "in the future" values then we refrain from truncating pg_clog and
pg_multixact. This prevents loss of clog data until we have cleaned up
all the broken tracking data. In the worst case that could result in
considerable clog bloat, but in practice we expect that relfrozenxid-driven
freezing will happen soon enough to fix the problem before clog bloat
becomes intolerable. (Users could do manual VACUUM FREEZEs if not.)
Note that this mechanism cannot save us if there are already-wrapped or
already-truncated-away MXIDs in the table; it's only capable of dealing
with corrupt tracking values. But that's the situation we have with the
pg_upgrade bug.
For consistency, apply the same rules to relfrozenxid/datfrozenxid. There
are not known mechanisms for these to get messed up, but if they were, the
same tactics seem appropriate for fixing them.
When a view has a function-returning-composite in FROM, and there are
some dropped columns in the underlying composite type, ruleutils.c
printed junk in the column alias list for the reconstructed FROM entry.
Before 9.3, this was prevented by doing get_rte_attribute_is_dropped
tests while printing the column alias list; but that solution is not
currently available to us for reasons I'll explain below. Instead,
check for empty-string entries in the alias list, which can only exist
if that column position had been dropped at the time the view was made.
(The parser fills in empty strings to preserve the invariant that the
aliases correspond to physical column positions.)
While this is sufficient to handle the case of columns dropped before
the view was made, we have still got issues with columns dropped after
the view was made. In particular, the view could contain Vars that
explicitly reference such columns! The dependency machinery really
ought to refuse the column drop attempt in such cases, as it would do
when trying to drop a table column that's explicitly referenced in
views. However, we currently neglect to store dependencies on columns
of composite types, and fixing that is likely to be too big to be
back-patchable (not to mention that existing views in existing databases
would not have the needed pg_depend entries anyway). So I'll leave that
for a separate patch.
Pre-9.3, ruleutils would print such Vars normally (with their original
column names) even though it suppressed their entries in the RTE's
column alias list. This is certainly bogus, since the printed view
definition would fail to reload, but at least it didn't crash. However,
as of 9.3 the printed column alias list is tightly tied to the names
printed for Vars; so we can't treat columns as dropped for one purpose
and not dropped for the other. This is why we can't just put back the
get_rte_attribute_is_dropped test: it results in an assertion failure
if the view in fact contains any Vars referencing the dropped column.
Once we've got dependencies preventing such cases, we'll probably want
to do it that way instead of relying on the empty-string test used here.
This fix turned up a very ancient bug in outfuncs/readfuncs, namely
that T_String nodes containing empty strings were not dumped/reloaded
correctly: the node was printed as "<>" which is read as a string
value of <>. Since (per SQL) we disallow empty-string identifiers,
such nodes don't occur normally, which is why we'd not noticed.
(Such nodes aren't used for literal constants, just identifiers.)
Per report from Marc Schablewski. Back-patch to 9.3 which is where
the rule printing behavior changed. The dangling-variable case is
broken all the way back, but that's not what his complaint is about.
If pg_regcomp failed after having invoked markst/cleanst, it would leak any
"struct subre" nodes it had created. (We've already detected all regex
syntax errors at that point, so the only likely causes of later failure
would be query cancel or out-of-memory.) To fix, make sure freesrnode
knows the difference between the pre-cleanst and post-cleanst cleanup
procedures. Add some documentation of this less-than-obvious point.
Also, newlacon did the wrong thing with an out-of-memory failure from
realloc(), so that the previously allocated array would be leaked.
Both of these are pretty low-probability scenarios, but a bug is a bug,
so patch all the way back.
Per bug #10976 from Arthur O'Dwyer.
pg_ctl will log to the Windows event log when it is running as a service,
which is the primary way of running PostgreSQL on Windows. This option
makes it possible to specify which event source to use for this, in order
to separate different instances. The server logging itself is still controlled
by the regular logging parameters, including a separate setting for the event
source. The parameter to pg_ctl only controlls the logging from pg_ctl itself.
MauMau, review in many iterations by Amit Kapila and me.
The consistent function contained several bugs:
* The "if (which2) { ... }" block was broken. It compared the argument's
lower bound against centroid's upper bound, while it was supposed to compare
the argument's upper bound against the centroid's lower bound (the comment
was correct, code was wrong). Also, it cleared bits in the "which1"
variable, while it was supposed to clear bits in "which2".
* If the argument's upper bound was equal to the centroid's lower bound, we
descended to both halves (= all quadrants). That's unnecessary, searching
the right quadrants is sufficient. This didn't lead to incorrect query
results, but was clearly wrong, and slowed down queries unnecessarily.
* In the case that argument's lower bound is adjacent to the centroid's
upper bound, we also don't need to visit all quadrants. Per similar
reasoning as previous point.
* The code where we compare the previous centroid with the current centroid
should match the code where we compare the current centroid with the
argument. The point of that code is to redo the calculation done in the
previous level, to see if we were supposed to traverse left or right (or up
or down), and if we actually did. If we moved in the different direction,
then we know there are no matches for bound.
Refactor the code and adds comments to make it more readable and easier to
reason about.
Backpatch to 9.3 where SP-GiST support for range types was introduced.
We can remove a left join to a relation if the relation's output is
provably distinct for the columns involved in the join clause (considering
only equijoin clauses) and the relation supplies no variables needed above
the join. Previously, the join removal logic could only prove distinctness
by reference to unique indexes of a table. This patch extends the logic
to consider subquery relations, wherein distinctness might be proven by
reference to GROUP BY, DISTINCT, etc.
We actually already had some code to check that a subquery's output was
provably distinct, but it was hidden inside pathnode.c; which was a pretty
bad place for it really, since that file is mostly boilerplate Path
construction and comparison. Move that code to analyzejoins.c, which is
arguably a more appropriate location, and is certainly the site of the
new usage for it.
David Rowley, reviewed by Simon Riggs
Trying to reassign objects owned by a user that had text search
dictionaries or configurations used to fail with:
ERROR: unexpected classid 3600
or
ERROR: unexpected classid 3602
Fix by adding cases for those object types in a switch in pg_shdepend.c.
Both REASSIGN OWNED and text search objects go back all the way to 8.1,
so backpatch to all supported branches. In 9.3 the alter-owner code was
made generic, so the required change in recent branches is pretty
simple; however, for 9.2 and older ones we need some additional
reshuffling to enable specifying objects by OID rather than name.
Text search templates and parsers are not owned objects, so there's no
change required for them.
Per bug #9749 reported by Michal Novotný
Both the psql banner and the connection logging already included
SSL status, cipher and bitlength, this adds the information about
compression being on or off.
Prominent binaries already had this metadata. A handful of minor
binaries, such as pg_regress.exe, still lack it; efforts to eliminate
such exceptions are welcome.
Michael Paquier, reviewed by MauMau.
EXPLAIN ANALYZE shows the information of the numbers of exact/lossy blocks which
bitmap heap scan processes. But, previously, when those numbers were both zero,
it displayed only the prefix "Heap Blocks:" in TEXT output format. This is strange
and would confuse the users. So this commit suppresses such unnecessary information.
Backpatch to 9.4 where EXPLAIN ANALYZE was changed so that such information was
displayed.
Etsuro Fujita
Commit 1b86c81d2d fixed the decoding of toasted columns for the rows
contained in one xl_heap_multi_insert record. But that's not actually
enough, because heap_multi_insert() will actually first toast all
passed in rows and then emit several *_multi_insert records; one for
each page it fills with tuples.
Add a XLOG_HEAP_LAST_MULTI_INSERT flag which is set in
xl_heap_multi_insert->flag denoting that this multi_insert record is
the last emitted by one heap_multi_insert() call. Then use that flag
in decode.c to only set clear_toast_afterwards in the right situation.
Expand the number of rows inserted via COPY in the corresponding
regression test to make sure that more than one heap page is filled
with tuples by one heap_multi_insert() call.
Backpatch to 9.4 like the previous commit.
ExecEvalWholeRowVar incorrectly supposed that it could "bless" the source
TupleTableSlot just once per query. But if the input is coming from an
Append (or, perhaps, other cases?) more than one slot might be returned
over the query run. This led to "record type has not been registered"
errors when a composite datum was extracted from a non-blessed slot.
This bug has been there a long time; I guess it escaped notice because when
dealing with subqueries the planner tends to expand whole-row Vars into
RowExprs, which don't have the same problem. It is possible to trigger
the problem in all active branches, though, as illustrated by the added
regression test.
This command provides an automated way to create foreign table definitions
that match remote tables, thereby reducing tedium and chances for error.
In this patch, we provide the necessary core-server infrastructure and
implement the feature fully in the postgres_fdw foreign-data wrapper.
Other wrappers will throw a "feature not supported" error until/unless
they are updated.
Ronan Dunklau and Michael Paquier, additional work by me
While the x output of "select x from t group by x" can be presumed unique,
this does not hold for "select x, generate_series(1,10) from t group by x",
because we may expand the set-returning function after the grouping step.
(Perhaps that should be re-thought; but considering all the other oddities
involved with SRFs in targetlists, it seems unlikely we'll change it.)
Put a check in query_is_distinct_for() so it's not fooled by such cases.
Back-patch to all supported branches.
David Rowley
When decoding the results of a HEAP2_MULTI_INSERT (currently only
generated by COPY FROM) toast columns for all but the last tuple
weren't replaced by their actual contents before being handed to the
output plugin. The reassembled toast datums where disregarded after
every REORDER_BUFFER_CHANGE_(INSERT|UPDATE|DELETE) which is correct
for plain inserts, updates, deletes, but not multi inserts - there we
generate several REORDER_BUFFER_CHANGE_INSERTs for a single
xl_heap_multi_insert record.
To solve the problem add a clear_toast_afterwards boolean to
ReorderBufferChange's union member that's used by modifications. All
row changes but multi_inserts always set that to true, but
multi_insert sets it only for the last change generated.
Add a regression test covering decoding of multi_inserts - there was
none at all before.
Backpatch to 9.4 where logical decoding was introduced.
Bug found by Petr Jelinek.
The isxdigit() calls relied on undefined behavior. The isascii() call
was well-defined, but our prevailing style is to include the cast.
Back-patch to 9.4, where the isxdigit() calls were introduced.
Although nodeAgg.c currently uses the same per-group memory context for
all groups of a query, that might change in future. Avoid assuming it.
This costs us an extra AggCheckCallContext() call per group, but that's
pretty cheap and is probably good from a safety standpoint anyway.
Back-patch to 9.4 in case any third-party code copies this logic.
Andrew Gierth
The previous design exposed the input and output ExprContexts of the
Agg plan node, but work on grouping sets has suggested that we'll regret
doing that. Instead provide more narrowly-defined APIs that can be
implemented in multiple ways, namely a way to get a short-term memory
context and a way to register an aggregate shutdown callback.
Back-patch to 9.4 where the bad APIs were introduced, since we don't
want third-party code using these APIs and then having to change in 9.5.
Andrew Gierth
If a connection committed or rolled back any transactions within a
PGSTAT_STAT_INTERVAL pacing interval without accessing any tables,
the reporting of those statistics would be held up until the
connection closed or until it ended a PGSTAT_STAT_INTERVAL interval
in which it had accessed a table. This could result in under-
reporting of transactions for an extended period, followed by a
spike in reported transactions.
While this is arguably a bug, the impact is minimal, primarily
affecting, and being affected by, monitoring software. It might
cause more confusion than benefit to change the existing behavior
in released stable branches, so apply only to master and the 9.4
beta.
Gurjeet Singh, with review and editing by Kevin Grittner,
incorporating suggested changes from Abhijit Menon-Sen and Tom
Lane.
The old name wasn't very descriptive as of actual contents of the
directory, which are historical snapshots in the snapshots/
subdirectory and mappingdata for rewritten tuples in
mappings/. There's been a fair amount of discussion what would be a
good name. I'm settling for pg_logical because it's likely that
further data around logical decoding and replication will need saving
in the future.
Also add the missing entry for the directory into storage.sgml's list
of PGDATA contents.
Bumps catversion as the data directories won't be compatible.
This function wasn't originally thought to be really user-facing,
because converting a table to a view isn't something we expect people
to do manually. So not all that much effort was spent on the error
messages; in particular, while the code will complain that you got
the column types wrong it won't say exactly what they are. But since
we repurposed the code to also check compatibility of rule RETURNING
lists, it's definitely user-facing. It now seems worthwhile to add
errdetail messages showing exactly what the conflict is when there's
a mismatch of column names or types. This is prompted by bug #10836
from Matthias Raffelsieper, which might have been forestalled if the
error message had reported the wrong column type as being "record".
Back-patch to 9.4, but not into older branches where the set of
translatable error strings is supposed to be stable.
Historically these database properties could be manipulated only by
manually updating pg_database, which is error-prone and only possible for
superusers. But there seems no good reason not to allow database owners to
set them for their databases, so invent CREATE/ALTER DATABASE options to do
that. Adjust a couple of places that were doing it the hard way to use the
commands instead.
Vik Fearing, reviewed by Pavel Stehule
Most of the existing option names are keywords anyway, but we can get rid
of LC_COLLATE and LC_CTYPE as keywords known to the lexer/grammar. This
immediately reduces the size of the grammar tables by about 8KB, and will
save more when we add additional CREATE/ALTER DATABASE options in future.
A side effect of the implementation is that the CONNECTION LIMIT option
can now also be spelled CONNECTION_LIMIT. We choose not to document this,
however.
Vik Fearing, based on a suggestion by me; reviewed by Pavel Stehule
The previous code, perhaps out of concern for avoid memory leaks, formed
the tuple in one memory context and then copied it to another memory
context. However, this doesn't appear to be necessary, since
index_form_tuple and the functions it calls take precautions against
leaking memory. In my testing, building the tuple directly inside the
sort context shaves several percent off the index build time.
Rearrange things so we do that.
Patch by me. Review by Amit Kapila, Tom Lane, Andres Freund.
When reading large amounts of preexisting WAL during logical decoding
using the SQL interface we possibly could fail to check interrupts in
due time. Similarly the same could happen on systems with a very high
WAL volume while creating a new logical replication slot, independent
of the used interface.
Previously these checks where only performed in xlogreader's read_page
callbacks, while waiting for new WAL to be produced. That's not
sufficient though, if there's never a need to wait. Walsender's send
loop already contains a interrupt check.
Backpatch to 9.4 where the logical decoding feature was introduced.
The assertion failed if WAL_DEBUG or LWLOCK_STATS was enabled; fix that by
using separate memory contexts for the allocations made within those code
blocks.
This patch introduces a mechanism for marking any memory context as allowed
in a critical section. Previously ErrorContext was exempt as a special case.
Instead of a blanket exception of the checkpointer process, only exempt the
memory context used for the pending ops hash table.
The "false" case was really quite useless since all it did was to throw
an error; a definition not helped in the least by making it the default.
Instead let's just have the "true" case, which emits nested objects and
arrays in JSON syntax. We might later want to provide the ability to
emit sub-objects in Postgres record or array syntax, but we'd be best off
to drive that off a check of the target field datatype, not a separate
argument.
For the functions newly added in 9.4, we can just remove the flag arguments
outright. We can't do that for json_populate_record[set], which already
existed in 9.3, but we can ignore the argument and always behave as if it
were "true". It helps that the flag arguments were optional and not
documented in any useful fashion anyway.
When running several postgres clusters on one OS instance it's often
inconveniently hard to identify which "postgres" process belongs to
which postgres instance.
Add the cluster_name GUC, whose value will be included as part of the
process titles if set. With that processes can more easily identified
using tools like 'ps'.
To avoid problems with encoding mismatches between postgresql.conf,
consoles, and individual databases replace non-ASCII chars in the name
with question marks. The length is limited to NAMEDATALEN to make it
less likely to truncate important information at the end of the
status.
Thomas Munro, with some adjustments by me and review by a host of people.
Support for running postgres on Alpha hasn't been tested for a long
while. Due to Alpha's uniquely lax cache coherency model it's a hard
to develop for platform (especially blindly!) and thought to be
unlikely to currently work correctly.
As Alpha is the only supported architecture for Tru64 drop support for
it as well. Tru64's support has ended 2012 and it has been in
maintenance-only mode for much longer.
Also remove stray references to __ksr__ and ultrix defines.
We can allow this even without any specific knowledge of the semantics
of the window function, so long as pushed-down quals will either accept
every row in a given window partition, or reject every such row. Because
window functions act only within a partition, such a case can't result
in changing the window functions' outputs for any surviving row.
Eliminating entire partitions in this way obviously can reduce the cost
of the window-function computations substantially.
The fly in the ointment is that it's hard to be entirely sure whether
this is true for an arbitrary qual condition. This patch allows pushdown
if (a) the qual references only partitioning columns, and (b) the qual
contains no volatile functions. We are at risk of incorrect results if
the qual can produce different answers for values that the partitioning
equality operator sees as equal. While it's not hard to invent cases
for which that can happen, it seems to seldom be a problem in practice,
since no one has complained about a similar assumption that we've had
for many years with respect to DISTINCT. The potential performance
gains seem to be worth the risk.
David Rowley, reviewed by Vik Fearing; some credit is due also to
Thomas Mayer who did considerable preliminary investigation.
Instead of truncating pg_multixact at vacuum time, do it only at
checkpoint time. The reason for doing it this way is twofold: first, we
want it to delete only segments that we're certain will not be required
if there's a crash immediately after the removal; and second, we want to
do it relatively often so that older files are not left behind if
there's an untimely crash.
Per my proposal in
http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org
we now execute the truncation in the checkpointer process rather than as
part of vacuum. Vacuum is in only charge of maintaining in shared
memory the value to which it's possible to truncate the files; that
value is stored as part of checkpoints also, and so upon recovery we can
reuse the same value to re-execute truncate and reset the
oldest-value-still-safe-to-use to one known to remain after truncation.
Per bug reported by Jeff Janes in the course of his tests involving
bug #8673.
While at it, update some comments that hadn't been updated since
multixacts were changed.
Backpatch to 9.3, where persistency of pg_multixact files was
introduced by commit 0ac5ad5134.
We were allowing a table's pg_class.relminmxid value to move backwards
when heaps were swapped by VACUUM FULL or CLUSTER. There is a
similar protection against relfrozenxid going backwards, which we
neglected to clone when the multixact stuff was rejiggered by commit
0ac5ad5134.
Backpatch to 9.3, where relminmxid was introduced.
As reported by Heikki in
http://www.postgresql.org/message-id/52401AEA.9000608@vmware.com
Don't assert MultiXactIdIsRunning if the multi came from a tuple that
had been share-locked and later copied over to the new cluster by
pg_upgrade. Doing that causes an error to be raised unnecessarily:
MultiXactIdIsRunning is not open to the possibility that its argument
came from a pg_upgraded tuple, and all its other callers are already
checking; but such multis cannot, obviously, have transactions still
running, so the assert is pointless.
Noticed while investigating the bogus pg_multixact/offsets/0000 file
left over by pg_upgrade, as reported by Andres Freund in
http://www.postgresql.org/message-id/20140530121631.GE25431@alap3.anarazel.de
Backpatch to 9.3, as the commit that introduced the buglet.
A WHERE clause applied to the output of a subquery with DISTINCT should
theoretically be applied only once per distinct row; but if we push it
into the subquery then it will be evaluated at each row before duplicate
elimination occurs. If the qual is volatile this can give rise to
observably wrong results, so don't do that.
While at it, refactor a little bit to allow subquery_is_pushdown_safe
to report more than one kind of restrictive condition without indefinitely
expanding its argument list.
Although this is a bug fix, it seems unwise to back-patch it into released
branches, since it might de-optimize plans for queries that aren't giving
any trouble in practice. So apply to 9.4 but not further back.
I noticed that the functions in jsonfuncs.c sometimes printed error
messages that claimed I'd called some other function. Investigation showed
that this was from repurposing code into "worker" functions without taking
much care as to whether it would mention the right SQL-level function if it
threw an error. Moreover, there was a weird mismash of messages that
contained a fixed function name, messages that used %s for a function name,
and messages that constructed a function name out of spare parts, like
"json%s_populate_record" (which, quite aside from being ugly as sin, wasn't
even sufficient to cover all the cases). This would put an undue burden on
our long-suffering translators. Standardize on inserting the SQL function
name with %s so as to reduce the number of translatable strings, and pass
function names around as needed to make sure we can report the right one.
Fix up some gratuitous variations in wording, too.
Re-pgindent, remove a lot of random vertical whitespace, remove useless
(if not counterproductive) inline markings, get rid of unnecessary
zero-padding of strings for hashtable searches. No functional changes.
populate_recordset_object_start() improperly created a new hash table
(overwriting the link to the existing one) if called at nest levels
greater than one. This resulted in previous fields not appearing in
the final output, as reported by Matti Hameister in bug #10728.
In 9.4 the problem also affects json_to_recordset.
This perhaps missed detection earlier because the default behavior is to
throw an error for nested objects: you have to pass use_json_as_text = true
to see the problem.
In addition, fix query-lifespan leakage of the hashtable created by
json_populate_record(). This is pretty much the same problem recently
fixed in dblink: creating an intended-to-be-temporary context underneath
the executor's per-tuple context isn't enough to make it go away at the
end of the tuple cycle, because MemoryContextReset is not
MemoryContextResetAndDeleteChildren.
Michael Paquier and Tom Lane
The syntax doesn't let you specify "WITH OIDS" for foreign tables, but it
was still possible with default_with_oids=true. But the rest of the system,
including pg_dump, isn't prepared to handle foreign tables with OIDs
properly.
Backpatch down to 9.1, where foreign tables were introduced. It's possible
that there are databases out there that already have foreign tables with
OIDs. There isn't much we can do about that, but at least we can prevent
them from being created in the future.
Patch by Etsuro Fujita, reviewed by Hadi Moshayedi.
Normally, this won't matter too much; but if I/O is really slow, for
example because the system is overloaded, we might write many pages
before checking for interrupts. A single toast insertion might
write up to 1GB of data, and a multi-insert could write hundreds
of tuples (and their corresponding TOAST data).
The catcache code is effectively assuming this already, so let's insist
that the catalog and index are actually declared that way.
Having done that, the comments in indexing.h about non-unique indexes
not being used for catcaches are completely redundant not just mostly so;
and we didn't have such a comment for every such index anyway. So let's
get rid of them.
Per discussion of whether we should identify primary keys for catalogs.
We might or might not take that further step, but this change in itself
will allow quicker detection of misdeclared catcaches, so it seems worth
doing in any case.
Since fdf9e21196 lazy_vacuum_page() rechecks the all-visible status
of pages in the second pass over the heap. It does so inside a
critical section, but both visibilitymap_test() and
heap_page_is_all_visible() perform operations that should not happen
inside one. The former potentially performs IO and both potentially do
memory allocations.
To fix, simply move all the all-visible handling outside the critical
section. Doing so means that the PD_ALL_VISIBLE on the page won't be
included in the full page image of the HEAP2_CLEAN record anymore. But
that's fine, the flag will be set by the HEAP2_VISIBLE logged later.
Backpatch to 9.3 where the problem was introduced. The bug only came
to light due to the assertion added in 4a170ee9 and isn't likely to
cause problems in production scenarios. The worst outcome is a
avoidable PANIC restart.
This also gets rid of the difference in the order of operations
between master and standby mentioned in 2a8e1ac5.
Per reports from David Leverton and Keith Fiske in bug #10533.
The existance of the assert_enabled variable (backing the
debug_assertions GUC) reduced the amount of knowledge some static code
checkers (like coverity and various compilers) could infer from the
existance of the assertion. That could have been solved by optionally
removing the assertion_enabled variable from the Assert() et al macros
at compile time when some special macro is defined, but the resulting
complication doesn't seem to be worth the gain from having
debug_assertions. Recompiling is fast enough.
The debug_assertions GUC is still available, but readonly, as it's
useful when diagnosing problems. The commandline/client startup option
-A, which previously also allowed to enable/disable assertions, has
been removed as it doesn't serve a purpose anymore.
While at it, reduce code duplication in bufmgr.c and localbuf.c
assertions checking for spurious buffer pins. That code had to be
reindented anyway to cope with the assert_enabled removal.
ExecMakeTableFunctionResult evaluated the arguments for a function-in-FROM
in the query-lifespan memory context. This is insignificant in simple
cases where the function relation is scanned only once; but if the function
is in a sub-SELECT or is on the inside of a nested loop, any memory
consumed during argument evaluation can add up quickly. (The potential for
trouble here had been foreseen long ago, per existing comments; but we'd
not previously seen a complaint from the field about it.) To fix, create
an additional temporary context just for this purpose.
Per an example from MauMau. Back-patch to all active branches.
data_directory could be set both in postgresql.conf and postgresql.auto.conf so far.
This could cause some problematic situations like circular definition. To avoid such
situations, this commit forbids a user to set data_directory in postgresql.auto.conf.
Backpatch this to 9.4 where ALTER SYSTEM command was introduced.
Amit Kapila, reviewed by Abhijit Menon-Sen, with minor adjustments by me.
Arrange for postmaster child processes to respond to two environment
variables, PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE, to determine whether
they reset their OOM score adjustments and if so to what. This is superior
to the previous design involving #ifdef's in several ways. The behavior is
now available in a default build, and both ends of the adjustment --- the
original adjustment of the postmaster's level and the subsequent
readjustment by child processes --- can now be controlled in one place,
namely the postmaster launch script. So it's no longer necessary for the
launch script to act on faith that the server was compiled with the
appropriate options. In addition, if someone wants to use an OOM score
other than zero for the child processes, that doesn't take a recompile
anymore; and we no longer have to cater separately to the two different
historical kernel APIs for this adjustment.
Gurjeet Singh, somewhat revised by me
This SQL-standard feature allows a sub-SELECT yielding multiple columns
(but only one row) to be used to compute the new values of several columns
to be updated. While the same results can be had with an independent
sub-SELECT per column, such a workaround can require a great deal of
duplicated computation.
The standard actually says that the source for a multi-column assignment
could be any row-valued expression. The implementation used here is
tightly tied to our existing sub-SELECT support and can't handle other
cases; the Bison grammar would have some issues with them too. However,
I don't feel too bad about this since other cases can be converted into
sub-SELECTs. For instance, "SET (a,b,c) = row_valued_function(x)" could
be written "SET (a,b,c) = (SELECT * FROM row_valued_function(x))".
Since most of the system thinks AND and OR are N-argument expressions
anyway, let's have the grammar generate a representation of that form when
dealing with input like "x AND y AND z AND ...", rather than generating
a deeply-nested binary tree that just has to be flattened later by the
planner. This avoids stack overflow in parse analysis when dealing with
queries having more than a few thousand such clauses; and in any case it
removes some rather unsightly inconsistencies, since some parts of parse
analysis were generating N-argument ANDs/ORs already.
It's still possible to get a stack overflow with weirdly parenthesized
input, such as "x AND (y AND (z AND ( ... )))", but such cases are not
mainstream usage. The maximum depth of parenthesization is already
limited by Bison's stack in such cases, anyway, so that the limit is
probably fairly platform-independent.
Patch originally by Gurjeet Singh, heavily revised by me
We have for a long time been able to prove implications and refutations
between clauses structured like "expr op const" with the same subexpression
and btree-related operators; for example that "x < 4" implies "x <= 5".
The implication machinery is needed to detect usability of partial indexes,
and the refutation machinery is needed to implement constraint exclusion.
This patch extends that machinery to make proofs for operator expressions
involving the same two immutable-but-not-necessarily-just-Const input
expressions, ie does "expr1 op1 expr2" prove or refute "expr1 op2 expr2" or
"expr2 op2 expr1"? An important example is that we can now prove "x = y"
given "y = x", which formerly the code could not deduce unless x or y was a
constant. We can make use of the system's knowledge of operator commutator
and negator pairs, and can also make use of btree opclass relationships,
for example "x < y" implies "x <= y" and refutes "x > y" (notice that
neither of these could be proven just from commutator or negator links).
Inspired by a gripe from Brian Dunavant. This seems more like a new
feature than a bug fix, though, so no back-patch.
We should report the errno when we get a failure from functions like
BufFileWrite. "ERROR: write failed" is unreasonably taciturn for a
case that's well within the realm of possibility; I've seen it a
couple times in the buildfarm recently, in situations that were
probably out-of-disk-space, but it'd be good to see the errno
to confirm it.
I think this code was originally written without assuming that
the buffile.c functions would return useful errno; but most other
callers *are* assuming that, and a quick look at the buffile code
gives no reason to suppose otherwise.
Also, a couple of the old messages were phrased on the assumption
that a short read might indicate a logic bug in tuplestore itself;
but that code's pretty well tested by now, so a filesystem-level
problem seems much more likely.
The previous naming broke the query that libpq's lo_initialize() uses
to collect the OIDs of the server-side functions it requires, because
that query effectively assumes that there is only one function named
lo_create in the pg_catalog schema (and likewise only one lo_open, etc).
While we should certainly make libpq more robust about this, the naive
query will remain in use in the field for the foreseeable future, so it
seems the only workable choice is to use a different name for the new
function. lo_from_bytea() won a small straw poll.
Back-patch into 9.4 where the new function was introduced.
If a sub-select-in-FROM gets flattened into the upper query, then we
naturally get rid of any output columns that are defined in the sub-select
text but not actually used in the upper query. However, this doesn't
happen when it's not possible to flatten the subquery, for example because
it contains GROUP BY, LIMIT, etc. Allowing the subquery to compute useless
output columns is often fairly harmless, but sometimes it has significant
performance cost: the unused output might be an expensive expression,
or it might be a Var from a relation that we could remove entirely (via
the join-removal logic) if only we realized that we didn't really need
that Var. Situations like this are common when expanding views, so it
seems worth taking the trouble to detect and remove unused outputs.
Because the upper query's Var numbering for subquery references depends on
positions in the subquery targetlist, we don't want to renumber the items
we leave behind. Instead, we can implement "removal" by replacing the
unwanted expressions with simple NULL constants. This wastes a few cycles
at runtime, but not enough to justify more work in the planner.
Change the order of checks in similar functions to be the same; remove
a parameter that's not needed anymore; rename a memory context and
expand a couple of comments.
Per review comments from Amit Kapila
When we grabbed this file off the Snowball project's website, we mistakenly
supposed that it was in LATIN1 encoding, but evidently it was actually in
LATIN2. This resulted in ő (o-double-acute, U+0151, which is code 0xF5 in
LATIN2) being misconverted into õ (o-tilde, U+00F5), as complained of in
bug #10589 from Zoltán Sörös. We'd have messed up u-double-acute too,
but there aren't any of those in the file. Other characters used in the
file have the same codes in LATIN1 and LATIN2, which no doubt helped hide
the problem for so long.
The error is not only ours: the Snowball project also was confused about
which encoding is required for Hungarian. But dealing with that will
require source-code changes that I'm not at all sure we'll wish to
back-patch. Fixing the stopword file seems reasonably safe to back-patch
however.
Previously, the code used a node label of zero both for strings that
contain no bytes beyond the inner tuple's prefix, and for cases where an
"allTheSame" inner tuple has to be split to allow a string with a different
next byte to be inserted into it. Failing to distinguish these cases meant
that if a string ending with the current prefix needed to be inserted into
an allTheSame tuple, we got into an infinite loop, because after splitting
the tuple we'd descend into the child allTheSame tuple and then find we
need to split again.
To fix, instead use -1 and -2 as the node labels for these two cases.
This requires widening the node label type from "char" to int2, but
fortunately SPGiST stores all pass-by-value node label types in their
Datum representation, which means that this change is transparently upward
compatible so far as the on-disk representation goes. We continue to
recognize zero as a dummy node label for reading purposes, but will not
attempt to push new index entries down into such a label, so that the loop
won't occur even when dealing with an existing index.
Per report from Teodor Sigaev. Back-patch to 9.2 where the faulty
code was introduced.
In a50d976254 I already changed this, but got it wrong for the case
where the number of members is larger than the number of entries that
fit in the last page of the last segment.
As reported by Serge Negodyuck in a followup to bug #8673.
A ReorderBufferTransaction's end_lsn, the sentPtr advocated by
walsender keepalive messages, and the end location remembered by the
decoding get_*changes* SQL functions all use the location of the last
read record + 1. I.e. the LSN points to the beginning of the next
record. That cannot realistically be changed without changing the
replication protocol because that's how keepalive messages have worked
since 9.0.
The bug is that the logic inside the snapshot builder, which decides
whether a transaction's contents should be decoded, assumed the start
location would point towards the last byte of the last record. The
reason this didn't actually cause visible problems is that currently
that decision is only made for commit records. Since interesting
transactions always have at least one additional record - containing
actual data - we'd never skip a transaction.
But if there ever were transactions, or other events, with just one
record containing important information, we'd skip them after stopping
and restarting logical decoding.
It's critical that the backend's idea of LOBLKSIZE match the way data has
actually been divided up in pg_largeobject. While we don't provide any
direct way to adjust that value, doing so is a one-line source code change
and various people have expressed interest recently in changing it. So,
just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in
pg_control and cross-check that the backend's compiled-in setting matches
the on-disk data.
Also tweak the code in inv_api.c so that fetches from pg_largeobject
explicitly verify that the length of the data field is not more than
LOBLKSIZE. Formerly we just had Asserts() for that, which is no protection
at all in production builds. In some of the call sites an overlength data
value would translate directly to a security-relevant stack clobber, so it
seems worth one extra runtime comparison to be sure.
In the back branches, we can't change the contents of pg_control; but we
can still make the extra checks in inv_api.c, which will offer some amount
of protection against running with the wrong value of LOBLKSIZE.
Previously there's been a mix between 'slotname' and 'slot_name'. It's
not nice to be unneccessarily inconsistent in a new feature. As a post
beta1 initdb now is required in the wake of eeca4cd35e, fix the
inconsistencies.
Most the changes won't affect usage of replication slots because the
majority of changes is around function parameter names. The prominent
exception to that is that the recovery.conf parameter
'primary_slotname' is now named 'primary_slot_name'.
The way the code was written, the padding was copied from uninitialized
memory areas.. Because the structs are local variables in the code where
the WAL records are constructed, making them larger and zeroing the padding
bytes would not make the code very pretty, so rather than fixing this
directly by zeroing out the padding bytes, it seems more clear to not try to
align the tuples in the WAL records. The redo functions are taught to copy
the tuple header to a local variable to avoid unaligned access.
Stable-branches have the same problem, but we can't change the WAL format
there, so fix in master only. Reading a few random extra bytes at the stack
is harmless in practice, so it's not worth crafting a different
back-patchable fix.
Per reports from Kevin Grittner and Andres Freund, using clang static
analyzer and Valgrind, respectively.
This is needed to allow ORDER BY, DISTINCT, etc to work as expected for
pg_lsn values.
We had previously decided to put this off for 9.5, but in view of commit
eeca4cd35e there's no reason to avoid a
catversion bump for 9.4beta2, and this does make a pretty significant
usability difference for pg_lsn.
Michael Paquier, with fixes from Andres Freund and Tom Lane
HeapTupleSatisfiesVacuum() didn't properly discern between
DELETE_IN_PROGRESS and INSERT_IN_PROGRESS for rows that have been
inserted in the current transaction and deleted in a aborted
subtransaction of the current backend. At the very least that caused
problems for CLUSTER and CREATE INDEX in transactions that had
aborting subtransactions producing rows, leading to warnings like:
WARNING: concurrent delete in progress within table "..."
possibly in an endless, uninterruptible, loop.
Instead of treating *InProgress xmins the same as *IsCurrent ones,
treat them as being distinct like the other visibility routines. As
implemented this separatation can cause a behaviour change for rows
that have been inserted and deleted in another, still running,
transaction. HTSV will now return INSERT_IN_PROGRESS instead of
DELETE_IN_PROGRESS for those. That's both, more in line with the other
visibility routines and arguably more correct. The latter because a
INSERT_IN_PROGRESS will make callers look at/wait for xmin, instead of
xmax.
The only current caller where that's possibly worse than the old
behaviour is heap_prune_chain() which now won't mark the page as
prunable if a row has concurrently been inserted and deleted. That's
harmless enough.
As a cautionary measure also insert a interrupt check before the gotos
in IndexBuildHeapScan() that lead to the uninterruptible loop. There
are other possible causes, like a row that several sessions try to
update and all fail, for repeated loops and the cost of doing so in
the retry case is low.
As this bug goes back all the way to the introduction of
subtransactions in 573a71a5da backpatch to all supported releases.
Reported-By: Sandro Santilli
187492b6c2 changed pgstat.c so that
the stats files were saved into $PGDATA/pg_stat directory when the server
was shutdowned. But it accidentally forgot to change the location of
pg_stat_statements permanent stats file. This commit fixes pg_stat_statements
so that its stats file is also saved into $PGDATA/pg_stat at shutdown.
Since this fix changes the file layout, we don't back-patch it to 9.3
where this oversight was introduced.
Per gripe from Peter Eisentraut and Tom Lane.
The output is slightly different, but still ISO 8601 compliant: to_char
doesn't output the minutes when time zone offset is an integer number of
hours, while EncodeDateTime outputs ":00".
The code is slightly adapted from code in xml.c
Previously, any backslash in text being escaped for JSON was doubled so
that the result was still valid JSON. However, this led to some perverse
results in the case of Unicode sequences, These are now detected and the
initial backslash is no longer escaped. All other backslashes are
still escaped. No validity check is performed, all that is looked for is
\uXXXX where X is a hexidecimal digit.
This is a change from the 9.2 and 9.3 behaviour as noted in the Release
notes.
Per complaint from Teodor Sigaev.
Many JSON processors require timestamp strings in ISO 8601 format in
order to convert the strings. When converting a timestamp, with or
without timezone, to a JSON datum we therefore now use such a format
rather than the type's default text output, in functions such as
to_json().
This is a change in behaviour from 9.2 and 9.3, as noted in the release
notes.
Because RecoveryConflictInterrupt() didn't set the process latch
anything using the latter to wait for events didn't get notified about
recovery conflicts. Most latch users are never the target of recovery
conflicts, which explains the lack of reports about this until
now.
Since 9.3 two possible affected users exist though: The sql callable
pg_sleep() now uses latches to wait and background workers are
expected to use latches in their main loop. Both would currently wait
until the end of WaitLatch's timeout.
Fix by adding a SetLatch() to RecoveryConflictInterrupt(). It'd also
be possible to fix the issue by having each latch user set
set_latch_on_sigusr1. That seems failure prone and though, as most of
these callsites won't often receive recovery conflicts and thus will
likely only be tested against normal query cancels et al. It'd also be
unnecessarily verbose.
Backpatch to 9.1 where latches were introduced. Arguably 9.3 would be
sufficient, because that's where pg_sleep() was converted to waiting
on the latch and background workers got introduced; but there could be
user level code making use of the latch pre 9.3.
Instead of iterating over jsonb structures, use the inbuilt functions
findJsonbValueFromContainerLen() and getIthJsonbValueFromContainer() to
extract values directly. These functions use algorithms that are O(n log
n) and O(1) respectively, whereas iterating is O(n), so we should see
considerable speedup here.
Teodor Sigaev.
This reverts commit 45b7abe59e.
It turns out that the %name-prefix syntax without "=" does not work
at all in pre-2.4 Bison. We are not prepared to make such a large
jump in minimum required Bison version just to suppress a warning
message in a version hardly any developers are using yet.
When 3.0 gets more popular, we'll figure out a way to deal with this.
In the meantime, BISONFLAGS=-Wno-deprecated is recommendable for
anyone using 3.0 who doesn't want to see the warning.
Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for
further WAL using WalSndWaitForWal(). That used to always respect
wal_sender_timeout and kill the session when waiting long enough
because no feedback/ping messages can be sent while the slot is still
being created.
Introduce the notion that last_reply_timestamp = 0 means that the
walsender currently doesn't need timeout processing to avoid that
problem. Use that notion for CREATE_REPLICATION_SLOT ... LOGICAL.
Bugreport and initial patch by Steve Singer, revised by me.
The only caller of compareJsonbScalarValue that needed locale-sensitive
comparison of strings was also the only caller that didn't just check for
equality. Separate the two cases for clarity: compareJsonbScalarValue now
does locale-sensitive comparison, and a new function,
equalsJsonbScalarValue, just checks for equality.
Fix an over-zealous assertion, which didn't take into account that sometimes
a scalar element can be compared against an array/object element.
Avoid comparing possibly-uninitialized local variables when end-of-array or
end-of-object is reached. Also fix and enhance comments a bit.
Peter Geoghegan, per reports by Pavel Stehule and me.
%name-prefix doesn't use an "=" sign according to the Bison docs, but it
silently accepted one anyway, until Bison 3.0. This was originally a
typo of mine in commit 012abebab1, and we
seem to have slavishly copied the error into all the other grammar files.
Per report from Vik Fearing; analysis by Peter Eisentraut.
Back-patch to all active branches, since somebody might try to build
a back branch with up-to-date tools.
Move the code that sends the initial status information as well as the
calculation of paths inside the ENSURE_ERROR_CLEANUP block. If this code
failed, we would "leak" a counter of number of concurrent backups, thereby
making the system always believe it was in backup mode. This could happen
if the sending failed (which it probably never did given that the small
amount of data to send would never cause a flush) or if the psprintf calls
ran out of memory. Both are very low risk, but all operations after
do_pg_start_backup should be protected.
The new page deletion code didn't cope with the case the target page's
right sibling was marked half-dead. It failed a sanity check which checked
that the downlinks in the parent page match the lower level, because a
half-dead page has no downlink. To cope, check for that condition, and
just give up on the deletion if it happens. The vacuum will finish the
deletion of the half-dead page when it gets there, and on the next vacuum
after that the empty can be deleted.
Reported by Jeff Janes.
HeapTupleHeaderGetCmax() asserts that it is only used if the tuple has
been updated by the current transaction. That check is correct and
sensible but requires allocating memory if xmax is a multixact. When
wal_level is set to logical cmax needs to be included in a wal record
, generated inside a critical section, which can trigger the assertion
added in 4a170ee9e.
Reported-By: Steve Singer
Define padding bytes in SharedInvalidationMessage structs to be
defined. Otherwise the sinvaladt.c ringbuffer, which is accessed by
multiple processes, will cause spurious valgrind warnings about
undefined memory being used. That's because valgrind remembers the
undefined bytes from the last local process's store, not realizing
that another process has written since, filling the previously
uninitialized bytes.
Commit af7914c662, which introduced the
EXPLAIN (TIMING) option, for some reason coded explain.c to look at
planstate->instrument->need_timer rather than es->timing to decide
whether to print timing info. However, the former flag might get set
as a result of contrib/auto_explain wanting timing information. We
certainly don't want activation of auto_explain to change user-visible
statement behavior, so fix that.
Also fix an independent bug introduced in the same patch: in the code
path for a never-executed node with a machine-friendly output format,
if timing was selected, it would fail to print the Actual Rows and Actual
Loops items.
Per bug #10404 from Tomonari Katsumata. Back-patch to 9.2 where the
faulty code was introduced.
I got the backup block numbers off-by-one in the commit that changed the
way incomplete-splits are handled. I blame the comments, which said
"backup block 1" and "backup block 2", even though the backup blocks
are numbered starting from 0, in the macros and functions used in replay.
Fix the comments and the code.
Per Jeff Janes' bug report about corruption caused by torn page writes.
The incorrect code is new in git master, but backpatch the comment change
down to 9.0, where the numbering in the redo-side macros was changed.
That's what I get for not fully retesting the final version of the patch.
The replace_allowed cross-check needs an additional special case for
bootstrapping.
RelationCacheInsert() ignored the possibility that hash_search(HASH_ENTER)
might find a hashtable entry already present for the same OID. However,
that can in fact occur during recursive relcache load scenarios. When it
did happen, we overwrote the pointer to the pre-existing Relation, causing
a session-lifespan leakage of that entire structure. As far as is known,
the pre-existing Relation would always have reference count zero by the
time we arrive back at the outer insertion, so add code that deletes the
pre-existing Relation if so. If by some chance its refcount is positive,
elog a WARNING and allow the pre-existing Relation to be leaked as before.
Also, AttrDefaultFetch() was sloppy about leaking the cstring form of the
pg_attrdef.adbin value it's copying into the relcache structure. This is
only a query-lifespan leakage, and normally not very significant, but it
adds up during CLOBBER_CACHE testing.
These bugs are of very ancient vintage, but I'll refrain from back-patching
since there's no evidence that these leaks amount to anything in ordinary
usage.
The fallback implementation involves acquiring and releasing a spinlock
variable that is otherwise unreferenced --- not even to the extent of
initializing it. This accidentally fails to fail on platforms where
spinlocks should be initialized to zeroes, but elsewhere it results in
a "stuck spinlock" failure during startup.
I griped about this last July, and put in a hack that worked for gcc
on HPPA, but didn't get around to fixing the general case. Per the
discussion back then, the best thing to do seems to be to initialize
dummy_spinlock in main.c.
The xl_heap_header_len structures in an XLOG_HEAP_UPDATE record aren't
necessarily aligned adequately. The regular replay function for these
records is aware of that, but decode.c didn't get the memo. I'm not
sure why the buildfarm failed to catch this; the test_decoding test
certainly blows up real good on my old HPPA box.
Also, I'm pretty sure that the address arithmetic was wrong for the
case of XLOG_HEAP_CONTAINS_OLD and not XLOG_HEAP_CONTAINS_NEW_TUPLE,
though this apparently can't happen when logical decoding is active.
transam/README explained how B-tree incomplete splits were tracked and
fixed after recovery, as an example of handling complex actions that need
multiple WAL records, but that's not how it works anymore. Explain the new
paradigm.
Several years ago we changed chr(int) so that if the database encoding is
UTF8, it would interpret its argument as a Unicode code point and expand it
into the appropriate multibyte sequence. However, we weren't sufficiently
careful about checking validity of the input. According to RFC3629, UTF8
disallows code points above U+10FFFF (note that the predecessor standard
RFC2279 was more liberal). Also, both versions of the UTF8 spec agree
that Unicode surrogate-pair codes should never appear in UTF8. Because
our encoding validity checks follow RFC3629, our failure to enforce these
restrictions in chr() means it could be used to produce text strings that
will be rejected when the database is dumped and reloaded. To ensure
consistency with the input functions, let's actually apply
pg_utf8_islegal() to the proposed output of chr().
Per discussion, this seems like too much of a behavioral change to
back-patch, but it's not too late to squeeze it into 9.4.
The decoding of prepared transaction commits accidentally used the XID of
the transaction performing the COMMIT PREPARED, not the XID of the prepared
transaction. Before bb38fb0d43 that lead to those transactions not being
decoded, afterwards to a assertion failure.
Commit dd428c79 added dbId and tsId to the xl_xact_commit struct but missed
that prepared transaction commits reuse that struct. Fix that.
Because those fields were left unitialized, replaying a commit prepared WAL
record in a hot standby node would fail to remove the relcache init file.
That can lead to "could not open file" errors on the standby. Relcache init
file only needs to be removed when a system table/index is rewritten in the
transaction using two phase commit, so that should be rare in practice. In
HEAD, the incorrect dbId/tsId values are also used for filtering in logical
replication code, causing the transaction to always be filtered out.
Analysis and fix by Andres Freund. Backpatch to 9.0 where hot standby was
introduced.
In yesterday's commit 2dc4f011fd, I tried
to force buffering of stdout/stderr in initdb to be what it is by
default when the program is run interactively on Unix (since that's how
most manual testing is done). This tripped over the fact that Windows
doesn't support _IOLBF mode. We dealt with that a long time ago in
syslogger.c by falling back to unbuffered mode on Windows. Export that
solution in port.h and use it in initdb.
Back-patch to 8.4, like the previous commit.
The proc array can contain duplicate XIDs, when a transaction is just being
prepared for two-phase commit. To cope, remove any duplicates in
txid_current_snapshot(). Also ignore duplicates in the input functions, so
that if e.g. you have an old pg_dump file that already contains duplicates,
it will be accepted.
Report and fix by Jan Wieck. Backpatch to all supported versions.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transfering the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.
To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.
Backpatch to all supported versions.
rd_replidindex should be managed the same as rd_oidindex, and rd_keyattr
and rd_idattr should be managed like rd_indexattr. Omissions in this area
meant that the bitmapsets computed for rd_keyattr and rd_idattr would be
leaked during any relcache flush, resulting in a slow but permanent leak in
CacheMemoryContext. There was also a tiny probability of relcache entry
corruption if we ran out of memory at just the wrong point in
RelationGetIndexAttrBitmap. Otherwise, the fields were not zeroed where
expected, which would not bother the code any AFAICS but could greatly
confuse anyone examining the relcache entry while debugging.
Also, create an API function RelationGetReplicaIndex rather than letting
non-relcache code be intimate with the mechanisms underlying caching of
that value (we won't even mention the memory leak there).
Also, fix a relcache flush hazard identified by Andres Freund:
RelationGetIndexAttrBitmap must not assume that rd_replidindex stays valid
across index_open.
The aspects of this involving rd_keyattr date back to 9.3, so back-patch
those changes.
When cache invalidations arrive while ri_LoadConstraintInfo() is busy
filling a new cache entry, InvalidateConstraintCacheCallBack() compares
the - not yet initialized - oidHashValue field with the to-be-invalidated
hash value. To fix, check whether the entry is already marked as invalid.
Andres Freund
The original coding for ALTER SYSTEM made a fundamentally bogus assumption
that postgresql.auto.conf could be sought relative to the main config file
if we hadn't yet determined the value of data_directory. This fails for
common arrangements with the config file elsewhere, as reported by
Christoph Berg.
The simplest fix is to not try to read postgresql.auto.conf until after
SelectConfigFiles has chosen (and locked down) the data_directory setting.
Because of the logic in ProcessConfigFile for handling resetting of GUCs
that've been removed from the config file, we cannot easily read the main
and auto config files separately; so this patch adopts a brute force
approach of reading the main config file twice during postmaster startup.
That's a tad ugly, but the actual time cost is likely to be negligible,
and there's no time for a more invasive redesign before beta.
With this patch, any attempt to set data_directory via ALTER SYSTEM
will be silently ignored. It would probably be better to throw an
error, but that can be dealt with later. This bug, however, would
prevent any testing of ALTER SYSTEM by a significant fraction of the
userbase, so it seems important to get it fixed before beta.
There's no longer much pressure to switch the default GIN opclass for
jsonb, but there was still some unhappiness with the name "jsonb_hash_ops",
since hashing is no longer a distinguishing property of that opclass,
and anyway it seems like a relatively minor detail. At the suggestion of
Heikki Linnakangas, we'll use "jsonb_path_ops" instead; that captures the
important characteristic that each index entry depends on the entire path
from the document root to the indexed value.
Also add a user-facing explanation of the implementation properties of
these two opclasses.
Per discussion, this seems like a more consistent choice of name.
Fabrízio de Royes Mello, after a suggestion by Peter Eisentraut;
some additional documentation wordsmithing by me
When returning rows from a bitmap, as done with partial match queries, we
would get stuck in an infinite loop if the bitmap contained a lossy page
reference.
This bug is new in master, it was introduced by the patch to allow skipping
items refuted by other entries in GIN scans.
Report and fix by Alexander Korotkov
reserveFromBuffer() failed to consider the possibility that it needs to
more-than-double the current buffer size. Beyond that, it seems likely
that we'd someday need to worry about integer overflow of the buffer
length variable. Rather than reinvent the logic that's already been
debugged in stringinfo.c, let's go back to using that logic. We can
still have the same targeted API, but we'll rely on stringinfo.c to
manage reallocation.
Per report from Alexander Korotkov.
These functions were relying on typcategory to identify arrays and
composites, which is not reliable and not the normal way to do it.
Using typcategory to identify boolean, numeric types, and json itself is
also pretty questionable, though the code in those cases didn't seem to be
at risk of anything worse than wrong output. Instead, use the standard
lsyscache functions to identify arrays and composites, and rely on a direct
check of the type OID for the other cases.
In HEAD, also be sure to look through domains so that a domain is treated
the same as its base type for conversions to JSON. However, this is a
small behavioral change; given the lack of field complaints, we won't
back-patch it.
In passing, refactor so that there's only one copy of the code that decides
which conversion strategy to apply, not multiple copies that could (and
have) gotten out of sync.
Post-commit review identified a number of places where addition was
used instead of multiplication or memory wasn't zeroed where it should
have been. This commit also fixes one case where a structure member
was mis-initialized, and moves another memory allocation closer to
the place where the allocated storage is used for clarity.
Andres Freund
This code really needs to be refactored so that there aren't so many copies
that can diverge. Not to mention that this whole approach is probably
wrong. But for the moment I'll just stick my finger in the dike.
Per report from Michael Paquier.
Fix JSONB_MAX_ELEMS and JSONB_MAX_PAIRS macros to use CB_MASK in the
calculation. JENTRY_POSMASK happens to have the same value at the moment,
but that's just coincidental.
Refactor jsonb iterator functions, for readability.
Get rid of the JENTRY_ISFIRST flag. Whenever we handle JEntrys, we have
access to the whole array and have enough context information to know
which entry is the first. This frees up one bit in the JEntry header for
future use. While we're at it, shuffle the JEntry bits so that boolean
true and false go together, for aesthetic reasons.
Bump catalog version as this changes the on-disk format slightly.
Change the key representation so that values that would exceed 127 bytes
are hashed into short strings, and so that the original JSON datatype of
each value is recorded in the index. The hashing rule eliminates the major
objection to having this opclass be the default for jsonb, namely that it
could fail for plausible input data (due to GIN's restrictions on maximum
key length). Preserving datatype information doesn't really buy us much
right now, but it requires no extra space compared to the previous way,
and it might be useful later.
Also, change the consistency-checking functions to request recheck for
exists (jsonb ? text) and related operators. The original analysis that
this is an exactly checkable query was incorrect, since the index does
not preserve information about whether a key appears at top level in
the indexed JSON object. Add a test case demonstrating the problem.
Make some other, mostly cosmetic improvements to the code in jsonb_gin.c
as well.
catversion bump due to on-disk data format change in jsonb_ops indexes.
Move the functions around to group related functions together. Remove
binequal argument from lengthCompareJsonbStringValue, moving that
responsibility to lengthCompareJsonbPair. Fix typo in comment.
Per discussion, the old value of 128MB is ridiculously small on modern
machines; in fact, it's not even any larger than the default value of
shared_buffers, which it certainly should be. Increase to 4GB, which
is unlikely to be any worse than the old default for anyone, and should
be noticeably better for most. Eventually we might have an autotuning
scheme for this setting, but the recent attempt crashed and burned,
so for now just do this.
This reverts commit ee1e5662d8, as well as
a remarkably large number of followup commits, which were mostly concerned
with the fact that the implementation didn't work terribly well. It still
doesn't: we probably need some rather basic work in the GUC infrastructure
if we want to fully support GUCs whose default varies depending on the
value of another GUC. Meanwhile, it also emerged that there wasn't really
consensus in favor of the definition the patch tried to implement (ie,
effective_cache_size should default to 4 times shared_buffers). So whack
it all back to where it was. In a followup commit, I'll do what was
recently agreed to, which is to simply change the default to a higher
value.
To-be-deleted list pages contain no useful information, as they are being
deleted, but we must still protect the writes from being torn by a crash
after a partial write. To do that, re-initialize the pages on WAL replay.
Jeff Janes caught this with a test program to test partial writes.
Backpatch to all supported versions.
Just as we would start bgworkers immediately after an initial startup
of the server, we should restart them immediately when reinitializing.
Petr Jelinek and Robert Haas
The main target of this cleanup is the convertJsonb() function, but I also
touched a lot of other things that I spotted into in the process.
The new convertToJsonb() function uses an output buffer that's resized on
demand, so the code to estimate of the size of JsonbValue is removed.
The on-disk format was not changed, even though I refactored the structs
used to handle it. The term "superheader" is replaced with "container".
The jsonb_exists_any and jsonb_exists_all functions no longer sort the input
array. That was a premature optimization, the idea being that if there are
duplicates in the input array, you only need to check them once. Also,
sorting the array saves some effort in the binary search used to find a key
within an object. But there were drawbacks too: the sorting and
deduplicating obviously isn't free, and in the typical case there are no
duplicates to remove, and the gain in the binary search was minimal. Remove
all that, which makes the code simpler too.
This includes a bug-fix; the total length of the elements in a jsonb array
or object mustn't exceed 2^28. That is now checked.
Since the postmaster won't perform a crash-and-restart sequence
for background workers which don't request shared memory access,
we'd better make sure that they can't corrupt shared memory.
Patch by me, review by Tom Lane.
ActiveSnapshot needs to be set when we call ExecutorRewind because some
plan node types may execute user-defined functions during their ReScan
calls (nodeLimit.c does so, at least). The wisdom of that is somewhat
debatable, perhaps, but for now the simplest fix is to make sure the
required context is valid. Failure to do this typically led to a
null-pointer-dereference core dump, though it's possible that in more
complex cases a function could be executed with the wrong snapshot
leading to very subtle misbehavior.
Per report from Leif Jensen. It's been broken for a long time, so
back-patch to all active branches.
The motivation for a crash and restart cycle when a backend dies is
that it might have corrupted shared memory on the way down; and we
can't recover reliably except by reinitializing everything. But that
doesn't apply to processes that don't touch shared memory. Currently,
there's nothing to prevent a background worker that doesn't request
shared memory access from touching shared memory anyway, but that's a
separate bug.
Previous to this commit, the coding in postmaster.c was inconsistent:
an exit status other than 0 or 1 didn't provoke a crash-and-restart,
but failure to release the postmaster child slot did. This change
makes those cases consistent.
The coding in JsonbHashScalarValue might have accidentally failed to fail
given current representational choices, but the key word there would be
"accidental". Insert the appropriate datatype conversion macro. And
use the right conversion macro for hash_numeric's result, too.
In passing make the code a bit cleaner and less repetitive by factoring
out the xor step from the switch.
Index-only scans avoid taking a lock on the VM buffer, which would
cause a lot of contention. To be correct, that requires some intricate
assumptions that weren't completely documented in the previous
comment.
Reviewed by Robert Haas.
The previous coding would potentially cause attaching to segment A to
fail if segment B was at the same time in the process of going away.
Andres Freund, with a comment tweak by me
Commit fad153ec45 modified sinval.c to reduce
the number of calls into sinvaladt.c (which require taking a shared lock)
by keeping a local buffer of collected-but-not-yet-processed messages.
However, if processing of the last message in a batch resulted in a
recursive call to ReceiveSharedInvalidMessages, we could overwrite that
message with a new one while the outer invalidation function was still
working on it. This would be likely to lead to invalidation of the wrong
cache entry, allowing subsequent processing to use stale cache data.
The fix is just to make a local copy of each message while we're processing
it.
Spotted by Andres Freund. Back-patch to 8.4 where the bug was introduced.
If they were not 'oldtup.t_data' would be dereferenced while set to NULL
in case of a full page image for block 0.
Do so primarily to silence coverity; but also to make sure this prerequisite
isn't changed without adapting the replay routine as that would appear to
work in many cases.
Andres Freund
ruleutils.c tries to cope with additions/deletions/renamings of columns in
tables referenced by views, by means of adding machine-generated aliases to
the printed form of a view when needed to preserve the original semantics.
A recent blog post by Marko Tiikkaja pointed out a case I'd missed though:
if one input of a join with USING is itself a join, there is nothing to
stop the user from adding a column of the same name as the USING column to
whichever side of the sub-join didn't provide the USING column. And then
there'll be an error when the view is re-parsed, since now the sub-join
exposes two columns matching the USING specification. We were catching a
lot of related cases, but not this one, so add some logic to cope with it.
Back-patch to 9.3, which is the first release that makes any serious
attempt to cope with such cases (cf commit 2ffa740be and follow-ons).
If we have an array of records stored on disk, the individual record fields
cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are
only prepared to deal with TOAST pointers appearing in top-level fields of
a stored row. The same applies for ranges over composite types, nested
composites, etc. However, the existing code only took care of expanding
sub-field TOAST pointers for the case of nested composites, not for other
structured types containing composites. For example, given a command such
as
UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ...
where x is a direct reference to a field of an on-disk tuple, if that field
is long enough to be toasted out-of-line then the TOAST pointer would be
inserted as-is into the array column. If the source record for x is later
deleted, the array field value would become a dangling pointer, leading
to errors along the line of "missing chunk number 0 for toast value ..."
when the value is referenced. A reproducible test case for this was
provided by Jan Pecek, but it seems likely that some of the "missing chunk
number" reports we've heard in the past were caused by similar issues.
Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to
produce a self-contained Datum value if the Datum is of composite type.
Seen in this light, the problem is not just confined to arrays and ranges,
but could also affect some other places where detoasting is done in that
way, for example form_index_tuple().
I tried teaching the array code to apply toast_flatten_tuple_attribute()
along with PG_DETOAST_DATUM() when the array element type is composite,
but this was messy and imposed extra cache lookup costs whether or not any
TOAST pointers were present, indeed sometimes when the array element type
isn't even composite (since sometimes it takes a typcache lookup to find
that out). The idea of extending that approach to all the places that
currently use PG_DETOAST_DATUM() wasn't attractive at all.
This patch instead solves the problem by decreeing that composite Datum
values must not contain any out-of-line TOAST pointers in the first place;
that is, we expand out-of-line fields at the point of constructing a
composite Datum, not at the point where we're about to insert it into a
larger tuple. This rule is applied only to true composite Datums, not
to tuples that are being passed around the system as tuples, so it's not
as invasive as it might sound at first. With this approach, the amount
of code that has to be touched for a full solution is greatly reduced,
and added cache lookup costs are avoided except when there actually is
a TOAST pointer that needs to be inlined.
The main drawback of this approach is that we might sometimes dereference
a TOAST pointer that will never actually be used by the query, imposing a
rather large cost that wasn't there before. On the other side of the coin,
if the field value is used multiple times then we'll come out ahead by
avoiding repeat detoastings. Experimentation suggests that common SQL
coding patterns are unaffected either way, though. Applications that are
very negatively affected could be advised to modify their code to not fetch
columns they won't be using.
In future, we might consider reverting this solution in favor of detoasting
only at the point where data is about to be stored to disk, using some
method that can drill down into multiple levels of nested structured types.
That will require defining new APIs for structured types, though, so it
doesn't seem feasible as a back-patchable fix.
Note that this patch changes HeapTupleGetDatum() from a macro to a function
call; this means that any third-party code using that macro will not get
protection against creating TOAST-pointer-containing Datums until it's
recompiled. The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER().
It seems likely that this is not a big problem in practice: most of the
tuple-returning functions in core and contrib produce outputs that could
not possibly be toasted anyway, and the same probably holds for third-party
extensions.
This bug has existed since TOAST was invented, so back-patch to all
supported branches.
Be more clear about failure cases in relfilenode->relation lookup,
and fix some other places that were inconsistent or not per our
message style guidelines.
Andres Freund and Tom Lane
This was intended to work always, but the previous code only allowed
it if at least one message was successfully read by the receiver
before the sender detached the queue.
Report by Petr Jelinek. Patch by me.
Commit a730183926 created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that. To clean it up:
* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h. We won't consider this symbol part of the FE/BE API.
* Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.
* So, relfilenode.h now includes relpath.h rather than vice-versa. This
direction of dependency is fine. (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)
* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.
* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().
* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers. (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)
* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().
* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames. This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.
The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.
Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
Since ruleutils.c recurses, it could be driven to stack overflow by
deeply nested constructs. Very large queries might also take long
enough to deparse that a check for interrupts seems like a good idea.
Stick appropriate tests into a couple of key places.
Noted by Greg Stark. Back-patch to all supported branches.
A query such as "SELECT x UNION SELECT y UNION SELECT z UNION ..."
produces a left-deep nested parse tree, which we formerly showed in its
full nested glory and with all the possible parentheses. This does little
for readability, though, and long UNION lists resulting in excessive
indentation are common. Instead, let's omit parentheses and indent all
the subqueries at the same level in such cases.
This patch skips indentation/parenthesization whenever the lefthand input
of a SetOperationStmt is another SetOperationStmt of the same kind and
ALL/DISTINCT property. We could teach the code the exact syntactic
precedence of set operations and thereby avoid parenthesization in some
more cases, but it's not clear that that'd be a readability win: it seems
better to parenthesize if the set operation changes. (As an example,
if there's one UNION in a long list of UNION ALL, it now stands out like
a sore thumb, which seems like a good thing.)
Back-patch to 9.3. This completes our response to a complaint from Greg
Stark that since commit 62e666400d there's a performance problem in pg_dump
for views containing long UNION sequences (or other types of deeply nested
constructs). The previous commit 0601cb54da
handles the general problem, but this one makes the specific case of UNION
lists look a lot nicer.
Continuing to indent no matter how deeply nested we get doesn't really
do anything for readability; what's worse, it results in O(N^2) total
whitespace, which can become a performance and memory-consumption issue.
To address this, once we get past 40 characters of indentation, reduce
the indentation step distance 4x, and also limit the maximum indentation
by reducing it modulo 40. This latter choice is a bit weird at first
glance, but it seems to preserve readability better than a simple cap
would do.
Back-patch to 9.3, because since commit 62e666400d the performance issue
is a hazard for pg_dump.
Greg Stark and Tom Lane
The code attempted to outdent JOIN clauses further left than the parent
FROM keyword, which was odd in any case, and led to inconsistent formatting
since in simple cases the clauses couldn't be moved any further left than
that. And it left a permanent decrement of the indentation level, causing
subsequent lines to be much further left than they should be (again, this
couldn't be seen in simple cases for lack of indentation to give up).
After a little experimentation I chose to make it indent JOIN keywords
two spaces from the parent FROM, which is one space more than the join's
lefthand input in cases where that appears on a different line from FROM.
Back-patch to 9.3. This is a purely cosmetic change, and the bug is quite
old, so that may seem arbitrary; but we are going to be making some other
changes to the indentation behavior in both HEAD and 9.3, so it seems
reasonable to include this in 9.3 too. I committed this one first because
its effects are more visible in the regression test results as they
currently stand than they will be later.
In general we can't discard constant-NULL inputs, since they could change
the result of the AND/OR to be NULL. But at top level of WHERE, we do not
need to distinguish a NULL result from a FALSE result, so it's okay to
treat NULL as FALSE and then simplify AND/OR accordingly.
This is a very ancient oversight, but in 9.2 and later it can lead to
failure to optimize queries that previous releases did optimize, as a
result of more aggressive parameter substitution rules making it possible
to reduce more subexpressions to NULL constants. This is the root cause of
bug #10171 from Arnold Scheffler. We could alternatively have fixed that
by teaching orclauses.c to ignore constant-NULL OR arms, but it seems
better to get rid of them globally.
I resisted the temptation to back-patch this change into all active
branches, but it seems appropriate to back-patch as far as 9.2 so that
there will not be performance regressions of the kind shown in this bug.
In writeListPage, never take a full-page image of the page, because we
have all the information required to re-initialize in the WAL record
anyway. Before this fix, a full-page image was always generated, unless
full_page_writes=off, because when the page is initialized its LSN is
always 0. In stable-branches, keep the code to restore the backup blocks
if they exist, in case that the WAL is generated with an older minor
version, but in master Assert that there are no full-page images.
In the redo routine, add missing "off++". Otherwise the tuples are added
to the page in reverse order. That happens to be harmless because we
always scan and remove all the tuples together, but it was clearly wrong.
Also, it was masked by the first bug unless full_page_writes=off, because
the page was always restored from a full-page image.
Backpatch to all supported versions.
As noted some time ago, the original coding had a typo ("|" for "^")
that made the result less unique than intended. Even the intended
behavior is obsolete since it was based on wanting to produce a
usable value even if we didn't have int64 arithmetic --- a limitation
we stopped supporting years ago. Instead, let's redefine the system
identifier as tv_sec in the upper 32 bits (same as before), tv_usec
in the next 20 bits, and the low 12 bits of getpid() in the remaining
bits. This is still hardly guaranteed-universally-unique, but it's
noticeably better than before. Per my proposal at
<29019.1374535940@sss.pgh.pa.us>
We should use exprTypmod() to extract the typmod of the expression,
instead of just blindly storing -1. This seems to have been an aboriginal
oversight in commit fc8d970cbc which
introduced general-expression indexes. The consequences are only cosmetic
at present, since the index machinery doesn't really look at typmod for
index columns; but still it seems best to describe the column type as
precisely as we can. Per off-list complaint from Thomas Fanghaenel.
Original coding failed to enlarge the array as required if
the requested tranche_id was equal to LWLockTranchesAllocated.
In passing, fix poor style of not casting the result of (re)palloc.
If a tuple is locked, and this lock is later upgraded either to an
update or to a stronger lock, and in the meantime some other process
tries to lock, update or delete the same tuple, it (the tuple) could end
up being updated twice, or having conflicting locks held.
The reason for this is that the second updater checks for a change in
Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
the first lock; and if there's a change, it restarts and re-evaluates
its ability to update the tuple. But it neglected to check for changes
in lock strength or in lock-vs-update status when those two properties
stayed the same. This would lead it to take the wrong decision and
continue with its own update, when in reality it shouldn't do so but
instead restart from the top.
This could lead to either an assertion failure much later (when a
multixact containing multiple updates is detected), or duplicate copies
of tuples.
To fix, make sure to compare the other relevant infomask bits alongside
the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
necessary.
Also, in the belt-and-suspenders spirit, add a check to
MultiXactCreateFromMembers that a multixact being created does not have
two or more members that are claimed to be updates. This should protect
against other bugs that might cause similar bogus situations.
Backpatch to 9.3, where the possibility of multixacts containing updates
was introduced. (In prior versions it was possible to have the tuple
lock upgraded from shared to exclusive, and an update would not restart
from the top; yet we're protected against a bug there because there's
always a sleep to wait for the locking transaction to complete before
continuing to do anything. Really, the fact that tuple locks always
conflicted with concurrent updates is what protected against bugs here.)
Per report from Andrew Dunstan and Josh Berkus in thread at
http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com
Bug analysis by Andres Freund.
Once we've completed a PREPARE, our session is not running a transaction,
so its entry in pg_stat_activity should show xact_start as null, rather
than leaving the value as the start time of the now-prepared transaction.
I think possibly this oversight was triggered by faulty extrapolation
from the adjacent comment that says PrepareTransaction should not call
AtEOXact_PgStat, so tweak the wording of that comment.
Noted by Andres Freund while considering bug #10123 from Maxim Boguk,
although this error doesn't seem to explain that report.
Back-patch to all active branches.
Before 9.4, such an aggregate couldn't be declared, because its final
function would have to have polymorphic result type but no polymorphic
argument, which CREATE FUNCTION would quite properly reject. The
ordered-set-aggregate patch found a workaround: allow the final function
to be declared as accepting additional dummy arguments that have types
matching the aggregate's regular input arguments. However, we failed
to notice that this problem applies just as much to regular aggregates,
despite the fact that we had a built-in regular aggregate array_agg()
that was known to be undeclarable in SQL because its final function
had an illegal signature. So what we should have done, and what this
patch does, is to decouple the extra-dummy-arguments behavior from
ordered-set aggregates and make it generally available for all aggregate
declarations. We have to put this into 9.4 rather than waiting till
later because it slightly alters the rules for declaring ordered-set
aggregates.
The patch turned out a bit bigger than I'd hoped because it proved
necessary to record the extra-arguments option in a new pg_aggregate
column. I'd thought we could just look at the final function's pronargs
at runtime, but that didn't work well for variadic final functions.
It's probably just as well though, because it simplifies life for pg_dump
to record the option explicitly.
While at it, fix array_agg() to have a valid final-function signature,
and add an opr_sanity test to notice future deviations from polymorphic
consistency. I also marked the percentile_cont() aggregates as not
needing extra arguments, since they don't.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
Some ancient comments claimed that fn_nargs could be -1 to indicate a
variable number of input arguments; but this was never implemented, and
is at variance with what we ultimately did with "variadic" functions.
Update the comments.
Forgot to update LSN of left sibling's page, when creating a new root.
I fixed this for regular insertions and page splits earlier, but missed
new root creation.
The README incorrectly claimed that GIN posting tree pages contain an array
of uncompressed items in addition to compressed posting lists. Earlier
versions of the GIN posting list compression patch worked that way, but not
the one that was committed.
Don't use simple_heap_insert to insert the tuple to a sequence relation.
simple_heap_insert creates a heap insertion WAL record, and replaying that
will create a regular heap page without the special area containing the
sequence magic constant, which is wrong for a sequence. That was not a bug
because we always created a sequence WAL record after that, and replaying
that overwrote the bogus heap page, and the transient state could never be
seen by another backend because it was only done when creating a new
sequence relation. But it's simpler and cleaner to avoid that in the first
place.
If we set the all-visible flag after writing WAL record, and XLogInsert
takes a full-page image of the page, the image would not include the flag.
We will then proceed to set the VM bit, which would then be set without the
corresponding all-visible flag on the heap page.
Found by comparing page images on master and standby, after writing/replaying
each WAL record. (There is still a discrepancy: the all-visible flag won't
be set after replaying the HEAP_CLEAN record, even though it is set in the
master. However, it will be set when replaying the HEAP2_VISIBLE record and
setting the VM bit, so the all-visible flag and VM bit are always consistent
on the standby, even though they are momentarily out-of-sync with master)
Backpatch to 9.3 where this code was introduced.
Now that EXPLAIN also outputs a "planning time" measurement, the use of
"total" here seems rather confusing: it sounds like it might include the
planning time which of course it doesn't. Majority opinion was that
"execution time" is a better label, so we'll call it that.
This should be noted as a backwards incompatibility for tools that examine
EXPLAIN ANALYZE output.
In passing, I failed to resist the temptation to do a little editing on the
materialized-view example affected by this change.
According to the Single Unix Spec and assorted man pages, you're supposed
to use the constants named AF_xxx when setting ai_family for a getaddrinfo
call. In a few places we were using PF_xxx instead. Use of PF_xxx
appears to be an ancient BSD convention that was not adopted by later
standardization. On BSD and most later Unixen, it doesn't matter much
because those constants have equivalent values anyway; but nonetheless
this code is not per spec.
In the same vein, replace PF_INET by AF_INET in one socket() call, which
wasn't even consistent with the other socket() call in the same function
let alone the remainder of our code.
Per investigation of a Cygwin trouble report from Marco Atzeri. It's
probably a long shot that this will fix his issue, but it's wrong in
any case.
These are natural complements to the functions added by commit
0886fc6a5c, but they weren't included
in the original patch for some reason. Add them.
Patch by me, per a complaint by Tom Lane. Review by Tatsuo
Ishii.
Apparently, Windows can sometimes return an error code even when the
operation actually worked just fine. Rearrange the order of checks
according to what appear to be the best practices in this area.
Amit Kapila
Previously, in some places, socket creation errors were checked for
negative values, which is not true for Windows because sockets are
unsigned. This masked socket creation errors on Windows.
Backpatch through 9.0. 8.4 doesn't have the infrastructure to fix this.
I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is
allocated a couple of weeks ago. With the default settings, they are both
8k, but they can be changed at compile-time.
This allows squeezing out the unused space in full-page writes. And more
importantly, it can be a useful debugging aid.
In hindsight we should've done this back when GIN was added - we wouldn't
need the 'maxoff' field in the page opaque struct if we had used pd_lower
and pd_upper like on normal pages. But as long as there can be pages in the
index that have been binary-upgraded from pre-9.4 versions, we can't rely
on that, and have to continue using 'maxoff'.
Most of the code churn comes from renaming some macros, now that they're
used on internal pages, too.
This change is completely backwards-compatible, no effect on pg_upgrade.
Make sure we throw an error instead of silently doing the wrong thing when
fed a strategy number we don't recognize. Also, in the places that did
already throw an error, spell the error message in a way more consistent
with our message style guidelines.
Per report from Paul Jones. Although this is a bug, it won't occur unless
a superuser tries to do something he shouldn't, so it doesn't seem worth
back-patching.
The entry B-tree pages all follow the standard page layout. The 9.3 code has
this right. I inadvertently changed this at some point during the big
refactorings in git master.
Repositioning the tuplestore seek pointer in window_gettupleslot() turns
out to be a very significant expense when the window frame is sizable and
the frame end can move. To fix, introduce a tuplestore function for
skipping an arbitrary number of tuples in one call, parallel to the one we
introduced for tuplesort objects in commit 8d65da1f. This reduces the cost
of window_gettupleslot() to O(1) if the tuplestore has not spilled to disk.
As in the previous commit, I didn't try to do any real optimization of
tuplestore_skiptuples for the case where the tuplestore has spilled to
disk. There is probably no practical way to get the cost to less than O(N)
anyway, but perhaps someone can think of something later.
Also fix PersistHoldablePortal() to make use of this API now that we have
it.
Based on a suggestion by Dean Rasheed, though this turns out not to look
much like his patch.
Given that ALTER TABLESPACE has moved on from just existing for
general purpose rename/owner changes, it deserves its own top-level
production in the grammar. This also cleans up the RenameStmt to
only ever be used for actual RENAMEs again- it really wasn't
appropriate to hide non-RENAME productions under there.
Noted by Alvaro.
Views which are marked as security_barrier must have their quals
applied before any user-defined quals are called, to prevent
user-defined functions from being able to see rows which the
security barrier view is intended to prevent them from seeing.
Remove the restriction on security barrier views being automatically
updatable by adding a new securityQuals list to the RTE structure
which keeps track of the quals from security barrier views at each
level, independently of the user-supplied quals. When RTEs are
later discovered which have securityQuals populated, they are turned
into subquery RTEs which are marked as security_barrier to prevent
any user-supplied quals being pushed down (modulo LEAKPROOF quals).
Dean Rasheed, reviewed by Craig Ringer, Simon Riggs, KaiGai Kohei
First installment of the promised moving-aggregate support in built-in
aggregates: count(), sum(), avg(), stddev() and variance() for
assorted datatypes, though not for float4/float8.
In passing, remove a 2001-vintage kluge in interval_accum(): interval
array elements have been properly aligned since around 2003, but
nobody remembered to take out this workaround. Also, fix a thinko
in the opr_sanity tests for moving-aggregate catalog entries.
David Rowley and Florian Pflug, reviewed by Dean Rasheed
Until now, when executing an aggregate function as a window function
within a window with moving frame start (that is, any frame start mode
except UNBOUNDED PRECEDING), we had to recalculate the aggregate from
scratch each time the frame head moved. This patch allows an aggregate
definition to include an alternate "moving aggregate" implementation
that includes an inverse transition function for removing rows from
the aggregate's running state. As long as this can be done successfully,
runtime is proportional to the total number of input rows, rather than
to the number of input rows times the average frame length.
This commit includes the core infrastructure, documentation, and regression
tests using user-defined aggregates. Follow-on commits will update some
of the built-in aggregates to use this feature.
David Rowley and Florian Pflug, reviewed by Dean Rasheed; additional
hacking by me
There were a couple of bugs here. First, if the fuzzy limit was exceeded,
the loop in entryGetItem might drop out too soon if a whole block needs to
be skipped because it's < advancePast ("continue" in a while-loop checks the
loop condition too). Secondly, the loop checked when stepping to a new page
that there is at least one offset on the page < advancePast, but we cannot
rely on that on subsequent calls of entryGetItem, because advancePast might
change in between. That caused the skipping loop to read bogus items in the
TbmIterateResult's offset array.
First item and fix by Alexander Korotkov, second bug pointed out by Fabrízio
de Royes Mello, by a small variation of Alexander's test query.
This operator class can accelerate subnet/supernet tests as well as
btree-equivalent ordered comparisons. It also handles a new network
operator inet && inet (overlaps, a/k/a "is supernet or subnet of"),
which is expected to be useful in exclusion constraints.
Ideally this opclass would be the default for GiST with inet/cidr data,
but we can't mark it that way until we figure out how to do a more or
less graceful transition from the current situation, in which the
really-completely-bogus inet/cidr opclasses in contrib/btree_gist are
marked as default. Having the opclass in core and not default is better
than not having it at all, though.
While at it, add new documentation sections to allow us to officially
document GiST/GIN/SP-GiST opclasses, something there was never a clear
place to do before. I filled these in with some simple tables listing
the existing opclasses and the operators they support, but there's
certainly scope to put more information there.
Emre Hasegeli, reviewed by Andreas Karlsson, further hacking by me
Instead of storing the ID of the dynamic shared memory control
segment in a file within the data directory, store it in the main
control segment. This avoids a number of nasty corner cases,
most seriously that doing an online backup and then using it on
the same machine (e.g. to fire up a standby) would result in the
standby clobbering all of the master's dynamic shared memory
segments.
Per complaints from Heikki Linnakangas, Fujii Masao, and Tom
Lane.
These functions won't throw an error if the object doesn't exist,
or if (for functions and operators) there's more than one matching
object.
Yugo Nagata and Nozomi Anzai, reviewed by Amit Khandekar, Marti
Raudsepp, Amit Kapila, and me.
Don't reset the rightlink of a page when replaying a page update record.
This was a leftover from pre-hot standby days, when it was not possible to
have scans concurrent with WAL replay. Resetting the right-link was not
necessary back then either, but it was done for the sake of tidiness. But
with hot standby, it's wrong, because a concurrent scan might still need it.
Backpatch all versions with hot standby, 9.0 and above.
The one existing assertion of this type has tripped a few times in the
buildfarm lately, but it's not clear whether the problem is really
originating there or whether it's leftovers from a trip through one
of the other two paths that lack a matching assertion. So add one.
Since the same bug(s) most likely exist(s) in the back-branches also,
back-patch to 9.2, where the fast-path lock mechanism was added.
This doesn't seem to be useful any more, and it's not really worth the
effort to keep updating it every time relevant dependencies or calling
signatures in the shared memory or semaphore code change.
Forgot to set the incomplete-split flag on the left page half, in redo of a
page split.
Spotted this by comparing the page contents on master and standby, after
inserting/applying each WAL record.
VALIDATE CONSTRAINT
CLUSTER ON
SET WITHOUT CLUSTER
ALTER COLUMN SET STATISTICS
ALTER COLUMN SET ()
ALTER COLUMN RESET ()
All other sub-commands use AccessExclusiveLock
Simon Riggs and Noah Misch
Reviews by Robert Haas and Andres Freund
Formerly, we set up the postmaster's signal handling only when we were
about to start launching subprocesses. This is a bad idea though, as
it means that for example a SIGINT arriving before that will kill the
postmaster instantly, perhaps leaving lockfiles, socket files, shared
memory, etc laying about. We'd rather that such a signal caused orderly
postmaster termination including releasing of those resources. A simple
fix is to move the PostmasterMain stanza that initializes signal handling
to an earlier point, before we've created any such resources. Then, an
early-arriving signal will be blocked until we're ready to deal with it
in the usual way. (The only part that really needs to be moved up is
blocking of signals, but it seems best to keep the signal handler
installation calls together with that; for one thing this ensures the
kernel won't drop any signals we wished to get. The handlers won't get
invoked in any case until we unblock signals in ServerLoop.)
Per a report from MauMau. He proposed changing the way "pg_ctl stop"
works to deal with this, but that'd just be masking one symptom not
fixing the core issue.
It's been like this since forever, so back-patch to all supported branches.
Also add a regression test for a GIN index with enough items with the same
key, so that a GIN posting tree gets created. Apparently none of the
existing GIN tests were large enough for that.
This code is new, no backpatching required.
EXEC_BACKEND builds (i.e., Windows) failed to absorb values of PGC_BACKEND
parameters if they'd been changed post-startup via the config file. This
for example prevented log_connections from working if it were turned on
post-startup. The mechanism for handling this case has always been a bit
of a kluge, and it wasn't revisited when we implemented EXEC_BACKEND.
While in a normal forking environment new backends will inherit the
postmaster's value of such settings, EXEC_BACKEND backends have to read
the settings from the CONFIG_EXEC_PARAMS file, and they were mistakenly
rejecting them. So this case has always been broken in the Windows port;
so back-patch to all supported branches.
Amit Kapila
The code segment that removes the old symlink (if present) wasn't clued
into the fact that on Windows, symlinks are junction points which have
to be removed with rmdir().
Backpatch to 9.0, where the failing code was introduced.
MauMau, reviewed by Muhammad Asif Naeem and Amit Kapila
There's no really compelling reason to refuse to do these read-only,
non-server-starting options as root, and there's at least one good
reason to allow -C: pg_ctl uses -C to find out the true data directory
location when pointed at a config-only directory. On Windows, this is
done before dropping administrator privileges, which means that pg_ctl
fails for administrators if and only if a config-only layout is used.
Since the root-privilege check is done so early in startup, it's a bit
awkward to check for these switches. Make the somewhat arbitrary
decision that we'll only skip the root check if -C is the first switch.
This is not just to make the code a bit simpler: it also guarantees that
we can't misinterpret a --boot mode switch. (While AuxiliaryProcessMain
doesn't currently recognize any such switch, it might have one in the
future.) This is no particular problem for pg_ctl, and since the whole
behavior is undocumented anyhow, it's not a documentation issue either.
(--describe-config only works as the first switch anyway, so this is
no restriction for that case either.)
Back-patch to 9.2 where pg_ctl first began to use -C.
MauMau, heavily edited by me
This is needed because Windows services may get started with a different
current directory than where pg_ctl is executed. We want relative -D
paths to be interpreted relative to pg_ctl's CWD, similarly to what
happens on other platforms.
In support of this, move the backend's make_absolute_path() function
into src/port/path.c (where it probably should have been long since)
and get rid of the rather inferior version in pg_regress.
Kumar Rajeev Rastogi, reviewed by MauMau
The displayed sendtime and receipttime were always exactly equal, because
somebody forgot that timestamptz_to_str returns a static buffer (thereby
simplifying life for most callers, at the cost of complicating it for those
who need two results concurrently). Apply the same pstrdup solution used
by the other call sites with this issue. Back-patch to 9.2 where the
faulty code was introduced. Per bug #9849 from Haruka Takatsuka, though
this is not exactly his patch.
Possibly we should change timestamptz_to_str's API, but I wouldn't want
to do so in the back branches.
GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical
section. I thought I covered this case with the exemption for the
checkpointer, but CreateCheckPoint is also called from the startup process.
For variadic functions (other than VARIADIC ANY), the syntaxes foo(x,y,...)
and foo(VARIADIC ARRAY[x,y,...]) should be considered equivalent, since the
former is converted to the latter at parse time. They have indeed been
equivalent, in all releases before 9.3. However, commit 75b39e790 made an
ill-considered decision to record which syntax had been used in FuncExpr
nodes, and then to make equal() test that in checking node equality ---
which caused the syntaxes to not be seen as equivalent by the planner.
This is the underlying cause of bug #9817 from Dmitry Ryabov.
It might seem that a quick fix would be to make equal() disregard
FuncExpr.funcvariadic, but the same commit made that untenable, because
the field actually *is* semantically significant for some VARIADIC ANY
functions. This patch instead adopts the approach of redefining
funcvariadic (and aggvariadic, in HEAD) as meaning that the last argument
is a variadic array, whether it got that way by parser intervention or was
supplied explicitly by the user. Therefore the value will always be true
for non-ANY variadic functions, restoring the principle of equivalence.
(However, the planner will continue to consider use of VARIADIC as a
meaningful difference for VARIADIC ANY functions, even though some such
functions might disregard it.)
In HEAD, this change lets us simplify the decompilation logic in
ruleutils.c, since the funcvariadic/aggvariadic flag tells directly whether
to print VARIADIC. However, in 9.3 we have to continue to cope with
existing stored rules/views that might contain the previous definition.
Fortunately, this just means no change in ruleutils.c, since its existing
behavior effectively ignores funcvariadic for all cases other than VARIADIC
ANY functions.
In HEAD, bump catversion to reflect the fact that FuncExpr.funcvariadic
changed meanings; this is sort of pro forma, since I don't believe any
built-in views are affected.
Unfortunately, this patch doesn't magically fix everything for affected
9.3 users. After installing 9.3.5, they might need to recreate their
rules/views/indexes containing variadic function calls in order to get
everything consistent with the new definition. As in the cited bug,
the symptom of a problem would be failure to use a nominally matching
index that has a variadic function call in its definition. We'll need
to mention this in the 9.3.5 release notes.
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays in stack
instead.
There was previously no explicit limit on the number of pages a GiST split
can produce, it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty, a typical split shouldn't produce more than
2-3 page halves.
The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
The code for matching clients to pg_hba.conf lines that specify host names
(instead of IP address ranges) failed to complain if reverse DNS lookup
failed; instead it silently didn't match, so that you might end up getting
a surprising "no pg_hba.conf entry for ..." error, as seen in bug #9518
from Mike Blackwell. Since we don't want to make this a fatal error in
situations where pg_hba.conf contains a mixture of host names and IP
addresses (clients matching one of the numeric entries should not have to
have rDNS data), remember the lookup failure and mention it as DETAIL if
we get to "no pg_hba.conf entry". Apply the same approach to forward-DNS
lookup failures, too, rather than treating them as immediate hard errors.
Along the way, fix a couple of bugs that prevented us from detecting an
rDNS lookup error reliably, and make sure that we make only one rDNS lookup
attempt; formerly, if the lookup attempt failed, the code would try again
for each host name entry in pg_hba.conf. Since more or less the whole
point of this design is to ensure there's only one lookup attempt not one
per entry, the latter point represents a performance bug that seems
sufficient justification for back-patching.
Also, adjust src/port/getaddrinfo.c so that it plays as well as it can
with this code. Which is not all that well, since it does not have actual
support for rDNS lookup, but at least it should return the expected (and
required by spec) error codes so that the main code correctly perceives the
lack of functionality as a lookup failure. It's unlikely that PG is still
being used in production on any machines that require our getaddrinfo.c,
so I'm not excited about working harder than this.
To keep the code in the various branches similar, this includes
back-patching commits c424d0d105 and
1997f34db4 into 9.2 and earlier.
Back-patch to 9.1 where the facility for hostnames in pg_hba.conf was
introduced.
Initialization of this field was not being done according to the
st_changecount protocol (it has to be done within the changecount increment
range, not outside). And the test to see if the value should be reported
as null was wrong. Noted while perusing uses of Port.remote_hostname.
This was wrong from the introduction of this code (commit 4a25bc145),
so back-patch to 9.1.
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated and it needs to be marked dirty. The codepath for an insertion got
this right, but the case where the internal node is split because of
inserting the new downlink missed that.
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated.
Otherwise, the compiler might decide to move modifications to data
within this structure outside the enclosing SpinLockAcquire /
SpinLockRelease pair, leading to shared memory corruption.
This may or may not explain a recent lmgr-related buildfarm failure
on prairiedog, but it needs to be fixed either way.
Previously, such buffers weren't counted, with the possible result that
EXPLAIN (BUFFERS) and pg_stat_statements would understate the true
number of blocks dirtied by an SQL statement.
Back-patch to 9.2, where this counter was introduced.
Amit Kapila
Inserting (in retail) into the new 9.4 format GIN posting tree created much
larger WAL records than in 9.3. The previous strategy to WAL logging was
basically to log the whole page on each change, with the exception of
completely unmodified segments up to the first modified one. That was not
too bad when appending to the end of the page, as only the last segment had
to be WAL-logged, but per Fujii Masao's testing, even that produced 2x the
WAL volume that 9.3 did.
The new strategy is to keep track of changes to the posting lists in a more
fine-grained fashion, and also make the repacking" code smarter to avoid
decoding and re-encoding segments unnecessarily.
The original coding of EquivalenceClasses didn't foresee that appendrel
child relations might themselves be appendrels; but this is possible for
example when a UNION ALL subquery scans a table with inheritance children.
The oversight led to failure to optimize ordering-related issues very well
for the grandchild tables. After some false starts involving explicitly
flattening the appendrel representation, we found that this could be fixed
easily by removing a few implicit assumptions about appendrel parent rels
not being children themselves.
Kyotaro Horiguchi and Tom Lane, reviewed by Noah Misch
Commit 613c6d26bd sloppily replaced a
lookup of the UID obtained from getpeereid() with a lookup of the
server's own user name, thus totally destroying peer authentication.
Revert. Per report from Christoph Berg.
In passing, make sure get_user_name() zeroes *errstr on success on
Windows as well as non-Windows. I don't think any callers actually
depend on this ATM, but we should be consistent across platforms.
If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to
only pass the first XLogRecData entry to the rm_desc routine. I think the
original assumprion was that the first XLogRecData entry contains all the
necessary information for the rm_desc routine, but that's a pretty shaky
assumption. At least standby_redo didn't get the memo.
To fix, piece together all the data in a temporary buffer, and pass that to
the rm_desc routine.
It's been like this forever, but the patch didn't apply cleanly to
back-branches. Probably wouldn't be hard to fix the conflicts, but it's
not worth the trouble.
Set function parameter names and defaults. Add jsonb versions (which the
code already provided for so the actual new code is trivial). Add jsonb
regression tests and docs.
Bump catalog version (which I apparently forgot to do when jsonb was
committed).
Assert errors were thrown for functions being passed invalid encodings,
while the main code handled it just fine.
Also document that libpq's PQclientEncoding() returns -1 for an encoding
lookup failure.
Per report from Peter Geoghegan
The new format accepts exactly the same data as the json type. However, it is
stored in a format that does not require reparsing the orgiginal text in order
to process it, making it much more suitable for indexing and other operations.
Insignificant whitespace is discarded, and the order of object keys is not
preserved. Neither are duplicate object keys kept - the later value for a given
key is the only one stored.
The new type has all the functions and operators that the json type has,
with the exception of the json generation functions (to_json, json_agg etc.)
and with identical semantics. In addition, there are operator classes for
hash and btree indexing, and two classes for GIN indexing, that have no
equivalent in the json type.
This feature grew out of previous work by Oleg Bartunov and Teodor Sigaev, which
was intended to provide similar facilities to a nested hstore type, but which
in the end proved to have some significant compatibility issues.
Authors: Oleg Bartunov, Teodor Sigaev, Peter Geoghegan and Andrew Dunstan.
Review: Andres Freund
This covers all the SQL-standard trigger types supported for regular
tables; it does not cover constraint triggers. The approach for
acquiring the old row mirrors that for view INSTEAD OF triggers. For
AFTER ROW triggers, we spool the foreign tuples to a tuplestore.
This changes the FDW API contract; when deciding which columns to
populate in the slot returned from data modification callbacks, writable
FDWs will need to check for AFTER ROW triggers in addition to checking
for a RETURNING clause.
In support of the feature addition, refactor the TriggerFlags bits and
the assembly of old tuples in ModifyTable.
Ronan Dunklau, reviewed by KaiGai Kohei; some additional hacking by me.
equalTupleDescs() neglected both of these ConstrCheck fields, and
CreateTupleDescCopyConstr() neglected ccnoinherit. At this time, the
only known behavior defect resulting from these omissions is constraint
exclusion disregarding a CHECK constraint validated by an ALTER TABLE
VALIDATE CONSTRAINT statement issued earlier in the same transaction.
Back-patch to 9.2, where these fields were introduced.
Also fix the name of the dtrace probe for LWLockAcquireOrWait(). The
function was renamed from LWLockWaitUntilFree to LWLockAqcuireOrWait, but
the dtrace probe was neglected.
Pointed out by Andres Freund and the buildfarm.
Clear errno before calling readdir() and handle old MinGW errno bug
while adding full test coverage for readdir/closedir failures.
Backpatch through 8.4.
The special feature the XLogInsert slots had over regular LWLocks is the
insertingAt value that was updated atomically with releasing backends
waiting on it. Add new functions to the LWLock API to do that, and replace
the slots with LWLocks. This reduces the amount of duplicated code.
(There's still some duplication, but at least it's all in lwlock.c now.)
Reviewed by Andres Freund.
The previous method was overly complex and underly correct; in particular,
by assigning the default value with PGC_S_OVERRIDE, it prevented later
attempts to change the setting in postgresql.conf, as noted by Jeff Janes.
We should just assign the default value with source PGC_S_DYNAMIC_DEFAULT,
which will have the desired priority relative to the boot_val as well as
user-set values.
There is still a gap in this method: if there's an explicit assignment of
effective_cache_size = -1 in the postgresql.conf file, and that assignment
appears before shared_buffers is assigned, the code will substitute 4 times
the bootstrap default for shared_buffers, and that value will then persist
(since it will have source PGC_S_FILE). I don't see any very nice way
to avoid that though, and it's not a case to be expected in practice.
The existing comments in guc-file.l look forward to a redesign of the
DYNAMIC_DEFAULT mechanism; if that ever happens, we should consider this
case as one of the things we'd like to improve.
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple. For example:
LOG: process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL: Process holding the lock: 11366. Wait queue: 11367.
CONTEXT: while updating tuple (0,2) in relation "foo"
STATEMENT: UPDATE foo SET value = 3;
Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.
Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
For a regex containing backrefs, pg_regexec() might fail to free all the
sub-DFAs that were created during execution, resulting in a permanent
(session lifespan) memory leak. Problem was introduced by me in commit
587359479a. Per report from Sandro Santilli;
diagnosis by Greg Stark.
It is no longer used, none of the resource managers have multi-record
actions that would make it unsafe to perform a restartpoint.
Also don't allow rm_cleanup to write WAL records, it's also no longer
required. Move the call to rm_cleanup routines to make it more symmetric
with rm_startup.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.
Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.
I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.
Reviewed by Peter Geoghegan.
One path through the loop over indexes forgot to do index_close(). Rather
than adding a fourth call, restructure slightly so that there's only one.
In passing, get rid of an unnecessary syscache lookup: the pg_index struct
for the index is already available from its relcache entry.
Per report from YAMAMOTO Takashi, though this is a bit different from his
suggested patch. This is new code in HEAD, so no need for back-patch.
Revise the original decision to expose a uint64-based interface and
use Size everywhere possible. Avoid assuming that MAXIMUM_ALIGNOF is
8, or making any assumption about the relationship between that value
and sizeof(Size). If MAXIMUM_ALIGNOF is bigger, we'll now insert
padding after the length word; if it's smaller, we are now prepared
to read and write the length word in chunks.
Per discussion with Tom Lane.
The new function dsm_detach_all() can be used either by postmaster
children that don't wish to take any risk of accidentally corrupting
shared memory; or by forked children of regular backends with
the same need. This patch also updates the postmaster children that
already do PGSharedMemoryDetach() to do dsm_detach_all() as well.
Per discussion with Tom Lane.
The recently-fixed bug in WAL replay could result in not finding a parent
tuple for a heap-only tuple. The existing code would either Assert or
generate an invalid index entry, neither of which is desirable. Throw a
regular error instead.
On clean shutdown, walsender waits for all WAL to be replicated to a standby,
and exits. It determined whether that replication had been completed by
checking whether its sent location had been equal to a standby's flush
location. Unfortunately this condition never becomes true when the standby
such as pg_receivexlog which always returns an invalid flush location is
connecting to walsender, and then walsender waits forever.
This commit changes walsender so that it just checks a standby's write
location if a flush location is invalid.
Back-patch to 9.1 where enough infrastructure for this exists.
krb_srvname is actually not available anymore as a parameter server-side, since
with gssapi we accept all principals in our keytab. It's still used in libpq for
client side specification.
In passing remove declaration of krb_server_hostname, where all the functionality
was already removed.
Noted by Stephen Frost, though a different solution than his suggestion
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.
Problem
-------
We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):
ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey"
We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.
To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.
Solution
--------
This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.
This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.
This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.
Reviewed by Kevin Grittner and Peter Geoghegan.
This should eliminate the risk of recursive entry to syslog(3), which
appears to be the cause of the hang reported in bug #9551 from James
Morton.
Arguably, the real problem here is auth.c's willingness to turn on
ImmediateInterruptOK while executing fairly wide swaths of backend code.
We may well need to work at narrowing the code ranges in which the
authentication_timeout interrupt is enabled. For the moment, though,
this is a cheap and reasonably noninvasive fix for a field-reported
failure; the other approach would be complex and not necessarily
bug-free itself.
Back-patch to all supported branches.
Use TransactionIdIsInProgress, then TransactionIdDidCommit, to distinguish
whether a NOTIFY message's originating transaction is in progress,
committed, or aborted. The previous coding could accept a message from a
transaction that was still in-progress according to the PGPROC array;
if the client were fast enough at starting a new transaction, it might fail
to see table rows added/updated by the message-sending transaction. Which
of course would usually be the point of receiving the message. We noted
this type of race condition long ago in tqual.c, but async.c overlooked it.
The race condition probably cannot occur unless there are multiple NOTIFY
senders in action, since an individual backend doesn't send NOTIFY signals
until well after it's done committing. But if two senders commit in close
succession, it's certainly possible that we could see the second sender's
message within the race condition window while responding to the signal
from the first one.
Per bug #9557 from Marko Tiikkaja. This patch is slightly more invasive
than what he proposed, since it removes the now-redundant
TransactionIdDidAbort call.
Back-patch to 9.0, where the current NOTIFY implementation was introduced.
When a row is updated, and the new tuple version is put on the same page as
the old one, only WAL-log the part of the new tuple that's not identical to
the old. This saves significantly on the amount of WAL that needs to be
written, in the common case that most fields are not modified.
Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.