/*-------------------------------------------------------------------------
*
* analyze.c
* the Postgres statistics generator
*
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*
* IDENTIFICATION
* src/backend/commands/analyze.c
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include <math.h>
#include "access/detoast.h"
#include "access/genam.h"
#include "access/multixact.h"
#include "access/relation.h"
#include "access/table.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "access/tupconvert.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "catalog/index.h"
#include "catalog/indexing.h"
#include "catalog/pg_inherits.h"
#include "commands/dbcommands.h"
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
#include "common/pg_prng.h"
#include "executor/executor.h"
#include "foreign/fdwapi.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "parser/parse_oper.h"
#include "parser/parse_relation.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "statistics/extended_stats_internal.h"
#include "statistics/statistics.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/attoptcache.h"
#include "utils/datum.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/pg_rusage.h"
#include "utils/sampling.h"
#include "utils/sortsupport.h"
#include "utils/spccache.h"
#include "utils/syscache.h"
#include "utils/timestamp.h"
/* Per-index data for ANALYZE */
typedef struct AnlIndexData
{
IndexInfo *indexInfo; /* BuildIndexInfo result */
double tupleFract; /* fraction of rows for partial index */
VacAttrStats **vacattrstats; /* index attrs to analyze */
int attr_cnt;
} AnlIndexData;
/* Default statistics target (GUC parameter) */
int default_statistics_target = 100;
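The statistics target ultimately drives how many rows ANALYZE samples: do_analyze_rel() takes the largest per-column `minrows` request, with a floor of 100 rows. The following is a minimal standalone sketch of that computation, not code from this file. The helper names and the `SKETCH_` macros are hypothetical, and the 300-rows-per-target multiplier is an assumption about what the standard type-analysis routine requests; a custom typanalyze function may ask for something else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration only. */
#define SKETCH_ROWS_PER_TARGET 300	/* assumed multiplier, see lead-in */
#define SKETCH_MIN_TARGROWS    100	/* lower bound used by do_analyze_rel */

/* Rows a single column would request for a given statistics target. */
int
sketch_minrows_for_target(int stats_target)
{
	assert(stats_target >= 0);
	return SKETCH_ROWS_PER_TARGET * stats_target;
}

/*
 * Worst case over all analyzable columns, never less than the floor.
 * With no columns at all, the floor itself is the result.
 */
int
sketch_target_rows(const int *col_minrows, size_t ncols)
{
	int			targrows = SKETCH_MIN_TARGROWS;

	for (size_t i = 0; i < ncols; i++)
	{
		if (targrows < col_minrows[i])
			targrows = col_minrows[i];
	}
	return targrows;
}
```

Under these assumptions, a column left at the default target of 100 requests 30000 sample rows, and a table with no analyzable columns still samples 100 rows.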
/* A few variables that don't seem worth passing around as parameters */
static MemoryContext anl_context = NULL;
static BufferAccessStrategy vac_strategy;
static void do_analyze_rel(Relation onerel,
VacuumParams *params, List *va_cols,
AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
bool inh, bool in_outer_xact, int elevel);
static void compute_index_stats(Relation onerel, double totalrows,
AnlIndexData *indexdata, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context);
static VacAttrStats *examine_attribute(Relation onerel, int attnum,
Node *index_expr);
static int acquire_sample_rows(Relation onerel, int elevel,
HeapTuple *rows, int targrows,
double *totalrows, double *totaldeadrows);
static int compare_rows(const void *a, const void *b, void *arg);
static int acquire_inherited_sample_rows(Relation onerel, int elevel,
HeapTuple *rows, int targrows,
double *totalrows, double *totaldeadrows);
static void update_attstats(Oid relid, bool inh,
int natts, VacAttrStats **vacattrstats);
static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
/*
* analyze_rel() -- analyze one relation
*
* relid identifies the relation to analyze. If relation is supplied, use
* the name therein for reporting any failure to open/lock the rel; do not
* use it once we've successfully opened the rel, since it might be stale.
*/
void
analyze_rel(Oid relid, RangeVar *relation,
VacuumParams *params, List *va_cols, bool in_outer_xact,
BufferAccessStrategy bstrategy)
{
Relation onerel;
int elevel;
AcquireSampleRowsFunc acquirefunc = NULL;
BlockNumber relpages = 0;
/* Select logging level */
if (params->options & VACOPT_VERBOSE)
elevel = INFO;
else
elevel = DEBUG2;
/* Set up static variables */
vac_strategy = bstrategy;
/*
* Check for user-requested abort.
*/
CHECK_FOR_INTERRUPTS();
/*
* Open the relation, getting ShareUpdateExclusiveLock to ensure that two
* ANALYZEs don't run on it concurrently. (This also locks out a
* concurrent VACUUM, which doesn't matter much at the moment but might
* matter if we ever try to accumulate stats on dead tuples.) If the rel
* has been dropped since we last saw it, we don't need to process it.
*
* Make sure to generate only logs for ANALYZE in this case.
*/
onerel = vacuum_open_relation(relid, relation, params->options & ~(VACOPT_VACUUM),
params->log_min_duration >= 0,
ShareUpdateExclusiveLock);
/* leave if relation could not be opened or locked */
if (!onerel)
return;
/*
* Check if relation needs to be skipped based on privileges. This check
* happens also when building the relation list to analyze for a manual
* operation, and needs to be done additionally here as ANALYZE could
* happen across multiple transactions where privileges could have changed
* in-between. Make sure to generate only logs for ANALYZE in this case.
*/
if (!vacuum_is_permitted_for_relation(RelationGetRelid(onerel),
onerel->rd_rel,
params->options & ~VACOPT_VACUUM))
{
relation_close(onerel, ShareUpdateExclusiveLock);
return;
}
/*
* Silently ignore tables that are temp tables of other backends ---
* trying to analyze these is rather pointless, since their contents are
* probably not up-to-date on disk. (We don't throw a warning here; it
* would just lead to chatter during a database-wide ANALYZE.)
*/
if (RELATION_IS_OTHER_TEMP(onerel))
{
relation_close(onerel, ShareUpdateExclusiveLock);
return;
}
/*
* We can ANALYZE any table except pg_statistic. See update_attstats.
*/
if (RelationGetRelid(onerel) == StatisticRelationId)
{
relation_close(onerel, ShareUpdateExclusiveLock);
return;
}
/*
* Check that it's of an analyzable relkind, and set up appropriately.
*/
if (onerel->rd_rel->relkind == RELKIND_RELATION ||
onerel->rd_rel->relkind == RELKIND_MATVIEW)
{
/* Regular table, so we'll use the regular row acquisition function */
acquirefunc = acquire_sample_rows;
/* Also get regular table's size */
relpages = RelationGetNumberOfBlocks(onerel);
}
else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
{
/*
* For a foreign table, call the FDW's hook function to see whether it
* supports analysis.
*/
FdwRoutine *fdwroutine;
bool ok = false;
fdwroutine = GetFdwRoutineForRelation(onerel, false);
if (fdwroutine->AnalyzeForeignTable != NULL)
ok = fdwroutine->AnalyzeForeignTable(onerel,
&acquirefunc,
&relpages);
if (!ok)
{
ereport(WARNING,
(errmsg("skipping \"%s\" --- cannot analyze this foreign table",
RelationGetRelationName(onerel))));
relation_close(onerel, ShareUpdateExclusiveLock);
return;
}
}
else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
/*
* For partitioned tables, we want to do the recursive ANALYZE below.
*/
}
else
{
/* No need for a WARNING if we already complained during VACUUM */
if (!(params->options & VACOPT_VACUUM))
ereport(WARNING,
(errmsg("skipping \"%s\" --- cannot analyze non-tables or special system tables",
RelationGetRelationName(onerel))));
relation_close(onerel, ShareUpdateExclusiveLock);
return;
}
/*
* OK, let's do it. First, initialize progress reporting.
*/
pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
RelationGetRelid(onerel));
/*
* Do the normal non-recursive ANALYZE. We can skip this for partitioned
* tables, which don't contain any rows.
*/
if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
do_analyze_rel(onerel, params, va_cols, acquirefunc,
relpages, false, in_outer_xact, elevel);
/*
* If there are child tables, do recursive ANALYZE.
*/
if (onerel->rd_rel->relhassubclass)
do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
true, in_outer_xact, elevel);
/*
* Close source relation now, but keep lock so that no one deletes it
* before we commit. (If someone did, they'd fail to clean up the entries
* we made in pg_statistic. Also, releasing the lock before commit would
* expose us to concurrent-update failures in update_attstats.)
*/
relation_close(onerel, NoLock);
pgstat_progress_end_command();
}
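The relkind handling in analyze_rel() above boils down to a small decision table: which sampling function to use, whether to run the plain pass, and whether to run the inheritance pass. The standalone sketch below restates that logic outside the backend. Every type and function name in it (`SketchRelKind`, `SketchPlan`, `sketch_plan_analyze`, and so on) is hypothetical and stands in for the real relkind constants, AcquireSampleRowsFunc, and the FDW hook; it illustrates the control flow and is not the backend API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the relkinds that analyze_rel() distinguishes. */
typedef enum
{
	SKETCH_TABLE,
	SKETCH_MATVIEW,
	SKETCH_FOREIGN,
	SKETCH_PARTITIONED,
	SKETCH_OTHER
} SketchRelKind;

/* Hypothetical stand-in for a row-sampling callback. */
typedef int (*SketchAcquireFunc) (int targrows);

/* Stand-in for the regular heap sampling function. */
int
sketch_heap_acquire(int targrows)
{
	return targrows;
}

typedef struct
{
	bool		analyzable;		/* false means "skip this relation" */
	bool		run_plain;		/* non-recursive pass over the rel itself */
	bool		run_inherited;	/* recursive pass over the inheritance tree */
	SketchAcquireFunc acquire;	/* only used by the plain pass */
} SketchPlan;

/*
 * fdw_acquire models the FDW hook result: NULL means the foreign table
 * cannot be analyzed.
 */
SketchPlan
sketch_plan_analyze(SketchRelKind kind, bool has_children,
					SketchAcquireFunc fdw_acquire)
{
	SketchPlan	plan = {false, false, false, NULL};

	switch (kind)
	{
		case SKETCH_TABLE:
		case SKETCH_MATVIEW:
			plan.acquire = sketch_heap_acquire;
			break;
		case SKETCH_FOREIGN:
			if (fdw_acquire == NULL)
				return plan;	/* skipped with a warning in the real code */
			plan.acquire = fdw_acquire;
			break;
		case SKETCH_PARTITIONED:
			break;				/* holds no rows itself; recursive pass only */
		default:
			return plan;		/* not an analyzable relkind */
	}

	plan.analyzable = true;
	plan.run_plain = (kind != SKETCH_PARTITIONED);
	plan.run_inherited = has_children;

	/* Every plain pass must have a sampling function. */
	assert(!plan.run_plain || plan.acquire != NULL);
	return plan;
}
```

In this model a partitioned table with children gets only the inherited pass, while a regular table with children gets both passes, mirroring the two do_analyze_rel() calls above.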
/*
* do_analyze_rel() -- analyze one relation, recursively or not
*
* Note that "acquirefunc" is only relevant for the non-inherited case.
* For the inherited case, acquire_inherited_sample_rows() determines the
* appropriate acquirefunc for each child table.
*/
static void
do_analyze_rel(Relation onerel, VacuumParams *params,
List *va_cols, AcquireSampleRowsFunc acquirefunc,
BlockNumber relpages, bool inh, bool in_outer_xact,
int elevel)
{
int attr_cnt,
tcnt,
i,
ind;
Relation *Irel;
int nindexes;
bool hasindex;
VacAttrStats **vacattrstats;
AnlIndexData *indexdata;
int targrows,
numrows,
minrows;
double totalrows,
totaldeadrows;
HeapTuple *rows;
PGRUsage ru0;
TimestampTz starttime = 0;
MemoryContext caller_context;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
int64 AnalyzePageHit = VacuumPageHit;
int64 AnalyzePageMiss = VacuumPageMiss;
int64 AnalyzePageDirty = VacuumPageDirty;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
if (inh)
ereport(elevel,
(errmsg("analyzing \"%s.%s\" inheritance tree",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
else
ereport(elevel,
(errmsg("analyzing \"%s.%s\"",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
/*
* Set up a working context so that we can easily free whatever junk gets
* created.
*/
anl_context = AllocSetContextCreate(CurrentMemoryContext,
"Analyze",
ALLOCSET_DEFAULT_SIZES);
caller_context = MemoryContextSwitchTo(anl_context);
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
* arrange to make GUC variable changes local to this command.
*/
GetUserIdAndSecContext(&save_userid, &save_sec_context);
SetUserIdAndSecContext(onerel->rd_rel->relowner,
save_sec_context | SECURITY_RESTRICTED_OPERATION);
save_nestlevel = NewGUCNestLevel();
RestrictSearchPath();
/* measure elapsed time iff autovacuum logging requires it */
if (AmAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
{
if (track_io_timing)
{
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
pg_rusage_init(&ru0);
starttime = GetCurrentTimestamp();
}
/*
* Determine which columns to analyze
*
* Note that system attributes are never analyzed, so we just reject them
* at the lookup stage. We also reject duplicate column mentions. (We
* could alternatively ignore duplicates, but analyzing a column twice
* won't work; we'd end up making a conflicting update in pg_statistic.)
*/
if (va_cols != NIL)
{
Bitmapset *unique_cols = NULL;
ListCell *le;
vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
sizeof(VacAttrStats *));
tcnt = 0;
foreach(le, va_cols)
{
char *col = strVal(lfirst(le));
i = attnameAttNum(onerel, col, false);
if (i == InvalidAttrNumber)
ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_COLUMN),
errmsg("column \"%s\" of relation \"%s\" does not exist",
col, RelationGetRelationName(onerel))));
if (bms_is_member(i, unique_cols))
ereport(ERROR,
(errcode(ERRCODE_DUPLICATE_COLUMN),
errmsg("column \"%s\" of relation \"%s\" appears more than once",
col, RelationGetRelationName(onerel))));
unique_cols = bms_add_member(unique_cols, i);
vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
if (vacattrstats[tcnt] != NULL)
tcnt++;
}
attr_cnt = tcnt;
}
else
{
attr_cnt = onerel->rd_att->natts;
vacattrstats = (VacAttrStats **)
palloc(attr_cnt * sizeof(VacAttrStats *));
tcnt = 0;
for (i = 1; i <= attr_cnt; i++)
{
vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
if (vacattrstats[tcnt] != NULL)
tcnt++;
}
attr_cnt = tcnt;
}
/*
* Open all indexes of the relation, and see if there are any analyzable
* columns in the indexes. We do not analyze index columns if there was
* an explicit column list in the ANALYZE command, however.
*
* If we are doing a recursive scan, we don't want to touch the parent's
* indexes at all. If we're processing a partitioned table, we need to
* know if there are any indexes, but we don't want to process them.
*/
if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
List *idxs = RelationGetIndexList(onerel);
Irel = NULL;
nindexes = 0;
hasindex = idxs != NIL;
list_free(idxs);
}
else if (!inh)
{
vac_open_indexes(onerel, AccessShareLock, &nindexes, &Irel);
hasindex = nindexes > 0;
}
else
{
Irel = NULL;
nindexes = 0;
hasindex = false;
}
indexdata = NULL;
if (nindexes > 0)
{
indexdata = (AnlIndexData *) palloc0(nindexes * sizeof(AnlIndexData));
for (ind = 0; ind < nindexes; ind++)
{
AnlIndexData *thisdata = &indexdata[ind];
IndexInfo *indexInfo;
thisdata->indexInfo = indexInfo = BuildIndexInfo(Irel[ind]);
thisdata->tupleFract = 1.0; /* fix later if partial */
if (indexInfo->ii_Expressions != NIL && va_cols == NIL)
{
ListCell *indexpr_item = list_head(indexInfo->ii_Expressions);
thisdata->vacattrstats = (VacAttrStats **)
palloc(indexInfo->ii_NumIndexAttrs * sizeof(VacAttrStats *));
tcnt = 0;
for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
{
int keycol = indexInfo->ii_IndexAttrNumbers[i];
if (keycol == 0)
{
/* Found an index expression */
Node *indexkey;
if (indexpr_item == NULL) /* shouldn't happen */
elog(ERROR, "too few entries in indexprs list");
indexkey = (Node *) lfirst(indexpr_item);
indexpr_item = lnext(indexInfo->ii_Expressions,
indexpr_item);
thisdata->vacattrstats[tcnt] =
examine_attribute(Irel[ind], i + 1, indexkey);
if (thisdata->vacattrstats[tcnt] != NULL)
tcnt++;
}
}
thisdata->attr_cnt = tcnt;
}
}
}
/*
* Determine how many rows we need to sample, using the worst case from
* all analyzable columns. We use a lower bound of 100 rows to avoid
* possible overflow in Vitter's algorithm. (Note: that will also be the
* target in the corner case where there are no analyzable columns.)
*/
targrows = 100;
for (i = 0; i < attr_cnt; i++)
{
if (targrows < vacattrstats[i]->minrows)
targrows = vacattrstats[i]->minrows;
}
for (ind = 0; ind < nindexes; ind++)
{
AnlIndexData *thisdata = &indexdata[ind];
for (i = 0; i < thisdata->attr_cnt; i++)
{
if (targrows < thisdata->vacattrstats[i]->minrows)
targrows = thisdata->vacattrstats[i]->minrows;
}
}
/*
* Look at extended statistics objects too, as those may define a custom
* statistics target. So we may need to sample more rows and then build
* the statistics with enough detail.
*/
minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
if (targrows < minrows)
targrows = minrows;
/*
* Acquire the sample rows
*/
rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
if (inh)
numrows = acquire_inherited_sample_rows(onerel, elevel,
rows, targrows,
&totalrows, &totaldeadrows);
else
numrows = (*acquirefunc) (onerel, elevel,
rows, targrows,
&totalrows, &totaldeadrows);
/*
* Compute the statistics. Temporary results during the calculations for
* each column are stored in a child context. The calc routines are
* responsible to make sure that whatever they store into the VacAttrStats
* structure is allocated in anl_context.
*/
if (numrows > 0)
{
MemoryContext col_context,
old_context;
pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
PROGRESS_ANALYZE_PHASE_COMPUTE_STATS);
col_context = AllocSetContextCreate(anl_context,
"Analyze Column",
ALLOCSET_DEFAULT_SIZES);
old_context = MemoryContextSwitchTo(col_context);
for (i = 0; i < attr_cnt; i++)
{
VacAttrStats *stats = vacattrstats[i];
AttributeOpts *aopt;
stats->rows = rows;
stats->tupDesc = onerel->rd_att;
stats->compute_stats(stats,
std_fetch_func,
numrows,
totalrows);
/*
* If the appropriate flavor of the n_distinct option is
* specified, override with the corresponding value.
*/
aopt = get_attribute_options(onerel->rd_id, stats->tupattnum);
if (aopt != NULL)
{
float8 n_distinct;
n_distinct = inh ? aopt->n_distinct_inherited : aopt->n_distinct;
if (n_distinct != 0.0)
stats->stadistinct = n_distinct;
}
MemoryContextReset(col_context);
}
if (nindexes > 0)
compute_index_stats(onerel, totalrows,
indexdata, nindexes,
rows, numrows,
col_context);
MemoryContextSwitchTo(old_context);
MemoryContextDelete(col_context);
/*
* Emit the completed stats rows into pg_statistic, replacing any
* previous statistics for the target columns. (If there are stats in
* pg_statistic for columns we didn't process, we leave them alone.)
*/
update_attstats(RelationGetRelid(onerel), inh,
attr_cnt, vacattrstats);
for (ind = 0; ind < nindexes; ind++)
{
AnlIndexData *thisdata = &indexdata[ind];
update_attstats(RelationGetRelid(Irel[ind]), false,
thisdata->attr_cnt, thisdata->vacattrstats);
}
/* Build extended statistics (if there are any). */
BuildRelationExtStatistics(onerel, inh, totalrows, numrows, rows,
attr_cnt, vacattrstats);
}
pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE);
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
*
* We assume that VACUUM hasn't set pg_class.reltuples already, even
* during a VACUUM ANALYZE. Although VACUUM often updates pg_class,
* exceptions exist. A "VACUUM (ANALYZE, INDEX_CLEANUP OFF)" command will
* never update pg_class entries for index relations. It's also possible
* that an individual index's pg_class entry won't be updated during
* VACUUM if the index AM returns NULL from its amvacuumcleanup() routine.
*/
if (!inh)
{
BlockNumber relallvisible;
if (RELKIND_HAS_STORAGE(onerel->rd_rel->relkind))
visibilitymap_count(onerel, &relallvisible, NULL);
else
relallvisible = 0;
/* Update pg_class for table relation */
vac_update_relstats(onerel,
relpages,
totalrows,
relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
NULL, NULL,
in_outer_xact);
/* Same for indexes */
for (ind = 0; ind < nindexes; ind++)
{
AnlIndexData *thisdata = &indexdata[ind];
double totalindexrows;
totalindexrows = ceil(thisdata->tupleFract * totalrows);
vac_update_relstats(Irel[ind],
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
false,
InvalidTransactionId,
InvalidMultiXactId,
NULL, NULL,
in_outer_xact);
}
}
else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
/*
* Partitioned tables don't have storage, so we don't set any fields
* in their pg_class entries except for reltuples and relhasindex.
*/
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
NULL, NULL,
in_outer_xact);
}
/*
* Now report ANALYZE to the cumulative stats system. For regular tables,
* we do it only if not doing inherited stats. For partitioned tables, we
* only do it for inherited stats. (We're never called for not-inherited
* stats on partitioned tables anyway.)
*
* Reset the changes_since_analyze counter only if we analyzed all
* columns; otherwise, there is still work for auto-analyze to do.
*/
if (!inh)
pgstat_report_analyze(onerel, totalrows, totaldeadrows,
(va_cols == NIL));
else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
pgstat_report_analyze(onerel, 0, 0, (va_cols == NIL));
/*
* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
*
* Note that most index AMs perform a no-op as a matter of policy for
* amvacuumcleanup() when called in ANALYZE-only mode. The only exception
* among core index AMs is GIN/ginvacuumcleanup().
*/
if (!(params->options & VACOPT_VACUUM))
{
for (ind = 0; ind < nindexes; ind++)
{
IndexBulkDeleteResult *stats;
IndexVacuumInfo ivinfo;
ivinfo.index = Irel[ind];
ivinfo.heaprel = onerel;
ivinfo.analyze_only = true;
ivinfo.estimated_count = true;
ivinfo.message_level = elevel;
ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
ivinfo.strategy = vac_strategy;
stats = index_vacuum_cleanup(&ivinfo, NULL);
if (stats)
pfree(stats);
}
}
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
/* Log the action if appropriate */
if (AmAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
{
TimestampTz endtime = GetCurrentTimestamp();
if (params->log_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_min_duration))
{
long delay_in_ms;
double read_rate = 0;
double write_rate = 0;
StringInfoData buf;
/*
* Calculate the difference in the Page Hit/Miss/Dirty that
* happened as part of the analyze by subtracting out the
* pre-analyze values which we saved above.
*/
AnalyzePageHit = VacuumPageHit - AnalyzePageHit;
AnalyzePageMiss = VacuumPageMiss - AnalyzePageMiss;
AnalyzePageDirty = VacuumPageDirty - AnalyzePageDirty;
/*
* We do not expect an analyze to take > 25 days and it simplifies
* things a bit to use TimestampDifferenceMilliseconds.
*/
delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);
/*
* Note that we are reporting these read/write rates in the same
* manner as VACUUM does, which means that while the 'average read
* rate' here actually corresponds to page misses and resulting
* reads which are also picked up by track_io_timing, if enabled,
* the 'average write rate' is actually talking about the rate of
* pages being dirtied, not being written out, so it's typical to
* have a non-zero 'avg write rate' while I/O timings only report
* reads.
*
* It's not clear that an ANALYZE will ever result in
* FlushBuffer() being called, but we track and support reporting
* on I/O write time in case that changes as it's practically free
* to do so anyway.
*/
if (delay_in_ms > 0)
{
read_rate = (double) BLCKSZ * AnalyzePageMiss / (1024 * 1024) /
(delay_in_ms / 1000.0);
write_rate = (double) BLCKSZ * AnalyzePageDirty / (1024 * 1024) /
(delay_in_ms / 1000.0);
}
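/*
* Worked example with hypothetical numbers, assuming the default
* BLCKSZ of 8192 bytes: 1280 page misses over 2000 ms give
* 8192 * 1280 / (1024 * 1024) = 10 MB read in 2.0 s, so read_rate
* comes out as 5.000 MB/s. Both rates stay 0 when delay_in_ms is 0.
*/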
/*
* We split this up so we don't emit empty I/O timing values when
* track_io_timing isn't enabled.
*/
initStringInfo(&buf);
appendStringInfo(&buf, _("automatic analyze of table \"%s.%s.%s\"\n"),
get_database_name(MyDatabaseId),
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel));
if (track_io_timing)
{
double read_ms = (double) (pgStatBlockReadTime - startreadtime) / 1000;
double write_ms = (double) (pgStatBlockWriteTime - startwritetime) / 1000;
appendStringInfo(&buf, _("I/O timings: read: %.3f ms, write: %.3f ms\n"),
read_ms, write_ms);
}
appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
read_rate, write_rate);
appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
(long long) AnalyzePageHit,
(long long) AnalyzePageMiss,
(long long) AnalyzePageDirty);
appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
ereport(LOG,
(errmsg_internal("%s", buf.data)));
pfree(buf.data);
}
}
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
/* Restore userid and security context */
SetUserIdAndSecContext(save_userid, save_sec_context);
/* Restore current context and release memory */
MemoryContextSwitchTo(caller_context);
MemoryContextDelete(anl_context);
anl_context = NULL;
}
/*
* Compute statistics about indexes of a relation
*/
static void
compute_index_stats(Relation onerel, double totalrows,
AnlIndexData *indexdata, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context)
{
MemoryContext ind_context,
old_context;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
int ind,
i;
ind_context = AllocSetContextCreate(anl_context,
"Analyze Index",
ALLOCSET_DEFAULT_SIZES);
old_context = MemoryContextSwitchTo(ind_context);
for (ind = 0; ind < nindexes; ind++)
{
AnlIndexData *thisdata = &indexdata[ind];
IndexInfo *indexInfo = thisdata->indexInfo;
int attr_cnt = thisdata->attr_cnt;
TupleTableSlot *slot;
EState *estate;
ExprContext *econtext;
ExprState *predicate;
Datum *exprvals;
bool *exprnulls;
int numindexrows,
tcnt,
rowno;
double totalindexrows;
/* Ignore index if no columns to analyze and not partial */
if (attr_cnt == 0 && indexInfo->ii_Predicate == NIL)
continue;
/*
* Need an EState for evaluation of index expressions and
* partial-index predicates. Create it in the per-index context to be
* sure it gets cleaned up at the bottom of the loop.
*/
estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
/* Need a slot to hold the current heap tuple, too */
slot = MakeSingleTupleTableSlot(RelationGetDescr(onerel),
&TTSOpsHeapTuple);
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
/* Set up execution state for predicate. */
predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
/* Compute and save index expression values */
exprvals = (Datum *) palloc(numrows * attr_cnt * sizeof(Datum));
exprnulls = (bool *) palloc(numrows * attr_cnt * sizeof(bool));
numindexrows = 0;
tcnt = 0;
for (rowno = 0; rowno < numrows; rowno++)
{
HeapTuple heapTuple = rows[rowno];
vacuum_delay_point();
/*
* Reset the per-tuple context each time, to reclaim any cruft
* left behind by evaluating the predicate or index expressions.
*/
ResetExprContext(econtext);
/* Set up for predicate or expression evaluation */
ExecStoreHeapTuple(heapTuple, slot, false);
/* If index is partial, check predicate */
if (predicate != NULL)
{
if (!ExecQual(predicate, econtext))
continue;
}
numindexrows++;
if (attr_cnt > 0)
{
/*
* Evaluate the index row to compute expression values. We
* could do this by hand, but FormIndexDatum is convenient.
*/
FormIndexDatum(indexInfo,
slot,
estate,
values,
isnull);
/*
* Save just the columns we care about. We copy the values
* into ind_context from the estate's per-tuple context.
*/
for (i = 0; i < attr_cnt; i++)
{
VacAttrStats *stats = thisdata->vacattrstats[i];
int attnum = stats->tupattnum;
if (isnull[attnum - 1])
{
exprvals[tcnt] = (Datum) 0;
exprnulls[tcnt] = true;
}
else
{
exprvals[tcnt] = datumCopy(values[attnum - 1],
stats->attrtype->typbyval,
stats->attrtype->typlen);
exprnulls[tcnt] = false;
}
tcnt++;
}
}
}
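/*
* Layout sketch, as implied by the loop above: exprvals[] and
* exprnulls[] are filled row-major, attr_cnt entries per sampled row
* that passed the predicate. With a hypothetical attr_cnt of 2 and
* three such rows r0..r2, the arrays hold
* [r0c0, r0c1, r1c0, r1c1, r2c0, r2c1], so column i of row r sits at
* index r * attr_cnt + i and tcnt ends at numindexrows * attr_cnt.
*/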
/*
* Having counted the number of rows that pass the predicate in the
* sample, we can estimate the total number of rows in the index.
*/
thisdata->tupleFract = (double) numindexrows / (double) numrows;
totalindexrows = ceil(thisdata->tupleFract * totalrows);
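/*
* Worked example with hypothetical numbers: if 7 of 30000 sampled rows
* pass the predicate and totalrows is 1000000, tupleFract is about
* 0.000233 and totalindexrows = ceil(233.33...) = 234.
*/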
/*
* Now we can compute the statistics for the expression columns.
*/
if (numindexrows > 0)
{
MemoryContextSwitchTo(col_context);
for (i = 0; i < attr_cnt; i++)
{
VacAttrStats *stats = thisdata->vacattrstats[i];
stats->exprvals = exprvals + i;
stats->exprnulls = exprnulls + i;
stats->rowstride = attr_cnt;
stats->compute_stats(stats,
ind_fetch_func,
numindexrows,
totalindexrows);
MemoryContextReset(col_context);
}
}
/* And clean up */
MemoryContextSwitchTo(ind_context);
ExecDropSingleTupleTableSlot(slot);
FreeExecutorState(estate);
MemoryContextReset(ind_context);
}
MemoryContextSwitchTo(old_context);
MemoryContextDelete(ind_context);
}
/*
* examine_attribute -- pre-analysis of a single column
*
* Determine whether the column is analyzable; if so, create and initialize
* a VacAttrStats struct for it. If not, return NULL.
*
* If index_expr isn't NULL, then we're trying to analyze an expression index,
* and index_expr is the expression tree representing the column's data.
*/
static VacAttrStats *
examine_attribute(Relation onerel, int attnum, Node *index_expr)
{
Form_pg_attribute attr = TupleDescAttr(onerel->rd_att, attnum - 1);
int attstattarget;
HeapTuple atttuple;
Datum dat;
bool isnull;
HeapTuple typtuple;
VacAttrStats *stats;
int i;
bool ok;
/* Never analyze dropped columns */
if (attr->attisdropped)
return NULL;
/*
* Get attstattarget value. Set to -1 if null. (Analyze functions expect
* -1 to mean use default_statistics_target; see for example
* std_typanalyze.)
*/
atttuple = SearchSysCache2(ATTNUM, ObjectIdGetDatum(RelationGetRelid(onerel)), Int16GetDatum(attnum));
if (!HeapTupleIsValid(atttuple))
elog(ERROR, "cache lookup failed for attribute %d of relation %u",
attnum, RelationGetRelid(onerel));
dat = SysCacheGetAttr(ATTNUM, atttuple, Anum_pg_attribute_attstattarget, &isnull);
attstattarget = isnull ? -1 : DatumGetInt16(dat);
ReleaseSysCache(atttuple);
/* Don't analyze column if user has specified not to */
if (attstattarget == 0)
return NULL;
/*
* Create the VacAttrStats struct.
*/
stats = (VacAttrStats *) palloc0(sizeof(VacAttrStats));
stats->attstattarget = attstattarget;
/*
* When analyzing an expression index, believe the expression tree's type
* not the column datatype --- the latter might be the opckeytype storage
* type of the opclass, which is not interesting for our purposes. (Note:
* if we did anything with non-expression index columns, we'd need to
* figure out where to get the correct type info from, but for now that's
* not a problem.) It's not clear whether anyone will care about the
* typmod, but we store that too just in case.
*/
if (index_expr)
{
stats->attrtypid = exprType(index_expr);
stats->attrtypmod = exprTypmod(index_expr);
/*
* If a collation has been specified for the index column, use that in
* preference to anything else; but if not, fall back to whatever we
* can get from the expression.
*/
if (OidIsValid(onerel->rd_indcollation[attnum - 1]))
stats->attrcollid = onerel->rd_indcollation[attnum - 1];
else
stats->attrcollid = exprCollation(index_expr);
}
else
{
stats->attrtypid = attr->atttypid;
stats->attrtypmod = attr->atttypmod;
stats->attrcollid = attr->attcollation;
}
typtuple = SearchSysCacheCopy1(TYPEOID,
ObjectIdGetDatum(stats->attrtypid));
if (!HeapTupleIsValid(typtuple))
elog(ERROR, "cache lookup failed for type %u", stats->attrtypid);
stats->attrtype = (Form_pg_type) GETSTRUCT(typtuple);
stats->anl_context = anl_context;
stats->tupattnum = attnum;
/*
* The fields describing the stats->stavalues[n] element types default to
* the type of the data being analyzed, but the type-specific typanalyze
* function can change them if it wants to store something else.
*/
for (i = 0; i < STATISTIC_NUM_SLOTS; i++)
{
stats->statypid[i] = stats->attrtypid;
stats->statyplen[i] = stats->attrtype->typlen;
stats->statypbyval[i] = stats->attrtype->typbyval;
stats->statypalign[i] = stats->attrtype->typalign;
}
/*
* Call the type-specific typanalyze function. If none is specified, use
* std_typanalyze().
*/
if (OidIsValid(stats->attrtype->typanalyze))
ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
PointerGetDatum(stats)));
else
ok = std_typanalyze(stats);
if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)
{
heap_freetuple(typtuple);
pfree(stats);
return NULL;
}
return stats;
}
/*
* Read stream callback returning the next BlockNumber as chosen by the
* BlockSampling algorithm.
*/
static BlockNumber
block_sampling_read_stream_next(ReadStream *stream,
void *callback_private_data,
void *per_buffer_data)
{
BlockSamplerData *bs = callback_private_data;
return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
}
/*
* acquire_sample_rows -- acquire a random sample of rows from the table
*
* Selected rows are returned in the caller-allocated array rows[], which
* must have at least targrows entries.
* The actual number of rows selected is returned as the function result.
* We also estimate the total numbers of live and dead rows in the table,
* and return them into *totalrows and *totaldeadrows, respectively.
*
* The returned list of tuples is in order by physical position in the table.
* (We will rely on this later to derive correlation estimates.)
*
* As of May 2004 we use a new two-stage method: Stage one selects up
* to targrows random blocks (or all blocks, if there aren't so many).
* Stage two scans these blocks and uses the Vitter algorithm to create
* a random sample of targrows rows (or fewer, if there are fewer in the
* sample of blocks). The two stages are executed simultaneously: each
* block is processed as soon as stage one returns its number and while
* the rows are read stage two controls which ones are to be inserted
* into the sample.
*
* Although every row has an equal chance of ending up in the final
* sample, this sampling method is not perfect: not every possible
* sample has an equal chance of being selected. For large relations
* the number of different blocks represented by the sample tends to be
* too small. We can live with that for now. Improvements are welcome.
*
* An important property of this sampling method is that because we do
* look at a statistically unbiased set of blocks, we should get
* unbiased estimates of the average numbers of live and dead rows per
* block. The previous sampling method put too much credence in the row
* density near the start of the table.
*/
static int
acquire_sample_rows(Relation onerel, int elevel,
HeapTuple *rows, int targrows,
double *totalrows, double *totaldeadrows)
{
int numrows = 0; /* # rows now in reservoir */
double samplerows = 0; /* total # rows collected */
double liverows = 0; /* # live rows seen */
double deadrows = 0; /* # dead rows seen */
double rowstoskip = -1; /* -1 means not set yet */
uint32 randseed; /* Seed for block sampler(s) */
BlockNumber totalblocks;
TransactionId OldestXmin;
BlockSamplerData bs;
ReservoirStateData rstate;
TupleTableSlot *slot;
TableScanDesc scan;
BlockNumber nblocks;
BlockNumber blksdone = 0;
ReadStream *stream;
Assert(targrows > 0);
totalblocks = RelationGetNumberOfBlocks(onerel);
/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
OldestXmin = GetOldestNonRemovableTransactionId(onerel);
/* Prepare for sampling block numbers */
randseed = pg_prng_uint32(&pg_global_prng_state);
nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
/* Report sampling block numbers */
pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
nblocks);
/* Prepare for sampling rows */
reservoir_init_selection_state(&rstate, targrows);
scan = table_beginscan_analyze(onerel);
slot = table_slot_create(onerel, NULL);
stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
vac_strategy,
scan->rs_rd,
MAIN_FORKNUM,
block_sampling_read_stream_next,
&bs,
0);
/* Outer loop over blocks to sample */
while (table_scan_analyze_next_block(scan, stream))
{
vacuum_delay_point();
while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
/*
* The first targrows sample rows are simply copied into the
* reservoir. Then we start replacing tuples in the sample until
* we reach the end of the relation. This algorithm is from Jeff
* Vitter's paper (see full citation in utils/misc/sampling.c). It
* works by repeatedly computing the number of tuples to skip
* before selecting a tuple, which replaces a randomly chosen
* element of the reservoir (current set of tuples). At all times
* the reservoir is a true random sample of the tuples we've
* passed over so far, so when we fall off the end of the relation
* we're done.
*/
if (numrows < targrows)
rows[numrows++] = ExecCopySlotHeapTuple(slot);
else
{
/*
* t in Vitter's paper is the number of records already
* processed. If we need to compute a new S value, we must
* use the not-yet-incremented value of samplerows as t.
*/
if (rowstoskip < 0)
rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
if (rowstoskip <= 0)
{
/*
* Found a suitable tuple, so save it, replacing one old
* tuple at random
*/
int k = (int) (targrows * sampler_random_fract(&rstate.randstate));
Assert(k >= 0 && k < targrows);
heap_freetuple(rows[k]);
rows[k] = ExecCopySlotHeapTuple(slot);
}
rowstoskip -= 1;
}
samplerows += 1;
}
pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
++blksdone);
}
read_stream_end(stream);
ExecDropSingleTupleTableSlot(slot);
table_endscan(scan);
/*
* If we didn't find as many tuples as we wanted then we're done. No sort
* is needed, since they're already in order.
*
* Otherwise we need to sort the collected tuples by position
* (itempointer). It's not worth worrying about corner cases where the
* tuples are already sorted.
*/
if (numrows == targrows)
qsort_interruptible(rows, numrows, sizeof(HeapTuple),
compare_rows, NULL);
/*
* Estimate total numbers of live and dead rows in relation, extrapolating
* on the assumption that the average tuple density in pages we didn't
* scan is the same as in the pages we did scan. Since what we scanned is
* a random sample of the pages in the relation, this should be a good
* assumption.
*/
if (bs.m > 0)
{
*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
}
else
{
*totalrows = 0.0;
*totaldeadrows = 0.0;
}
/*
* Emit some interesting relation info
*/
ereport(elevel,
(errmsg("\"%s\": scanned %d of %u pages, "
"containing %.0f live rows and %.0f dead rows; "
"%d rows in sample, %.0f estimated total rows",
RelationGetRelationName(onerel),
bs.m, totalblocks,
liverows, deadrows,
numrows, *totalrows)));
return numrows;
}
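/*
* Illustrative worked example of the extrapolation in acquire_sample_rows
* (hypothetical numbers, not taken from any real table): suppose
* totalblocks = 1000000, bs.m = 30000 blocks were scanned, and the scan
* saw liverows = 1500000 and deadrows = 60000. Then
*
* *totalrows = floor((1500000 / 30000) * 1000000 + 0.5) = 50000000
* *totaldeadrows = floor((60000 / 30000) * 1000000 + 0.5) = 2000000
*
* that is, 50 live rows and 2 dead rows per sampled block, scaled up to
* the whole relation.
*/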
/*
* Comparator for sorting rows[] array
*/
static int
compare_rows(const void *a, const void *b, void *arg)
{
HeapTuple ha = *(const HeapTuple *) a;
HeapTuple hb = *(const HeapTuple *) b;
BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
if (ba < bb)
return -1;
if (ba > bb)
return 1;
if (oa < ob)
return -1;
if (oa > ob)
return 1;
return 0;
}
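/*
* Illustrative example of the ordering compare_rows produces (hypothetical
* TIDs, written here as (block, offset)): (3,7) sorts before (3,9), which
* sorts before (4,1). The block number is compared first; the offset
* number only breaks ties within a block.
*/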
/*
* acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
*
* This has the same API as acquire_sample_rows, except that rows are
* collected from all inheritance children as well as the specified table.
* We fail and return zero if there are no inheritance children, or if all
* children are foreign tables that don't support ANALYZE.
*/
static int
acquire_inherited_sample_rows(Relation onerel, int elevel,
HeapTuple *rows, int targrows,
double *totalrows, double *totaldeadrows)
{
List *tableOIDs;
Relation *rels;
AcquireSampleRowsFunc *acquirefuncs;
double *relblocks;
double totalblocks;
int numrows,
nrels,
i;
ListCell *lc;
bool has_child;
/* Initialize output parameters to zero now, in case we exit early */
*totalrows = 0;
*totaldeadrows = 0;
/*
* Find all members of inheritance set. We only need AccessShareLock on
* the children.
*/
tableOIDs =
find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
/*
* Check that there's at least one descendant, else fail. This could
* happen despite analyze_rel's relhassubclass check, if the table once had a
* child but no longer does. In that case, we can clear the
* relhassubclass field so as not to make the same mistake again later.
* (This is safe because we hold ShareUpdateExclusiveLock.)
*/
if (list_length(tableOIDs) < 2)
{
/* CCI because we already updated the pg_class row in this command */
CommandCounterIncrement();
SetRelationHasSubclass(RelationGetRelid(onerel), false);
ereport(elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
return 0;
}
/*
* Identify acquirefuncs to use, and count blocks in all the relations.
* The result could overflow BlockNumber, so we use double arithmetic.
*/
rels = (Relation *) palloc(list_length(tableOIDs) * sizeof(Relation));
acquirefuncs = (AcquireSampleRowsFunc *)
palloc(list_length(tableOIDs) * sizeof(AcquireSampleRowsFunc));
relblocks = (double *) palloc(list_length(tableOIDs) * sizeof(double));
totalblocks = 0;
nrels = 0;
has_child = false;
foreach(lc, tableOIDs)
{
Oid childOID = lfirst_oid(lc);
Relation childrel;
AcquireSampleRowsFunc acquirefunc = NULL;
BlockNumber relpages = 0;
/* We already got the needed lock */
childrel = table_open(childOID, NoLock);
/* Ignore if temp table of another backend */
if (RELATION_IS_OTHER_TEMP(childrel))
{
/* ... but release the lock on it */
Assert(childrel != onerel);
table_close(childrel, AccessShareLock);
continue;
}
/* Check table type (MATVIEW can't happen, but might as well allow) */
if (childrel->rd_rel->relkind == RELKIND_RELATION ||
childrel->rd_rel->relkind == RELKIND_MATVIEW)
{
/* Regular table, so use the regular row acquisition function */
acquirefunc = acquire_sample_rows;
relpages = RelationGetNumberOfBlocks(childrel);
}
else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
{
/*
* For a foreign table, call the FDW's hook function to see
* whether it supports analysis.
*/
FdwRoutine *fdwroutine;
bool ok = false;
fdwroutine = GetFdwRoutineForRelation(childrel, false);
if (fdwroutine->AnalyzeForeignTable != NULL)
ok = fdwroutine->AnalyzeForeignTable(childrel,
&acquirefunc,
&relpages);
if (!ok)
{
/* ignore, but release the lock on it */
Assert(childrel != onerel);
table_close(childrel, AccessShareLock);
continue;
}
}
else
{
/*
* Ignore, but release the lock on it.  Don't try to unlock the
* passed-in relation.
*/
Assert(childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
if (childrel != onerel)
table_close(childrel, AccessShareLock);
else
table_close(childrel, NoLock);
continue;
}
/* OK, we'll process this child */
has_child = true;
rels[nrels] = childrel;
acquirefuncs[nrels] = acquirefunc;
relblocks[nrels] = (double) relpages;
totalblocks += (double) relpages;
nrels++;
}
/*
* If we don't have at least one child table to consider, fail. If the
* relation is a partitioned table, it's not counted as a child table.
*/
if (!has_child)
{
ereport(elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no analyzable child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
return 0;
}
/*
* Now sample rows from each relation, proportionally to its fraction of
* the total block count. (This might be less than desirable if the child
* rels have radically different free-space percentages, but it's not
* clear that it's worth working harder.)
*/
pgstat_progress_update_param(PROGRESS_ANALYZE_CHILD_TABLES_TOTAL,
nrels);
numrows = 0;
for (i = 0; i < nrels; i++)
{
Relation childrel = rels[i];
AcquireSampleRowsFunc acquirefunc = acquirefuncs[i];
double childblocks = relblocks[i];
/*
* Report progress.  The sampling function will normally report blocks
* done/total, but we need to reset them to 0 here, so that they don't
* show stale values from the previous child until it does.
*/
{
const int progress_index[] = {
PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID,
PROGRESS_ANALYZE_BLOCKS_DONE,
PROGRESS_ANALYZE_BLOCKS_TOTAL
};
const int64 progress_vals[] = {
RelationGetRelid(childrel),
0,
0,
};
pgstat_progress_update_multi_param(3, progress_index, progress_vals);
}
if (childblocks > 0)
{
int childtargrows;
childtargrows = (int) rint(targrows * childblocks / totalblocks);
/* Make sure we don't overrun due to roundoff error */
childtargrows = Min(childtargrows, targrows - numrows);
if (childtargrows > 0)
{
int childrows;
double trows,
tdrows;
/* Fetch a random sample of the child's rows */
childrows = (*acquirefunc) (childrel, elevel,
rows + numrows, childtargrows,
&trows, &tdrows);
/* We may need to convert from child's rowtype to parent's */
if (childrows > 0 &&
!equalRowTypes(RelationGetDescr(childrel),
RelationGetDescr(onerel)))
{
TupleConversionMap *map;
map = convert_tuples_by_name(RelationGetDescr(childrel),
RelationGetDescr(onerel));
if (map != NULL)
{
int j;
for (j = 0; j < childrows; j++)
{
HeapTuple newtup;
newtup = execute_attr_map_tuple(rows[numrows + j], map);
heap_freetuple(rows[numrows + j]);
rows[numrows + j] = newtup;
}
free_conversion_map(map);
}
}
/* And add to counts */
numrows += childrows;
*totalrows += trows;
*totaldeadrows += tdrows;
}
}
/*
* Note: we cannot release the child-table locks, since we may have
* pointers to their TOAST tables in the sampled rows.
*/
table_close(childrel, NoLock);
pgstat_progress_update_param(PROGRESS_ANALYZE_CHILD_TABLES_DONE,
i + 1);
}
return numrows;
}
/*
* update_attstats() -- update attribute statistics for one relation
*
* Statistics are stored in several places: the pg_class row for the
* relation has stats about the whole relation, and there is a
* pg_statistic row for each (non-system) attribute that has ever
* been analyzed. The pg_class values are updated by VACUUM, not here.
*
* pg_statistic rows are just added or updated normally. This means
* that pg_statistic will probably contain some deleted rows at the
* completion of a vacuum cycle, unless it happens to get vacuumed last.
*
* To keep things simple, we punt for pg_statistic, and don't try
* to compute or store rows for pg_statistic itself in pg_statistic.
* This could possibly be made to work, but it's not worth the trouble.
* Note analyze_rel() has seen to it that we won't come here when
* vacuuming pg_statistic itself.
*
* Note: there would be a race condition here if two backends could
* ANALYZE the same table concurrently. Presently, we lock that out
* by taking a self-exclusive lock on the relation in analyze_rel().
*/
static void
update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
{
Relation sd;
int attno;
CatalogIndexState indstate = NULL;
if (natts <= 0)
return; /* nothing to do */
sd = table_open(StatisticRelationId, RowExclusiveLock);
for (attno = 0; attno < natts; attno++)
{
VacAttrStats *stats = vacattrstats[attno];
HeapTuple stup,
oldtup;
int i,
k,
n;
Datum values[Natts_pg_statistic];
bool nulls[Natts_pg_statistic];
bool replaces[Natts_pg_statistic];
/* Ignore attr if we weren't able to collect stats */
if (!stats->stats_valid)
continue;
/*
* Construct a new pg_statistic tuple
*/
for (i = 0; i < Natts_pg_statistic; ++i)
{
nulls[i] = false;
replaces[i] = true;
}
values[Anum_pg_statistic_starelid - 1] = ObjectIdGetDatum(relid);
values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(stats->tupattnum);
values[Anum_pg_statistic_stainherit - 1] = BoolGetDatum(inh);
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
i = Anum_pg_statistic_stakind1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
values[i++] = Int16GetDatum(stats->stakind[k]); /* stakindN */
}
i = Anum_pg_statistic_staop1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
values[i++] = ObjectIdGetDatum(stats->staop[k]); /* staopN */
}
i = Anum_pg_statistic_stacoll1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
values[i++] = ObjectIdGetDatum(stats->stacoll[k]); /* stacollN */
}
i = Anum_pg_statistic_stanumbers1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
int nnum = stats->numnumbers[k];
if (nnum > 0)
{
Datum *numdatums = (Datum *) palloc(nnum * sizeof(Datum));
ArrayType *arry;
for (n = 0; n < nnum; n++)
numdatums[n] = Float4GetDatum(stats->stanumbers[k][n]);
arry = construct_array_builtin(numdatums, nnum, FLOAT4OID);
values[i++] = PointerGetDatum(arry); /* stanumbersN */
}
else
{
nulls[i] = true;
values[i++] = (Datum) 0;
}
}
i = Anum_pg_statistic_stavalues1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
if (stats->numvalues[k] > 0)
{
ArrayType *arry;
arry = construct_array(stats->stavalues[k],
stats->numvalues[k],
stats->statypid[k],
stats->statyplen[k],
stats->statypbyval[k],
stats->statypalign[k]);
values[i++] = PointerGetDatum(arry); /* stavaluesN */
}
else
{
nulls[i] = true;
values[i++] = (Datum) 0;
}
}
/* Is there already a pg_statistic tuple for this attribute? */
oldtup = SearchSysCache3(STATRELATTINH,
ObjectIdGetDatum(relid),
Int16GetDatum(stats->tupattnum),
BoolGetDatum(inh));
/* Open index information when we know we need it */
if (indstate == NULL)
indstate = CatalogOpenIndexes(sd);
if (HeapTupleIsValid(oldtup))
{
/* Yes, replace it */
stup = heap_modify_tuple(oldtup,
RelationGetDescr(sd),
values,
nulls,
replaces);
ReleaseSysCache(oldtup);
CatalogTupleUpdateWithInfo(sd, &stup->t_self, stup, indstate);
}
else
{
/* No, insert new tuple */
stup = heap_form_tuple(RelationGetDescr(sd), values, nulls);
CatalogTupleInsertWithInfo(sd, stup, indstate);
}
heap_freetuple(stup);
}
if (indstate != NULL)
CatalogCloseIndexes(indstate);
table_close(sd, RowExclusiveLock);
}
/*
* Standard fetch function for use by compute_stats subroutines.
*
* This exists to provide some insulation between compute_stats routines
* and the actual storage of the sample data.
*/
static Datum
std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull)
{
int attnum = stats->tupattnum;
HeapTuple tuple = stats->rows[rownum];
TupleDesc tupDesc = stats->tupDesc;
return heap_getattr(tuple, attnum, tupDesc, isNull);
}
/*
* Fetch function for analyzing index expressions.
*
* We have not bothered to construct index tuples; instead the data is
* just in Datum arrays.
*/
static Datum
ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull)
{
int i;
/* exprvals and exprnulls are already offset for the proper column */
i = rownum * stats->rowstride;
*isNull = stats->exprnulls[i];
return stats->exprvals[i];
}
/*==========================================================================
*
* Code below this point represents the "standard" type-specific statistics
* analysis algorithms. This code can be replaced on a per-data-type basis
* by setting a nonzero value in pg_type.typanalyze.
*
*==========================================================================
*/
/*
* To avoid consuming too much memory during analysis and/or too much space
* in the resulting pg_statistic rows, we ignore varlena datums that are wider
* than WIDTH_THRESHOLD (after detoasting!). This is legitimate for MCV
* and distinct-value calculations since a wide value is unlikely to be
* duplicated at all, much less be a most-common value. For the same reason,
* ignoring wide values will not affect our estimates of histogram bin
* boundaries very much.
*/
#define WIDTH_THRESHOLD 1024
#define swapInt(a,b) do {int _tmp; _tmp=a; a=b; b=_tmp;} while(0)
#define swapDatum(a,b) do {Datum _tmp; _tmp=a; a=b; b=_tmp;} while(0)
/*
* Extra information used by the default analysis routines
*/
typedef struct
{
int count; /* # of duplicates */
int first; /* values[] index of first occurrence */
} ScalarMCVItem;
typedef struct
{
SortSupport ssup;
int *tupnoLink;
} CompareScalarsContext;
static void compute_trivial_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows);
static void compute_distinct_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows);
static void compute_scalar_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows);
static int compare_scalars(const void *a, const void *b, void *arg);
static int compare_mcvs(const void *a, const void *b, void *arg);
static int analyze_mcv_list(int *mcv_counts,
int num_mcv,
double stadistinct,
double stanullfrac,
int samplerows,
double totalrows);
/*
* std_typanalyze -- the default type-specific typanalyze function
*/
bool
std_typanalyze(VacAttrStats *stats)
{
Oid ltopr;
Oid eqopr;
StdAnalyzeData *mystats;
/* If the attstattarget column is negative, use the default value */
if (stats->attstattarget < 0)
stats->attstattarget = default_statistics_target;
/* Look for default "<" and "=" operators for column's type */
get_sort_group_operators(stats->attrtypid,
false, false, false,
&ltopr, &eqopr, NULL,
NULL);
/* Save the operator info for compute_stats routines */
mystats = (StdAnalyzeData *) palloc(sizeof(StdAnalyzeData));
mystats->eqopr = eqopr;
mystats->eqfunc = OidIsValid(eqopr) ? get_opcode(eqopr) : InvalidOid;
mystats->ltopr = ltopr;
stats->extra_data = mystats;
/*
* Determine which standard statistics algorithm to use
*/
if (OidIsValid(eqopr) && OidIsValid(ltopr))
{
/* Seems to be a scalar datatype */
stats->compute_stats = compute_scalar_stats;
/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
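/*
 * Worked check of the figures quoted above (illustrative arithmetic only,
 * not part of the cited paper): with f = 0.5, gamma = 0.01, n = 10^6,
 *	ln(2*n/gamma) = ln(2 * 10^8) ~= 19.114
 *	r = 4 * k * 19.114 / 0.25 ~= 305.82 * k
 * Conversely, holding r = 300 * k fixed at n = 10^12,
 *	f^2 = 4 * ln(2 * 10^14) / 300 ~= 4 * 32.929 / 300 ~= 0.439
 *	f ~= 0.66
 */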
stats->minrows = 300 * stats->attstattarget;
}
else if (OidIsValid(eqopr))
{
/* We can still recognize distinct values */
stats->compute_stats = compute_distinct_stats;
/* Might as well use the same minrows as above */
stats->minrows = 300 * stats->attstattarget;
}
else
{
/* Can't do much but the trivial stuff */
stats->compute_stats = compute_trivial_stats;
/* Might as well use the same minrows as above */
stats->minrows = 300 * stats->attstattarget;
}
return true;
}
/*
* compute_trivial_stats() -- compute very basic column statistics
*
* We use this when we cannot find a default "=" operator for the datatype.
*
* We determine the fraction of non-null rows and the average datum width.
*/
static void
compute_trivial_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows)
{
int i;
int null_cnt = 0;
int nonnull_cnt = 0;
double total_width = 0;
bool is_varlena = (!stats->attrtype->typbyval &&
stats->attrtype->typlen == -1);
bool is_varwidth = (!stats->attrtype->typbyval &&
stats->attrtype->typlen < 0);
for (i = 0; i < samplerows; i++)
{
Datum value;
bool isnull;
vacuum_delay_point();
value = fetchfunc(stats, i, &isnull);
/* Check for null/nonnull */
if (isnull)
{
null_cnt++;
continue;
}
nonnull_cnt++;
/*
* If it's a variable-width field, add up widths for average width
* calculation. Note that if the value is toasted, we use the toasted
* width. We don't bother with this calculation if it's a fixed-width
* type.
*/
if (is_varlena)
{
total_width += VARSIZE_ANY(DatumGetPointer(value));
}
else if (is_varwidth)
{
/* must be cstring */
total_width += strlen(DatumGetCString(value)) + 1;
}
}
/* We can only compute average width if we found some non-null values. */
if (nonnull_cnt > 0)
{
stats->stats_valid = true;
/* Do the simple null-frac and width stats */
stats->stanullfrac = (double) null_cnt / (double) samplerows;
if (is_varwidth)
stats->stawidth = total_width / (double) nonnull_cnt;
else
stats->stawidth = stats->attrtype->typlen;
stats->stadistinct = 0.0; /* "unknown" */
}
else if (null_cnt > 0)
{
/* We found only nulls; assume the column is entirely null */
stats->stats_valid = true;
stats->stanullfrac = 1.0;
if (is_varwidth)
stats->stawidth = 0; /* "unknown" */
else
stats->stawidth = stats->attrtype->typlen;
stats->stadistinct = 0.0; /* "unknown" */
}
}
/*
* compute_distinct_stats() -- compute column statistics including ndistinct
*
* We use this when we can find only an "=" operator for the datatype.
*
* We determine the fraction of non-null rows, the average width, the
* most common values, and the (estimated) number of distinct values.
*
* The most common values are determined by brute force: we keep a list
* of previously seen values, ordered by number of times seen, as we scan
* the samples. A newly seen value is inserted just after the last
* multiply-seen value, causing the bottommost (oldest) singly-seen value
* to drop off the list. The accuracy of this method, and also its cost,
* depend mainly on the length of the list we are willing to keep.
*/
static void
compute_distinct_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows)
{
int i;
int null_cnt = 0;
int nonnull_cnt = 0;
int toowide_cnt = 0;
double total_width = 0;
bool is_varlena = (!stats->attrtype->typbyval &&
stats->attrtype->typlen == -1);
bool is_varwidth = (!stats->attrtype->typbyval &&
stats->attrtype->typlen < 0);
FmgrInfo f_cmpeq;
typedef struct
{
Datum value;
int count;
} TrackItem;
TrackItem *track;
int track_cnt,
track_max;
int num_mcv = stats->attstattarget;
StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
/*
* We track up to 2*n values for an n-element MCV list; but at least 10
*/
track_max = 2 * num_mcv;
if (track_max < 10)
track_max = 10;
track = (TrackItem *) palloc(track_max * sizeof(TrackItem));
track_cnt = 0;
fmgr_info(mystats->eqfunc, &f_cmpeq);
for (i = 0; i < samplerows; i++)
{
Datum value;
bool isnull;
bool match;
int firstcount1,
j;
vacuum_delay_point();
value = fetchfunc(stats, i, &isnull);
/* Check for null/nonnull */
if (isnull)
{
null_cnt++;
continue;
}
nonnull_cnt++;
/*
* If it's a variable-width field, add up widths for average width
* calculation. Note that if the value is toasted, we use the toasted
* width. We don't bother with this calculation if it's a fixed-width
* type.
*/
if (is_varlena)
{
total_width += VARSIZE_ANY(DatumGetPointer(value));
/*
* If the value is toasted, we want to detoast it just once to
* avoid repeated detoastings and resultant excess memory usage
* during the comparisons. Also, check to see if the value is
* excessively wide, and if so don't detoast at all --- just
* ignore the value.
*/
if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
{
toowide_cnt++;
continue;
}
value = PointerGetDatum(PG_DETOAST_DATUM(value));
}
else if (is_varwidth)
{
/* must be cstring */
total_width += strlen(DatumGetCString(value)) + 1;
}
/*
* See if the value matches anything we're already tracking.
*/
match = false;
firstcount1 = track_cnt;
for (j = 0; j < track_cnt; j++)
{
if (DatumGetBool(FunctionCall2Coll(&f_cmpeq,
stats->attrcollid,
value, track[j].value)))
{
match = true;
break;
}
if (j < firstcount1 && track[j].count == 1)
firstcount1 = j;
}
if (match)
{
/* Found a match */
track[j].count++;
/* This value may now need to "bubble up" in the track list */
while (j > 0 && track[j].count > track[j - 1].count)
{
swapDatum(track[j].value, track[j - 1].value);
swapInt(track[j].count, track[j - 1].count);
j--;
}
}
else
{
/* No match. Insert at head of count-1 list */
if (track_cnt < track_max)
track_cnt++;
for (j = track_cnt - 1; j > firstcount1; j--)
{
track[j].value = track[j - 1].value;
track[j].count = track[j - 1].count;
}
if (firstcount1 < track_cnt)
{
track[firstcount1].value = value;
track[firstcount1].count = 1;
}
}
}
/* We can only compute real stats if we found some non-null values. */
if (nonnull_cnt > 0)
{
int nmultiple,
summultiple;
stats->stats_valid = true;
/* Do the simple null-frac and width stats */
stats->stanullfrac = (double) null_cnt / (double) samplerows;
if (is_varwidth)
stats->stawidth = total_width / (double) nonnull_cnt;
else
stats->stawidth = stats->attrtype->typlen;
/* Count the number of values we found multiple times */
summultiple = 0;
for (nmultiple = 0; nmultiple < track_cnt; nmultiple++)
{
if (track[nmultiple].count == 1)
break;
summultiple += track[nmultiple].count;
}
if (nmultiple == 0)
{
/*
* If we found no repeated non-null values, assume it's a unique
* column; but be sure to discount for any nulls we found.
*/
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
}
else if (track_cnt < track_max && toowide_cnt == 0 &&
nmultiple == track_cnt)
{
/*
* Our track list includes every value in the sample, and every
* value appeared more than once. Assume the column has just
* these values. (This case is meant to address columns with
* small, fixed sets of possible values, such as boolean or enum
* columns. If there are any values that appear just once in the
* sample, including too-wide values, we should assume that that's
* not what we're dealing with.)
*/
stats->stadistinct = track_cnt;
}
else
{
/*----------
* Estimate the number of distinct values using the estimator
* proposed by Haas and Stokes in IBM Research Report RJ 10025:
* n*d / (n - f1 + f1*n/N)
* where f1 is the number of distinct values that occurred
* exactly once in our sample of n rows (from a total of N),
* and d is the total number of distinct values in the sample.
* This is their Duj1 estimator; the other estimators they
* recommend are considerably more complex, and are numerically
* very unstable when n is much smaller than N.
*
* In this calculation, we consider only non-nulls. We used to
* include rows with null values in the n and N counts, but that
* leads to inaccurate answers in columns with many nulls, and
* it's intuitively bogus anyway considering the desired result is
* the number of distinct non-null values.
*
* We assume (not very reliably!) that all the multiply-occurring
* values are reflected in the final track[] list, and the other
* nonnull values all appeared but once. (XXX this usually
* results in a drastic overestimate of ndistinct. Can we do
* any better?)
*----------
*/
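/*
 * Worked example with hypothetical figures (not from any real table):
 * suppose n = 30000 non-null sample rows out of N = 1000000, with
 * d = 250 distinct values in the sample, of which f1 = 100 were seen
 * exactly once.  Then
 *	n*d / (n - f1 + f1*n/N) = 7500000 / (29900 + 3) ~= 250.81
 * which lies between d and N, so the clamps below leave it alone and
 * the final rounding step yields 251.
 */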
int f1 = nonnull_cnt - summultiple;
int d = f1 + nmultiple;
double n = samplerows - null_cnt;
double N = totalrows * (1.0 - stats->stanullfrac);
double stadistinct;
/* N == 0 shouldn't happen, but just in case ... */
if (N > 0)
stadistinct = (n * d) / ((n - f1) + f1 * n / N);
else
stadistinct = 0;
/* Clamp to sane range in case of roundoff error */
if (stadistinct < d)
stadistinct = d;
if (stadistinct > N)
stadistinct = N;
/* And round to integer */
stats->stadistinct = floor(stadistinct + 0.5);
}
/*
* If we estimated the number of distinct values at more than 10% of
* the total row count (a very arbitrary limit), then assume that
* stadistinct should scale with the row count rather than be a fixed
* value.
*/
if (stats->stadistinct > 0.1 * totalrows)
stats->stadistinct = -(stats->stadistinct / totalrows);
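/*
 * For instance (hypothetical figures): an estimate of 200000 distinct
 * values in a 1000000-row table exceeds the 10% cutoff and is stored as
 * -0.2, whereas an estimate of 50000 stays below it and is kept as a
 * plain count.
 */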
/*
* Decide how many values are worth storing as most-common values. If
* we are able to generate a complete MCV list (all the values in the
* sample will fit, and we think these are all the ones in the table),
* then do so. Otherwise, store only those values that are
* significantly more common than the values not in the list.
*
* Note: the first of these cases is meant to address columns with
* small, fixed sets of possible values, such as boolean or enum
* columns. If we can *completely* represent the column population by
* an MCV list that will fit into the stats target, then we should do
* so and thus provide the planner with complete information. But if
* the MCV list is not complete, it's generally worth being more
* selective, and not just filling it all the way up to the stats
* target.
*/
if (track_cnt < track_max && toowide_cnt == 0 &&
stats->stadistinct > 0 &&
track_cnt <= num_mcv)
{
/* Track list includes all values seen, and all will fit */
num_mcv = track_cnt;
}
else
{
int *mcv_counts;
/* Incomplete list; decide how many values are worth keeping */
if (num_mcv > track_cnt)
num_mcv = track_cnt;
if (num_mcv > 0)
{
mcv_counts = (int *) palloc(num_mcv * sizeof(int));
for (i = 0; i < num_mcv; i++)
mcv_counts[i] = track[i].count;
num_mcv = analyze_mcv_list(mcv_counts, num_mcv,
stats->stadistinct,
stats->stanullfrac,
samplerows, totalrows);
}
}
/* Generate MCV slot entry */
if (num_mcv > 0)
{
MemoryContext old_context;
Datum *mcv_values;
float4 *mcv_freqs;
/* Must copy the target values into anl_context */
old_context = MemoryContextSwitchTo(stats->anl_context);
mcv_values = (Datum *) palloc(num_mcv * sizeof(Datum));
mcv_freqs = (float4 *) palloc(num_mcv * sizeof(float4));
for (i = 0; i < num_mcv; i++)
{
mcv_values[i] = datumCopy(track[i].value,
stats->attrtype->typbyval,
stats->attrtype->typlen);
mcv_freqs[i] = (double) track[i].count / (double) samplerows;
}
MemoryContextSwitchTo(old_context);
stats->stakind[0] = STATISTIC_KIND_MCV;
stats->staop[0] = mystats->eqopr;
stats->stacoll[0] = stats->attrcollid;
stats->stanumbers[0] = mcv_freqs;
stats->numnumbers[0] = num_mcv;
stats->stavalues[0] = mcv_values;
stats->numvalues[0] = num_mcv;
/*
* Accept the defaults for stats->statypid and others. They have
* been set before we were called (see vacuum.h)
*/
}
}
else if (null_cnt > 0)
{
/* We found only nulls; assume the column is entirely null */
stats->stats_valid = true;
stats->stanullfrac = 1.0;
if (is_varwidth)
stats->stawidth = 0; /* "unknown" */
else
stats->stawidth = stats->attrtype->typlen;
stats->stadistinct = 0.0; /* "unknown" */
}
/* We don't need to bother cleaning up any of our temporary palloc's */
}
/*
* compute_scalar_stats() -- compute column statistics
*
* We use this when we can find "=" and "<" operators for the datatype.
*
* We determine the fraction of non-null rows, the average width, the
* most common values, the (estimated) number of distinct values, the
* distribution histogram, and the correlation of physical to logical order.
*
* The desired stats can be determined fairly easily after sorting the
* data values into order.
*/
static void
compute_scalar_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows)
{
int i;
int null_cnt = 0;
int nonnull_cnt = 0;
int toowide_cnt = 0;
double total_width = 0;
bool is_varlena = (!stats->attrtype->typbyval &&
stats->attrtype->typlen == -1);
bool is_varwidth = (!stats->attrtype->typbyval &&
stats->attrtype->typlen < 0);
double corr_xysum;
SortSupportData ssup;
ScalarItem *values;
int values_cnt = 0;
int *tupnoLink;
ScalarMCVItem *track;
int track_cnt = 0;
int num_mcv = stats->attstattarget;
int num_bins = stats->attstattarget;
StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem));
tupnoLink = (int *) palloc(samplerows * sizeof(int));
track = (ScalarMCVItem *) palloc(num_mcv * sizeof(ScalarMCVItem));
memset(&ssup, 0, sizeof(ssup));
ssup.ssup_cxt = CurrentMemoryContext;
ssup.ssup_collation = stats->attrcollid;
ssup.ssup_nulls_first = false;
/*
* For now, don't perform abbreviated key conversion, because full values
* are required for MCV slot generation. Supporting that optimization
* would necessitate teaching compare_scalars() to call a tie-breaker.
*/
ssup.abbreviate = false;
PrepareSortSupportFromOrderingOp(mystats->ltopr, &ssup);
/* Initial scan to find sortable values */
for (i = 0; i < samplerows; i++)
{
Datum value;
bool isnull;
vacuum_delay_point();
value = fetchfunc(stats, i, &isnull);
/* Check for null/nonnull */
if (isnull)
{
null_cnt++;
continue;
}
nonnull_cnt++;
/*
* If it's a variable-width field, add up widths for average width
* calculation. Note that if the value is toasted, we use the toasted
* width. We don't bother with this calculation if it's a fixed-width
* type.
*/
if (is_varlena)
{
total_width += VARSIZE_ANY(DatumGetPointer(value));
/*
* If the value is toasted, we want to detoast it just once to
* avoid repeated detoastings and resultant excess memory usage
* during the comparisons. Also, check to see if the value is
* excessively wide, and if so don't detoast at all --- just
* ignore the value.
*/
if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
{
toowide_cnt++;
continue;
}
value = PointerGetDatum(PG_DETOAST_DATUM(value));
}
else if (is_varwidth)
{
/* must be cstring */
total_width += strlen(DatumGetCString(value)) + 1;
}
/* Add it to the list to be sorted */
values[values_cnt].value = value;
values[values_cnt].tupno = values_cnt;
tupnoLink[values_cnt] = values_cnt;
values_cnt++;
}
/* We can only compute real stats if we found some sortable values. */
if (values_cnt > 0)
{
int ndistinct, /* # distinct values in sample */
nmultiple, /* # that appear multiple times */
num_hist,
dups_cnt;
int slot_idx = 0;
CompareScalarsContext cxt;
/* Sort the collected values */
cxt.ssup = &ssup;
cxt.tupnoLink = tupnoLink;
qsort_interruptible(values, values_cnt, sizeof(ScalarItem),
compare_scalars, &cxt);
/*
* Now scan the values in order, find the most common ones, and also
* accumulate ordering-correlation statistics.
*
* To determine which are most common, we first have to count the
* number of duplicates of each value. The duplicates are adjacent in
* the sorted list, so a brute-force approach is to compare successive
* datum values until we find two that are not equal. However, that
* requires N-1 invocations of the datum comparison routine, which are
* completely redundant with work that was done during the sort. (The
* sort algorithm must at some point have compared each pair of items
* that are adjacent in the sorted order; otherwise it could not know
* that it's ordered the pair correctly.) We exploit this by having
* compare_scalars remember the highest tupno index that each
* ScalarItem has been found equal to. At the end of the sort, a
* ScalarItem's tupnoLink will still point to itself if and only if it
* is the last item of its group of duplicates (since the group will
* be ordered by tupno).
*/
corr_xysum = 0;
ndistinct = 0;
nmultiple = 0;
dups_cnt = 0;
for (i = 0; i < values_cnt; i++)
{
int tupno = values[i].tupno;
corr_xysum += ((double) i) * ((double) tupno);
dups_cnt++;
if (tupnoLink[tupno] == tupno)
{
/* Reached end of duplicates of this value */
ndistinct++;
if (dups_cnt > 1)
{
nmultiple++;
if (track_cnt < num_mcv ||
dups_cnt > track[track_cnt - 1].count)
{
/*
* Found a new item for the mcv list; find its
* position, bubbling down old items if needed. Loop
* invariant is that j points at an empty/replaceable
* slot.
*/
int j;
if (track_cnt < num_mcv)
track_cnt++;
for (j = track_cnt - 1; j > 0; j--)
{
if (dups_cnt <= track[j - 1].count)
break;
track[j].count = track[j - 1].count;
track[j].first = track[j - 1].first;
}
track[j].count = dups_cnt;
track[j].first = i + 1 - dups_cnt;
}
}
dups_cnt = 0;
}
}
stats->stats_valid = true;
/* Do the simple null-frac and width stats */
stats->stanullfrac = (double) null_cnt / (double) samplerows;
if (is_varwidth)
stats->stawidth = total_width / (double) nonnull_cnt;
else
stats->stawidth = stats->attrtype->typlen;
if (nmultiple == 0)
{
/*
* If we found no repeated non-null values, assume it's a unique
* column; but be sure to discount for any nulls we found.
*/
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
}
else if (toowide_cnt == 0 && nmultiple == ndistinct)
{
/*
* Every value in the sample appeared more than once. Assume the
* column has just these values. (This case is meant to address
* columns with small, fixed sets of possible values, such as
* boolean or enum columns. If there are any values that appear
* just once in the sample, including too-wide values, we should
* assume that that's not what we're dealing with.)
*/
stats->stadistinct = ndistinct;
}
else
{
/*----------
* Estimate the number of distinct values using the estimator
* proposed by Haas and Stokes in IBM Research Report RJ 10025:
* n*d / (n - f1 + f1*n/N)
* where f1 is the number of distinct values that occurred
* exactly once in our sample of n rows (from a total of N),
* and d is the total number of distinct values in the sample.
* This is their Duj1 estimator; the other estimators they
* recommend are considerably more complex, and are numerically
* very unstable when n is much smaller than N.
*
* In this calculation, we consider only non-nulls. We used to
* include rows with null values in the n and N counts, but that
* leads to inaccurate answers in columns with many nulls, and
* it's intuitively bogus anyway considering the desired result is
* the number of distinct non-null values.
*
* Overwidth values are assumed to have been distinct.
*----------
*/
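/*----------
* Illustrative arithmetic only (the numbers are made up, not taken from
* any real table): with n = 1000 sampled non-null rows, N = 100000
* non-null rows in the table, d = 200 distinct values in the sample,
* and f1 = 100 of them seen exactly once, the formula gives
* 1000*200 / (1000 - 100 + 100*1000/100000)
* = 200000 / 901
* = about 221.98
* which lies between d and N, so the clamps applied afterwards leave it
* alone and it rounds to 222.
*----------
*/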
int f1 = ndistinct - nmultiple + toowide_cnt;
int d = f1 + nmultiple;
double n = samplerows - null_cnt;
double N = totalrows * (1.0 - stats->stanullfrac);
double stadistinct;
/* N == 0 shouldn't happen, but just in case ... */
if (N > 0)
stadistinct = (n * d) / ((n - f1) + f1 * n / N);
else
stadistinct = 0;
/* Clamp to sane range in case of roundoff error */
if (stadistinct < d)
stadistinct = d;
if (stadistinct > N)
stadistinct = N;
/* And round to integer */
stats->stadistinct = floor(stadistinct + 0.5);
}
/*
* If we estimated the number of distinct values at more than 10% of
* the total row count (a very arbitrary limit), then assume that
* stadistinct should scale with the row count rather than be a fixed
* value.
*/
if (stats->stadistinct > 0.1 * totalrows)
stats->stadistinct = -(stats->stadistinct / totalrows);
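/*
* For instance (made-up numbers): with totalrows = 1000 and an estimate
* of 250 distinct values, 250 exceeds 0.1 * 1000 = 100, so the value is
* stored as -(250 / 1000) = -0.25, meaning "25% of the row count". An
* estimate of 80 would stay below the limit and be stored unchanged
* as 80.
*/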
/*
* Decide how many values are worth storing as most-common values. If
* we are able to generate a complete MCV list (all the values in the
* sample will fit, and we think these are all the ones in the table),
* then do so. Otherwise, store only those values that are
* significantly more common than the values not in the list.
*
* Note: the first of these cases is meant to address columns with
* small, fixed sets of possible values, such as boolean or enum
* columns. If we can *completely* represent the column population by
* an MCV list that will fit into the stats target, then we should do
* so and thus provide the planner with complete information. But if
* the MCV list is not complete, it's generally worth being more
* selective, and not just filling it all the way up to the stats
* target.
*/
if (track_cnt == ndistinct && toowide_cnt == 0 &&
stats->stadistinct > 0 &&
track_cnt <= num_mcv)
{
/* Track list includes all values seen, and all will fit */
num_mcv = track_cnt;
}
else
{
int *mcv_counts;
/* Incomplete list; decide how many values are worth keeping */
if (num_mcv > track_cnt)
num_mcv = track_cnt;
if (num_mcv > 0)
{
mcv_counts = (int *) palloc(num_mcv * sizeof(int));
for (i = 0; i < num_mcv; i++)
mcv_counts[i] = track[i].count;
num_mcv = analyze_mcv_list(mcv_counts, num_mcv,
stats->stadistinct,
stats->stanullfrac,
samplerows, totalrows);
}
}
/* Generate MCV slot entry */
if (num_mcv > 0)
{
MemoryContext old_context;
Datum *mcv_values;
float4 *mcv_freqs;
/* Must copy the target values into anl_context */
old_context = MemoryContextSwitchTo(stats->anl_context);
mcv_values = (Datum *) palloc(num_mcv * sizeof(Datum));
mcv_freqs = (float4 *) palloc(num_mcv * sizeof(float4));
for (i = 0; i < num_mcv; i++)
{
mcv_values[i] = datumCopy(values[track[i].first].value,
stats->attrtype->typbyval,
stats->attrtype->typlen);
mcv_freqs[i] = (double) track[i].count / (double) samplerows;
}
MemoryContextSwitchTo(old_context);
stats->stakind[slot_idx] = STATISTIC_KIND_MCV;
stats->staop[slot_idx] = mystats->eqopr;
stats->stacoll[slot_idx] = stats->attrcollid;
stats->stanumbers[slot_idx] = mcv_freqs;
stats->numnumbers[slot_idx] = num_mcv;
stats->stavalues[slot_idx] = mcv_values;
stats->numvalues[slot_idx] = num_mcv;
/*
* Accept the defaults for stats->statypid and others. They have
* been set before we were called (see vacuum.h)
*/
slot_idx++;
}
/*
* Generate a histogram slot entry if there are at least two distinct
* values not accounted for in the MCV list. (This ensures the
* histogram won't collapse to empty or a singleton.)
*/
num_hist = ndistinct - num_mcv;
if (num_hist > num_bins)
num_hist = num_bins + 1;
if (num_hist >= 2)
{
MemoryContext old_context;
Datum *hist_values;
int nvals;
int pos,
posfrac,
delta,
deltafrac;
/* Sort the MCV items into position order to speed next loop */
qsort_interruptible(track, num_mcv, sizeof(ScalarMCVItem),
compare_mcvs, NULL);
/*
* Collapse out the MCV items from the values[] array.
*
* Note we destroy the values[] array here... but we don't need it
* for anything more. We do, however, still need values_cnt.
* nvals will be the number of remaining entries in values[].
*/
if (num_mcv > 0)
{
int src,
dest;
int j;
src = dest = 0;
j = 0; /* index of next interesting MCV item */
while (src < values_cnt)
{
int ncopy;
if (j < num_mcv)
{
int first = track[j].first;
if (src >= first)
{
/* advance past this MCV item */
src = first + track[j].count;
j++;
continue;
}
ncopy = first - src;
}
else
ncopy = values_cnt - src;
memmove(&values[dest], &values[src],
ncopy * sizeof(ScalarItem));
src += ncopy;
dest += ncopy;
}
nvals = dest;
}
else
nvals = values_cnt;
Assert(nvals >= num_hist);
/* Must copy the target values into anl_context */
old_context = MemoryContextSwitchTo(stats->anl_context);
hist_values = (Datum *) palloc(num_hist * sizeof(Datum));
/*
* The object of this loop is to copy the first and last values[]
* entries along with evenly-spaced values in between. So the
* i'th value is values[(i * (nvals - 1)) / (num_hist - 1)]. But
* computing that subscript directly risks integer overflow when
* the stats target is more than a couple thousand. Instead we
* add (nvals - 1) / (num_hist - 1) to pos at each step, tracking
* the integral and fractional parts of the sum separately.
*/
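/*----------
* Illustrative trace only (made-up sizes): with nvals = 11 and
* num_hist = 4 we get delta = 10 / 3 = 3 and deltafrac = 10 % 3 = 1.
* The loop then copies
* i = 0: values[0], then pos = 3, posfrac = 1
* i = 1: values[3], then pos = 6, posfrac = 2
* i = 2: values[6], then pos = 9, posfrac = 3, carry: pos = 10, posfrac = 0
* i = 3: values[10]
* which matches the direct subscripts (i * 10) / 3 = 0, 3, 6, 10 without
* ever forming the product i * (nvals - 1).
*----------
*/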
delta = (nvals - 1) / (num_hist - 1);
deltafrac = (nvals - 1) % (num_hist - 1);
pos = posfrac = 0;
for (i = 0; i < num_hist; i++)
{
hist_values[i] = datumCopy(values[pos].value,
stats->attrtype->typbyval,
stats->attrtype->typlen);
pos += delta;
posfrac += deltafrac;
if (posfrac >= (num_hist - 1))
{
/* fractional part exceeds 1, carry to integer part */
pos++;
posfrac -= (num_hist - 1);
}
}
MemoryContextSwitchTo(old_context);
stats->stakind[slot_idx] = STATISTIC_KIND_HISTOGRAM;
stats->staop[slot_idx] = mystats->ltopr;
stats->stacoll[slot_idx] = stats->attrcollid;
stats->stavalues[slot_idx] = hist_values;
stats->numvalues[slot_idx] = num_hist;
/*
* Accept the defaults for stats->statypid and others. They have
* been set before we were called (see vacuum.h)
*/
slot_idx++;
}
/* Generate a correlation entry if there are multiple values */
if (values_cnt > 1)
{
MemoryContext old_context;
float4 *corrs;
double corr_xsum,
corr_x2sum;
/* Must copy the target values into anl_context */
old_context = MemoryContextSwitchTo(stats->anl_context);
corrs = (float4 *) palloc(sizeof(float4));
MemoryContextSwitchTo(old_context);
/*----------
* Since we know the x and y value sets are both
* 0, 1, ..., values_cnt-1
* we have sum(x) = sum(y) =
* (values_cnt-1)*values_cnt / 2
* and sum(x^2) = sum(y^2) =
* (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
*----------
*/
corr_xsum = ((double) (values_cnt - 1)) *
((double) values_cnt) / 2.0;
corr_x2sum = ((double) (values_cnt - 1)) *
((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
/* And the correlation coefficient reduces to */
corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
(values_cnt * corr_x2sum - corr_xsum * corr_xsum);
stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
stats->staop[slot_idx] = mystats->ltopr;
stats->stacoll[slot_idx] = stats->attrcollid;
stats->stanumbers[slot_idx] = corrs;
stats->numnumbers[slot_idx] = 1;
slot_idx++;
}
}
else if (nonnull_cnt > 0)
{
/* We found some non-null values, but they were all too wide */
Assert(nonnull_cnt == toowide_cnt);
stats->stats_valid = true;
/* Do the simple null-frac and width stats */
stats->stanullfrac = (double) null_cnt / (double) samplerows;
if (is_varwidth)
stats->stawidth = total_width / (double) nonnull_cnt;
else
stats->stawidth = stats->attrtype->typlen;
/* Assume all too-wide values are distinct, so it's a unique column */
Fix misestimation of n_distinct for a nearly-unique column with many nulls. If ANALYZE found no repeated non-null entries in its sample, it set the column's stadistinct value to -1.0, intending to indicate that the entries are all distinct. But what this value actually means is that the number of distinct values is 100% of the table's rowcount, and thus it was overestimating the number of distinct values by however many nulls there are. This could lead to very poor selectivity estimates, as for example in a recent report from Andreas Joseph Krogh. We should discount the stadistinct value by whatever we've estimated the nulls fraction to be. (That is what will happen if we choose to use a negative stadistinct for a column that does have repeated entries, so this code path was just inconsistent.) In addition to fixing the stadistinct entries stored by several different ANALYZE code paths, adjust the logic where get_variable_numdistinct() forces an "all distinct" estimate on the basis of finding a relevant unique index. Unique indexes don't reject nulls, so there's no reason to assume that the null fraction doesn't apply. Back-patch to all supported branches. Back-patching is a bit of a judgment call, but this problem seems to affect only a few users (else we'd have identified it long ago), and it's bad enough when it does happen that destabilizing plan choices in a worse direction seems unlikely. Patch by me, with documentation wording suggested by Dean Rasheed Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena> Discussion: <16143.1470350371@sss.pgh.pa.us>
2016-08-08 00:52:02 +02:00
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
}
else if (null_cnt > 0)
{
/* We found only nulls; assume the column is entirely null */
stats->stats_valid = true;
stats->stanullfrac = 1.0;
if (is_varwidth)
stats->stawidth = 0; /* "unknown" */
else
stats->stawidth = stats->attrtype->typlen;
stats->stadistinct = 0.0; /* "unknown" */
}
/* We don't need to bother cleaning up any of our temporary palloc's */
}
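/*
 * Illustrative sketch, not part of analyze.c: the closed-form correlation
 * used above can be tried in isolation. It assumes, as the comment in
 * compute_scalar_stats() states, that both value sets are 0 .. n-1, and
 * that corr_xysum is the sum of (sorted position * physical tuple number).
 * The name rank_correlation() and its parameters are hypothetical.
 */

```c
#include <assert.h>
#include <math.h>

/*
 * tupno[i] is the physical position of the item that ended up at sorted
 * position i; tupno[] must be a permutation of 0 .. n-1.
 */
static double
rank_correlation(const int *tupno, int n)
{
    double      xysum = 0.0;
    double      xsum;
    double      x2sum;
    int         i;

    /* With fewer than two values the denominator would be zero */
    assert(n > 1);

    for (i = 0; i < n; i++)
        xysum += ((double) i) * ((double) tupno[i]);

    /* sum(x) = sum(y) = (n-1)*n / 2 */
    xsum = ((double) (n - 1)) * ((double) n) / 2.0;
    /* sum(x^2) = sum(y^2) = (n-1)*n*(2n-1) / 6 */
    x2sum = ((double) (n - 1)) * ((double) n) * (double) (2 * n - 1) / 6.0;

    /* Pearson correlation with identical x and y marginals */
    return (n * xysum - xsum * xsum) / (n * x2sum - xsum * xsum);
}
```

/*
 * A table whose physical order matches the sort order yields 1.0, and a
 * fully reversed one yields -1.0.
 */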
/*
* Comparator for sorting ScalarItems
*
* Aside from sorting the items, we update the tupnoLink[] array
* whenever two ScalarItems are found to contain equal datums. The array
* is indexed by tupno; for each ScalarItem, it contains the highest
* tupno that that item's datum has been found to be equal to. This allows
* us to avoid additional comparisons in compute_scalar_stats().
*/
static int
compare_scalars(const void *a, const void *b, void *arg)
{
Datum da = ((const ScalarItem *) a)->value;
int ta = ((const ScalarItem *) a)->tupno;
Datum db = ((const ScalarItem *) b)->value;
int tb = ((const ScalarItem *) b)->tupno;
CompareScalarsContext *cxt = (CompareScalarsContext *) arg;
int compare;
compare = ApplySortComparator(da, false, db, false, cxt->ssup);
if (compare != 0)
return compare;
/*
* The two datums are equal, so update cxt->tupnoLink[].
*/
if (cxt->tupnoLink[ta] < tb)
cxt->tupnoLink[ta] = tb;
if (cxt->tupnoLink[tb] < ta)
cxt->tupnoLink[tb] = ta;
/*
* For equal datums, sort by tupno
*/
return ta - tb;
}
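/*
 * Illustrative sketch, not part of analyze.c: the tupnoLink[] bookkeeping
 * can be tried with the standard qsort(). Everything named Demo* or demo_*
 * is hypothetical. Plain ints stand in for Datums, a file-scope pointer
 * stands in for the CompareScalarsContext argument, and the sketch assumes
 * each tupnoLink[] entry starts out equal to its own tupno.
 */

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    int         value;          /* stand-in for a Datum */
    int         tupno;          /* position in the original sample */
} DemoItem;

/* Standard qsort() has no context argument, so use a file-scope pointer */
static int *demo_tupnoLink;

static int
demo_compare(const void *a, const void *b)
{
    const DemoItem *ia = (const DemoItem *) a;
    const DemoItem *ib = (const DemoItem *) b;
    int         ta = ia->tupno;
    int         tb = ib->tupno;

    if (ia->value != ib->value)
        return (ia->value < ib->value) ? -1 : 1;

    /* Equal values: remember the highest equal tupno seen for each side */
    if (demo_tupnoLink[ta] < tb)
        demo_tupnoLink[ta] = tb;
    if (demo_tupnoLink[tb] < ta)
        demo_tupnoLink[tb] = ta;

    /* Break ties by tupno so that the ordering is total */
    return ta - tb;
}

static void
demo_sort(DemoItem *items, int *tupnoLink, int n)
{
    int         i;

    assert(n > 0);

    /* Initially every item is linked only to itself */
    for (i = 0; i < n; i++)
        tupnoLink[items[i].tupno] = items[i].tupno;

    demo_tupnoLink = tupnoLink;
    qsort(items, (size_t) n, sizeof(DemoItem), demo_compare);
}
```

/*
 * After sorting, tupnoLink[t] == t holds only for the last item of each run
 * of equal values. A correct comparison sort must compare every pair that
 * ends up adjacent, so every other member of a run has been linked to some
 * higher tupno.
 */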
/*
* Comparator for sorting ScalarMCVItems by position
*/
static int
compare_mcvs(const void *a, const void *b, void *arg)
{
int da = ((const ScalarMCVItem *) a)->first;
int db = ((const ScalarMCVItem *) b)->first;
return da - db;
}
Improve ANALYZE's strategy for finding MCVs. Previously, a value was included in the MCV list if its frequency was 25% larger than the estimated average frequency of all nonnull values in the table. For uniform distributions, that can lead to values being included in the MCV list and significantly overestimated on the basis of relatively few (sometimes just 2) instances being seen in the sample. For non-uniform distributions, it can lead to too few values being included in the MCV list, since the overall average frequency may be dominated by a small number of very common values, while the remaining values may still have a large spread of frequencies, causing both substantial overestimation and underestimation of the remaining values. Furthermore, increasing the statistics target may have little effect because the overall average frequency will remain relatively unchanged. Instead, populate the MCV list with the largest set of common values that are statistically significantly more common than the average frequency of the remaining values. This takes into account the variance of the sample counts, which depends on the counts themselves and on the proportion of the table that was sampled. As a result, it constrains the relative standard error of estimates based on the frequencies of values in the list, reducing the chances of too many values being included. At the same time, it allows more values to be included, since the MCVs need only be more common than the remaining non-MCVs, rather than the overall average. Thus it tends to produce fewer MCVs than the previous code for uniform distributions, and more for non-uniform distributions, reducing estimation errors in both cases. In addition, the algorithm responds better to increasing the statistics target, allowing more values to be included in the MCV list when more of the table is sampled. Jeff Janes, substantially modified by me. Reviewed by John Naylor and Tomas Vondra. 
Discussion: https://postgr.es/m/CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com
2018-03-22 10:37:36 +01:00
/*
* Analyze the list of common values in the sample and decide how many are
* worth storing in the table's MCV list.
*
* mcv_counts is assumed to be a list of the counts of the most common values
* seen in the sample, starting with the most common. The return value is the
* number that are significantly more common than the values not in the list,
* and which are therefore deemed worth storing in the table's MCV list.
*/
static int
analyze_mcv_list(int *mcv_counts,
int num_mcv,
double stadistinct,
double stanullfrac,
int samplerows,
double totalrows)
{
double ndistinct_table;
double sumcount;
int i;
/*
* If the entire table was sampled, keep the whole list. This also
* protects us against division by zero in the code below.
*/
if (samplerows == totalrows || totalrows <= 1.0)
return num_mcv;
/* Re-extract the estimated number of distinct nonnull values in table */
ndistinct_table = stadistinct;
if (ndistinct_table < 0)
ndistinct_table = -ndistinct_table * totalrows;
/*
* Exclude the least common values from the MCV list, if they are not
* significantly more common than the estimated selectivity they would
* have if they weren't in the list. All non-MCV values are assumed to be
* equally common, after taking into account the frequencies of all the
* values in the MCV list and the number of nulls (cf. eqsel()).
*
* Here sumcount tracks the total count of all but the last (least common)
* value in the MCV list, allowing us to determine the effect of excluding
* that value from the list.
*
* Note that we deliberately do this by removing values from the full
* list, rather than starting with an empty list and adding values,
* because the latter approach can fail to add any values if all the most
* common values have around the same frequency and make up the majority
* of the table, so that the overall average frequency of all values is
* roughly the same as that of the common values. This would lead to any
* uncommon values being significantly overestimated.
*/
sumcount = 0.0;
for (i = 0; i < num_mcv - 1; i++)
sumcount += mcv_counts[i];
while (num_mcv > 0)
{
double selec,
otherdistinct,
N,
n,
K,
variance,
stddev;
/*
* Estimated selectivity the least common value would have if it
* wasn't in the MCV list (cf. eqsel()).
*/
selec = 1.0 - sumcount / samplerows - stanullfrac;
if (selec < 0.0)
selec = 0.0;
if (selec > 1.0)
selec = 1.0;
otherdistinct = ndistinct_table - (num_mcv - 1);
if (otherdistinct > 1)
selec /= otherdistinct;
/*
* If the value is kept in the MCV list, its population frequency is
* assumed to equal its sample frequency. We use the lower end of a
* textbook continuity-corrected Wald-type confidence interval to
* determine if that is significantly more common than the non-MCV
* frequency --- specifically we assume the population frequency is
* highly likely to be within around 2 standard errors of the sample
* frequency, which equates to an interval of 2 standard deviations
* either side of the sample count, plus an additional 0.5 for the
* continuity correction. Since we are sampling without replacement,
* this is a hypergeometric distribution.
*
* XXX: Empirically, this approach seems to work quite well, but it
* may be worth considering more advanced techniques for estimating
* the confidence interval of the hypergeometric distribution.
*/
N = totalrows;
n = samplerows;
K = N * mcv_counts[num_mcv - 1] / n;
variance = n * K * (N - K) * (N - n) / (N * N * (N - 1));
stddev = sqrt(variance);
if (mcv_counts[num_mcv - 1] > selec * samplerows + 2 * stddev + 0.5)
{
/*
* The value is significantly more common than the non-MCV
* selectivity would suggest. Keep it, and all the other more
* common values in the list.
*/
break;
}
else
{
/* Discard this value and consider the next least common value */
num_mcv--;
if (num_mcv == 0)
break;
sumcount -= mcv_counts[num_mcv - 1];
}
}
return num_mcv;
}