Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
|
|
|
* mvdistinct.c
|
|
|
|
* POSTGRES multivariate ndistinct coefficients
|
|
|
|
*
|
|
|
|
* Estimating number of groups in a combination of columns (e.g. for GROUP BY)
|
|
|
|
* is tricky, and the estimation error is often significant.
|
|
|
|
|
|
|
|
* The multivariate ndistinct coefficients address this by storing ndistinct
|
|
|
|
* estimates for combinations of the user-specified columns. So for example
|
|
|
|
* given a statistics object on three columns (a,b,c), this module estimates
|
2017-05-14 16:54:47 +02:00
|
|
|
* and stores n-distinct for (a,b), (a,c), (b,c) and (a,b,c). The per-column
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
* estimates are already available in pg_statistic.
|
|
|
|
*
|
|
|
|
*
|
2019-01-02 18:44:25 +01:00
|
|
|
* Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
|
|
|
* src/backend/statistics/mvdistinct.c
|
2017-05-14 16:54:47 +02:00
|
|
|
*
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
*-------------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
#include "postgres.h"
|
|
|
|
|
|
|
|
#include <math.h>
|
|
|
|
|
|
|
|
#include "access/htup_details.h"
|
|
|
|
#include "catalog/pg_statistic_ext.h"
|
Rework the pg_statistic_ext catalog
Since extended statistic got introduced in PostgreSQL 10, there was a
single catalog pg_statistic_ext storing both the definitions and built
statistic. That's however problematic when a user is supposed to have
access only to the definitions, but not to user data.
Consider for example pg_dump on a database with RLS enabled - if the
pg_statistic_ext catalog respects RLS (which it should, if it contains
user data), pg_dump would not see any records and the result would not
define any extended statistics. That would be a surprising behavior.
Until now this was not a pressing issue, because the existing types of
extended statistic (functional dependencies and ndistinct coefficients)
do not include any user data directly. This changed with introduction
of MCV lists, which do include most common combinations of values.
The easiest way to fix this is to split the pg_statistic_ext catalog
into two - one for definitions, one for the built statistic values.
The new catalog is called pg_statistic_ext_data, and we're maintaining
a 1:1 relationship with the old catalog - either there are matching
records in both catalogs, or neither of them.
Bumped CATVERSION due to changing system catalog definitions.
Author: Dean Rasheed, with improvements by me
Reviewed-by: Dean Rasheed, John Naylor
Discussion: https://postgr.es/m/CAEZATCUhT9rt7Ui%3DVdx4N%3D%3DVV5XOK5dsXfnGgVOz_JhAicB%3DZA%40mail.gmail.com
2019-06-13 17:19:21 +02:00
|
|
|
#include "catalog/pg_statistic_ext_data.h"
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
#include "utils/fmgrprotos.h"
|
|
|
|
#include "utils/lsyscache.h"
|
|
|
|
#include "lib/stringinfo.h"
|
|
|
|
#include "utils/syscache.h"
|
|
|
|
#include "utils/typcache.h"
|
|
|
|
#include "statistics/extended_stats_internal.h"
|
|
|
|
#include "statistics/statistics.h"
|
|
|
|
|
|
|
|
|
|
|
|
static double ndistinct_for_combination(double totalrows, int numrows,
|
2019-05-22 19:04:48 +02:00
|
|
|
HeapTuple *rows, VacAttrStats **stats,
|
|
|
|
int k, int *combination);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
static double estimate_ndistinct(double totalrows, int numrows, int d, int f1);
|
|
|
|
static int n_choose_k(int n, int k);
|
|
|
|
static int num_combinations(int n);
|
|
|
|
|
2019-04-21 19:54:15 +02:00
|
|
|
/* size of the struct header fields (magic, type, nitems) */
|
|
|
|
#define SizeOfHeader (3 * sizeof(uint32))
|
|
|
|
|
|
|
|
/* size of a serialized ndistinct item (coefficient, natts, atts) */
|
|
|
|
#define SizeOfItem(natts) \
|
|
|
|
(sizeof(double) + sizeof(int) + (natts) * sizeof(AttrNumber))
|
|
|
|
|
|
|
|
/* minimal size of a ndistinct item (with two attributes) */
|
|
|
|
#define MinSizeOfItem SizeOfItem(2)
|
|
|
|
|
|
|
|
/* minimal size of mvndistinct, when all items are minimal */
|
|
|
|
#define MinSizeOfItems(nitems) \
|
|
|
|
(SizeOfHeader + (nitems) * MinSizeOfItem)
|
|
|
|
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
/* Combination generator API */
|
|
|
|
|
|
|
|
/* internal state for generator of k-combinations of n elements */
|
|
|
|
typedef struct CombinationGenerator
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
int k; /* size of the combination */
|
|
|
|
int n; /* total number of elements */
|
|
|
|
int current; /* index of the next combination to return */
|
|
|
|
int ncombinations; /* number of combinations (size of array) */
|
|
|
|
int *combinations; /* array of pre-built combinations */
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
} CombinationGenerator;
|
|
|
|
|
|
|
|
static CombinationGenerator *generator_init(int n, int k);
|
|
|
|
static void generator_free(CombinationGenerator *state);
|
|
|
|
static int *generator_next(CombinationGenerator *state);
|
|
|
|
static void generate_combinations(CombinationGenerator *state);
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* statext_ndistinct_build
|
|
|
|
* Compute ndistinct coefficient for the combination of attributes.
|
|
|
|
*
|
|
|
|
* This computes the ndistinct estimate using the same estimator used
|
|
|
|
* in analyze.c and then computes the coefficient.
|
|
|
|
*/
|
|
|
|
MVNDistinct *
|
|
|
|
statext_ndistinct_build(double totalrows, int numrows, HeapTuple *rows,
|
|
|
|
Bitmapset *attrs, VacAttrStats **stats)
|
|
|
|
{
|
|
|
|
MVNDistinct *result;
|
|
|
|
int k;
|
|
|
|
int itemcnt;
|
|
|
|
int numattrs = bms_num_members(attrs);
|
|
|
|
int numcombs = num_combinations(numattrs);
|
|
|
|
|
|
|
|
result = palloc(offsetof(MVNDistinct, items) +
|
|
|
|
numcombs * sizeof(MVNDistinctItem));
|
|
|
|
result->magic = STATS_NDISTINCT_MAGIC;
|
|
|
|
result->type = STATS_NDISTINCT_TYPE_BASIC;
|
|
|
|
result->nitems = numcombs;
|
|
|
|
|
|
|
|
itemcnt = 0;
|
|
|
|
for (k = 2; k <= numattrs; k++)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
int *combination;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
CombinationGenerator *generator;
|
|
|
|
|
|
|
|
/* generate combinations of K out of N elements */
|
|
|
|
generator = generator_init(numattrs, k);
|
|
|
|
|
|
|
|
while ((combination = generator_next(generator)))
|
|
|
|
{
|
|
|
|
MVNDistinctItem *item = &result->items[itemcnt];
|
2017-05-17 22:31:56 +02:00
|
|
|
int j;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
item->attrs = NULL;
|
|
|
|
for (j = 0; j < k; j++)
|
|
|
|
item->attrs = bms_add_member(item->attrs,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
stats[combination[j]]->attr->attnum);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
item->ndistinct =
|
|
|
|
ndistinct_for_combination(totalrows, numrows, rows,
|
|
|
|
stats, k, combination);
|
|
|
|
|
|
|
|
itemcnt++;
|
|
|
|
Assert(itemcnt <= result->nitems);
|
|
|
|
}
|
|
|
|
|
|
|
|
generator_free(generator);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* must consume exactly the whole output array */
|
|
|
|
Assert(itemcnt == result->nitems);
|
|
|
|
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* statext_ndistinct_load
|
|
|
|
* Load the ndistinct value for the indicated pg_statistic_ext tuple
|
|
|
|
*/
|
|
|
|
MVNDistinct *
|
|
|
|
statext_ndistinct_load(Oid mvoid)
|
|
|
|
{
|
2018-05-02 18:22:48 +02:00
|
|
|
MVNDistinct *result;
|
|
|
|
bool isnull;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
Datum ndist;
|
|
|
|
HeapTuple htup;
|
|
|
|
|
Rework the pg_statistic_ext catalog
Since extended statistic got introduced in PostgreSQL 10, there was a
single catalog pg_statistic_ext storing both the definitions and built
statistic. That's however problematic when a user is supposed to have
access only to the definitions, but not to user data.
Consider for example pg_dump on a database with RLS enabled - if the
pg_statistic_ext catalog respects RLS (which it should, if it contains
user data), pg_dump would not see any records and the result would not
define any extended statistics. That would be a surprising behavior.
Until now this was not a pressing issue, because the existing types of
extended statistic (functional dependencies and ndistinct coefficients)
do not include any user data directly. This changed with introduction
of MCV lists, which do include most common combinations of values.
The easiest way to fix this is to split the pg_statistic_ext catalog
into two - one for definitions, one for the built statistic values.
The new catalog is called pg_statistic_ext_data, and we're maintaining
a 1:1 relationship with the old catalog - either there are matching
records in both catalogs, or neither of them.
Bumped CATVERSION due to changing system catalog definitions.
Author: Dean Rasheed, with improvements by me
Reviewed-by: Dean Rasheed, John Naylor
Discussion: https://postgr.es/m/CAEZATCUhT9rt7Ui%3DVdx4N%3D%3DVV5XOK5dsXfnGgVOz_JhAicB%3DZA%40mail.gmail.com
2019-06-13 17:19:21 +02:00
|
|
|
htup = SearchSysCache1(STATEXTDATASTXOID, ObjectIdGetDatum(mvoid));
|
2018-05-02 18:22:48 +02:00
|
|
|
if (!HeapTupleIsValid(htup))
|
2017-05-14 16:54:47 +02:00
|
|
|
elog(ERROR, "cache lookup failed for statistics object %u", mvoid);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
Rework the pg_statistic_ext catalog
Since extended statistic got introduced in PostgreSQL 10, there was a
single catalog pg_statistic_ext storing both the definitions and built
statistic. That's however problematic when a user is supposed to have
access only to the definitions, but not to user data.
Consider for example pg_dump on a database with RLS enabled - if the
pg_statistic_ext catalog respects RLS (which it should, if it contains
user data), pg_dump would not see any records and the result would not
define any extended statistics. That would be a surprising behavior.
Until now this was not a pressing issue, because the existing types of
extended statistic (functional dependencies and ndistinct coefficients)
do not include any user data directly. This changed with introduction
of MCV lists, which do include most common combinations of values.
The easiest way to fix this is to split the pg_statistic_ext catalog
into two - one for definitions, one for the built statistic values.
The new catalog is called pg_statistic_ext_data, and we're maintaining
a 1:1 relationship with the old catalog - either there are matching
records in both catalogs, or neither of them.
Bumped CATVERSION due to changing system catalog definitions.
Author: Dean Rasheed, with improvements by me
Reviewed-by: Dean Rasheed, John Naylor
Discussion: https://postgr.es/m/CAEZATCUhT9rt7Ui%3DVdx4N%3D%3DVV5XOK5dsXfnGgVOz_JhAicB%3DZA%40mail.gmail.com
2019-06-13 17:19:21 +02:00
|
|
|
ndist = SysCacheGetAttr(STATEXTDATASTXOID, htup,
|
|
|
|
Anum_pg_statistic_ext_data_stxdndistinct, &isnull);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
if (isnull)
|
|
|
|
elog(ERROR,
|
2018-05-02 18:22:48 +02:00
|
|
|
"requested statistic kind \"%c\" is not yet built for statistics object %u",
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
STATS_EXT_NDISTINCT, mvoid);
|
|
|
|
|
2018-05-02 18:22:48 +02:00
|
|
|
result = statext_ndistinct_deserialize(DatumGetByteaPP(ndist));
|
|
|
|
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
ReleaseSysCache(htup);
|
|
|
|
|
2018-05-02 18:22:48 +02:00
|
|
|
return result;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* statext_ndistinct_serialize
|
|
|
|
* serialize ndistinct to the on-disk bytea format
|
|
|
|
*/
|
|
|
|
bytea *
|
|
|
|
statext_ndistinct_serialize(MVNDistinct *ndistinct)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
bytea *output;
|
|
|
|
char *tmp;
|
|
|
|
Size len;
|
|
|
|
|
|
|
|
Assert(ndistinct->magic == STATS_NDISTINCT_MAGIC);
|
|
|
|
Assert(ndistinct->type == STATS_NDISTINCT_TYPE_BASIC);
|
|
|
|
|
|
|
|
/*
|
2017-03-27 19:52:19 +02:00
|
|
|
* Base size is size of scalar fields in the struct, plus one base struct
|
|
|
|
* for each item, including number of items for each.
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
*/
|
2019-04-21 19:54:15 +02:00
|
|
|
len = VARHDRSZ + SizeOfHeader;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
/* and also include space for the actual attribute numbers */
|
|
|
|
for (i = 0; i < ndistinct->nitems; i++)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
int nmembers;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
nmembers = bms_num_members(ndistinct->items[i].attrs);
|
|
|
|
Assert(nmembers >= 2);
|
2019-04-21 19:54:15 +02:00
|
|
|
|
|
|
|
len += SizeOfItem(nmembers);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
output = (bytea *) palloc(len);
|
|
|
|
SET_VARSIZE(output, len);
|
|
|
|
|
|
|
|
tmp = VARDATA(output);
|
|
|
|
|
2017-03-27 19:52:19 +02:00
|
|
|
/* Store the base struct values (magic, type, nitems) */
|
|
|
|
memcpy(tmp, &ndistinct->magic, sizeof(uint32));
|
|
|
|
tmp += sizeof(uint32);
|
|
|
|
memcpy(tmp, &ndistinct->type, sizeof(uint32));
|
|
|
|
tmp += sizeof(uint32);
|
|
|
|
memcpy(tmp, &ndistinct->nitems, sizeof(uint32));
|
|
|
|
tmp += sizeof(uint32);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
/*
|
2019-04-21 19:54:15 +02:00
|
|
|
* store number of attributes and attribute numbers for each entry
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
*/
|
|
|
|
for (i = 0; i < ndistinct->nitems; i++)
|
|
|
|
{
|
|
|
|
MVNDistinctItem item = ndistinct->items[i];
|
2017-05-17 22:31:56 +02:00
|
|
|
int nmembers = bms_num_members(item.attrs);
|
|
|
|
int x;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
memcpy(tmp, &item.ndistinct, sizeof(double));
|
|
|
|
tmp += sizeof(double);
|
|
|
|
memcpy(tmp, &nmembers, sizeof(int));
|
|
|
|
tmp += sizeof(int);
|
|
|
|
|
|
|
|
x = -1;
|
|
|
|
while ((x = bms_next_member(item.attrs, x)) >= 0)
|
|
|
|
{
|
|
|
|
AttrNumber value = (AttrNumber) x;
|
|
|
|
|
|
|
|
memcpy(tmp, &value, sizeof(AttrNumber));
|
|
|
|
tmp += sizeof(AttrNumber);
|
|
|
|
}
|
|
|
|
|
2019-04-21 19:54:15 +02:00
|
|
|
/* protect against overflows */
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
Assert(tmp <= ((char *) output + len));
|
|
|
|
}
|
|
|
|
|
2019-04-21 19:54:15 +02:00
|
|
|
/* check we used exactly the expected space */
|
|
|
|
Assert(tmp == ((char *) output + len));
|
|
|
|
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
return output;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* statext_ndistinct_deserialize
|
|
|
|
* Read an on-disk bytea format MVNDistinct to in-memory format
|
|
|
|
*/
|
|
|
|
MVNDistinct *
|
|
|
|
statext_ndistinct_deserialize(bytea *data)
|
|
|
|
{
|
|
|
|
int i;
|
2017-03-27 19:52:19 +02:00
|
|
|
Size minimum_size;
|
2017-05-17 22:31:56 +02:00
|
|
|
MVNDistinct ndist;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
MVNDistinct *ndistinct;
|
|
|
|
char *tmp;
|
|
|
|
|
|
|
|
if (data == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
2017-03-27 19:52:19 +02:00
|
|
|
/* we expect at least the basic fields of MVNDistinct struct */
|
2019-04-21 19:54:15 +02:00
|
|
|
if (VARSIZE_ANY_EXHDR(data) < SizeOfHeader)
|
2017-04-08 00:27:00 +02:00
|
|
|
elog(ERROR, "invalid MVNDistinct size %zd (expected at least %zd)",
|
2019-04-21 19:54:15 +02:00
|
|
|
VARSIZE_ANY_EXHDR(data), SizeOfHeader);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
/* initialize pointer to the data part (skip the varlena header) */
|
|
|
|
tmp = VARDATA_ANY(data);
|
|
|
|
|
2017-03-27 19:52:19 +02:00
|
|
|
/* read the header fields and perform basic sanity checks */
|
|
|
|
memcpy(&ndist.magic, tmp, sizeof(uint32));
|
|
|
|
tmp += sizeof(uint32);
|
|
|
|
memcpy(&ndist.type, tmp, sizeof(uint32));
|
|
|
|
tmp += sizeof(uint32);
|
|
|
|
memcpy(&ndist.nitems, tmp, sizeof(uint32));
|
|
|
|
tmp += sizeof(uint32);
|
|
|
|
|
|
|
|
if (ndist.magic != STATS_NDISTINCT_MAGIC)
|
2019-05-30 16:16:12 +02:00
|
|
|
elog(ERROR, "invalid ndistinct magic %08x (expected %08x)",
|
|
|
|
ndist.magic, STATS_NDISTINCT_MAGIC);
|
2017-03-27 19:52:19 +02:00
|
|
|
if (ndist.type != STATS_NDISTINCT_TYPE_BASIC)
|
2019-05-30 16:16:12 +02:00
|
|
|
elog(ERROR, "invalid ndistinct type %d (expected %d)",
|
|
|
|
ndist.type, STATS_NDISTINCT_TYPE_BASIC);
|
2017-03-27 19:52:19 +02:00
|
|
|
if (ndist.nitems == 0)
|
2019-05-30 16:16:12 +02:00
|
|
|
elog(ERROR, "invalid zero-length item array in MVNDistinct");
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
/* what minimum bytea size do we expect for those parameters */
|
2019-04-21 19:54:15 +02:00
|
|
|
minimum_size = MinSizeOfItems(ndist.nitems);
|
2017-03-27 19:52:19 +02:00
|
|
|
if (VARSIZE_ANY_EXHDR(data) < minimum_size)
|
2019-05-30 16:16:12 +02:00
|
|
|
elog(ERROR, "invalid MVNDistinct size %zd (expected at least %zd)",
|
|
|
|
VARSIZE_ANY_EXHDR(data), minimum_size);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
2017-03-27 19:52:19 +02:00
|
|
|
/*
|
2017-05-17 22:31:56 +02:00
|
|
|
* Allocate space for the ndistinct items (no space for each item's
|
|
|
|
* attnos: those live in bitmapsets allocated separately)
|
2017-03-27 19:52:19 +02:00
|
|
|
*/
|
2019-04-21 19:54:15 +02:00
|
|
|
ndistinct = palloc0(MAXALIGN(offsetof(MVNDistinct, items)) +
|
2017-03-27 19:52:19 +02:00
|
|
|
(ndist.nitems * sizeof(MVNDistinctItem)));
|
|
|
|
ndistinct->magic = ndist.magic;
|
|
|
|
ndistinct->type = ndist.type;
|
|
|
|
ndistinct->nitems = ndist.nitems;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
for (i = 0; i < ndistinct->nitems; i++)
|
|
|
|
{
|
|
|
|
MVNDistinctItem *item = &ndistinct->items[i];
|
|
|
|
int nelems;
|
|
|
|
|
|
|
|
item->attrs = NULL;
|
|
|
|
|
|
|
|
/* ndistinct value */
|
|
|
|
memcpy(&item->ndistinct, tmp, sizeof(double));
|
|
|
|
tmp += sizeof(double);
|
|
|
|
|
|
|
|
/* number of attributes */
|
|
|
|
memcpy(&nelems, tmp, sizeof(int));
|
|
|
|
tmp += sizeof(int);
|
|
|
|
Assert((nelems >= 2) && (nelems <= STATS_MAX_DIMENSIONS));
|
|
|
|
|
|
|
|
while (nelems-- > 0)
|
|
|
|
{
|
|
|
|
AttrNumber attno;
|
|
|
|
|
|
|
|
memcpy(&attno, tmp, sizeof(AttrNumber));
|
|
|
|
tmp += sizeof(AttrNumber);
|
|
|
|
item->attrs = bms_add_member(item->attrs, attno);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* still within the bytea */
|
|
|
|
Assert(tmp <= ((char *) data + VARSIZE_ANY(data)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* we should have consumed the whole bytea exactly */
|
|
|
|
Assert(tmp == ((char *) data + VARSIZE_ANY(data)));
|
|
|
|
|
|
|
|
return ndistinct;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pg_ndistinct_in
|
|
|
|
* input routine for type pg_ndistinct
|
|
|
|
*
|
|
|
|
* pg_ndistinct is real enough to be a table column, but it has no
|
2019-07-19 01:50:14 +02:00
|
|
|
* operations of its own, and disallows input (just like pg_node_tree).
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
*/
|
|
|
|
Datum
|
|
|
|
pg_ndistinct_in(PG_FUNCTION_ARGS)
|
|
|
|
{
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("cannot accept a value of type %s", "pg_ndistinct")));
|
|
|
|
|
|
|
|
PG_RETURN_VOID(); /* keep compiler quiet */
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pg_ndistinct
|
|
|
|
* output routine for type pg_ndistinct
|
|
|
|
*
|
|
|
|
* Produces a human-readable representation of the value.
|
|
|
|
*/
|
|
|
|
Datum
|
|
|
|
pg_ndistinct_out(PG_FUNCTION_ARGS)
|
|
|
|
{
|
|
|
|
bytea *data = PG_GETARG_BYTEA_PP(0);
|
|
|
|
MVNDistinct *ndist = statext_ndistinct_deserialize(data);
|
|
|
|
int i;
|
|
|
|
StringInfoData str;
|
|
|
|
|
|
|
|
initStringInfo(&str);
|
2017-05-02 23:49:32 +02:00
|
|
|
appendStringInfoChar(&str, '{');
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
for (i = 0; i < ndist->nitems; i++)
|
|
|
|
{
|
|
|
|
MVNDistinctItem item = ndist->items[i];
|
2017-05-17 22:31:56 +02:00
|
|
|
int x = -1;
|
|
|
|
bool first = true;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
if (i > 0)
|
|
|
|
appendStringInfoString(&str, ", ");
|
|
|
|
|
2017-05-02 23:49:32 +02:00
|
|
|
while ((x = bms_next_member(item.attrs, x)) >= 0)
|
|
|
|
{
|
|
|
|
appendStringInfo(&str, "%s%d", first ? "\"" : ", ", x);
|
|
|
|
first = false;
|
|
|
|
}
|
|
|
|
appendStringInfo(&str, "\": %d", (int) item.ndistinct);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
}
|
|
|
|
|
2017-05-02 23:49:32 +02:00
|
|
|
appendStringInfoChar(&str, '}');
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
PG_RETURN_CSTRING(str.data);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pg_ndistinct_recv
|
|
|
|
* binary input routine for type pg_ndistinct
|
|
|
|
*/
|
|
|
|
Datum
|
|
|
|
pg_ndistinct_recv(PG_FUNCTION_ARGS)
|
|
|
|
{
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("cannot accept a value of type %s", "pg_ndistinct")));
|
|
|
|
|
|
|
|
PG_RETURN_VOID(); /* keep compiler quiet */
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pg_ndistinct_send
|
|
|
|
* binary output routine for type pg_ndistinct
|
|
|
|
*
|
|
|
|
* n-distinct is serialized into a bytea value, so let's send that.
|
|
|
|
*/
|
|
|
|
Datum
|
|
|
|
pg_ndistinct_send(PG_FUNCTION_ARGS)
|
|
|
|
{
|
|
|
|
return byteasend(fcinfo);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ndistinct_for_combination
|
|
|
|
* Estimates number of distinct values in a combination of columns.
|
|
|
|
*
|
|
|
|
* This uses the same ndistinct estimator as compute_scalar_stats() in
|
|
|
|
* ANALYZE, i.e.,
|
|
|
|
* n*d / (n - f1 + f1*n/N)
|
|
|
|
*
|
|
|
|
* except that instead of values in a single column we are dealing with
|
|
|
|
* combination of multiple columns.
|
|
|
|
*/
|
|
|
|
static double
|
|
|
|
ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
|
|
|
|
VacAttrStats **stats, int k, int *combination)
|
|
|
|
{
|
|
|
|
int i,
|
|
|
|
j;
|
|
|
|
int f1,
|
|
|
|
cnt,
|
|
|
|
d;
|
|
|
|
bool *isnull;
|
|
|
|
Datum *values;
|
|
|
|
SortItem *items;
|
|
|
|
MultiSortSupport mss;
|
|
|
|
|
|
|
|
mss = multi_sort_init(k);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In order to determine the number of distinct elements, create separate
|
|
|
|
* values[]/isnull[] arrays with all the data we have, then sort them
|
|
|
|
* using the specified column combination as dimensions. We could try to
|
|
|
|
* sort in place, but it'd probably be more complex and bug-prone.
|
|
|
|
*/
|
|
|
|
items = (SortItem *) palloc(numrows * sizeof(SortItem));
|
|
|
|
values = (Datum *) palloc0(sizeof(Datum) * numrows * k);
|
|
|
|
isnull = (bool *) palloc0(sizeof(bool) * numrows * k);
|
|
|
|
|
|
|
|
for (i = 0; i < numrows; i++)
|
|
|
|
{
|
|
|
|
items[i].values = &values[i * k];
|
|
|
|
items[i].isnull = &isnull[i * k];
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-05-17 22:31:56 +02:00
|
|
|
* For each dimension, set up sort-support and fill in the values from the
|
|
|
|
* sample data.
|
Make pg_statistic and related code account more honestly for collations.
When we first put in collations support, we basically punted on teaching
pg_statistic, ANALYZE, and the planner selectivity functions about that.
They've just used DEFAULT_COLLATION_OID independently of the actual
collation of the data. It's time to improve that, so:
* Add columns to pg_statistic that record the specific collation associated
with each statistics slot.
* Teach ANALYZE to use the column's actual collation when comparing values
for statistical purposes, and record this in the appropriate slot. (Note
that type-specific typanalyze functions are now expected to fill
stats->stacoll with the appropriate collation, too.)
* Teach assorted selectivity functions to use the actual collation of
the stats they are looking at, instead of just assuming it's
DEFAULT_COLLATION_OID.
This should give noticeably better results in selectivity estimates for
columns with nondefault collations, at least for query clauses that use
that same collation (which would be the default behavior in most cases).
It's still true that comparisons with explicit COLLATE clauses different
from the stored data's collation won't be well-estimated, but that's no
worse than before. Also, this patch does make the first step towards
doing better with that, which is that it's now theoretically possible to
collect stats for a collation other than the column's own collation.
Patch by me; thanks to Peter Eisentraut for review.
Discussion: https://postgr.es/m/14706.1544630227@sss.pgh.pa.us
2018-12-14 18:52:49 +01:00
|
|
|
*
|
|
|
|
* We use the column data types' default sort operators and collations;
|
|
|
|
* perhaps at some point it'd be worth using column-specific collations?
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
*/
|
|
|
|
for (i = 0; i < k; i++)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
VacAttrStats *colstat = stats[combination[i]];
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
TypeCacheEntry *type;
|
|
|
|
|
|
|
|
type = lookup_type_cache(colstat->attrtypid, TYPECACHE_LT_OPR);
|
2017-05-17 22:31:56 +02:00
|
|
|
if (type->lt_opr == InvalidOid) /* shouldn't happen */
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
elog(ERROR, "cache lookup failed for ordering operator for type %u",
|
|
|
|
colstat->attrtypid);
|
|
|
|
|
|
|
|
/* prepare the sort function for this dimension */
|
2019-07-18 12:28:16 +02:00
|
|
|
multi_sort_add_dimension(mss, i, type->lt_opr, colstat->attrcollid);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
/* accumulate all the data for this dimension into the arrays */
|
|
|
|
for (j = 0; j < numrows; j++)
|
|
|
|
{
|
|
|
|
items[j].values[i] =
|
|
|
|
heap_getattr(rows[j],
|
|
|
|
colstat->attr->attnum,
|
|
|
|
colstat->tupDesc,
|
|
|
|
&items[j].isnull[i]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We can sort the array now ... */
|
|
|
|
qsort_arg((void *) items, numrows, sizeof(SortItem),
|
|
|
|
multi_sort_compare, mss);
|
|
|
|
|
|
|
|
/* ... and count the number of distinct combinations */
|
|
|
|
|
|
|
|
f1 = 0;
|
|
|
|
cnt = 1;
|
|
|
|
d = 1;
|
|
|
|
for (i = 1; i < numrows; i++)
|
|
|
|
{
|
|
|
|
if (multi_sort_compare(&items[i], &items[i - 1], mss) != 0)
|
|
|
|
{
|
|
|
|
if (cnt == 1)
|
|
|
|
f1 += 1;
|
|
|
|
|
|
|
|
d++;
|
|
|
|
cnt = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
cnt += 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (cnt == 1)
|
|
|
|
f1 += 1;
|
|
|
|
|
|
|
|
return estimate_ndistinct(totalrows, numrows, d, f1);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The Duj1 estimator (already used in analyze.c). */
|
|
|
|
static double
|
|
|
|
estimate_ndistinct(double totalrows, int numrows, int d, int f1)
|
|
|
|
{
|
|
|
|
double numer,
|
|
|
|
denom,
|
|
|
|
ndistinct;
|
|
|
|
|
2017-06-21 20:39:04 +02:00
|
|
|
numer = (double) numrows * (double) d;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
denom = (double) (numrows - f1) +
|
2017-06-21 20:39:04 +02:00
|
|
|
(double) f1 * (double) numrows / totalrows;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
ndistinct = numer / denom;
|
|
|
|
|
|
|
|
/* Clamp to sane range in case of roundoff error */
|
|
|
|
if (ndistinct < (double) d)
|
|
|
|
ndistinct = (double) d;
|
|
|
|
|
|
|
|
if (ndistinct > totalrows)
|
|
|
|
ndistinct = totalrows;
|
|
|
|
|
|
|
|
return floor(ndistinct + 0.5);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* n_choose_k
|
|
|
|
* computes binomial coefficients using an algorithm that is both
|
|
|
|
* efficient and prevents overflows
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
n_choose_k(int n, int k)
|
|
|
|
{
|
|
|
|
int d,
|
|
|
|
r;
|
|
|
|
|
|
|
|
Assert((k > 0) && (n >= k));
|
|
|
|
|
|
|
|
/* use symmetry of the binomial coefficients */
|
|
|
|
k = Min(k, n - k);
|
|
|
|
|
|
|
|
r = 1;
|
|
|
|
for (d = 1; d <= k; ++d)
|
|
|
|
{
|
|
|
|
r *= n--;
|
|
|
|
r /= d;
|
|
|
|
}
|
|
|
|
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* num_combinations
|
|
|
|
* number of combinations, excluding single-value combinations
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
num_combinations(int n)
|
|
|
|
{
|
|
|
|
int k;
|
|
|
|
int ncombs = 1;
|
|
|
|
|
|
|
|
for (k = 1; k <= n; k++)
|
|
|
|
ncombs *= 2;
|
|
|
|
|
|
|
|
ncombs -= (n + 1);
|
|
|
|
|
|
|
|
return ncombs;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* generator_init
|
|
|
|
* initialize the generator of combinations
|
|
|
|
*
|
|
|
|
* The generator produces combinations of K elements in the interval (0..N).
|
|
|
|
* We prebuild all the combinations in this method, which is simpler than
|
|
|
|
* generating them on the fly.
|
|
|
|
*/
|
|
|
|
static CombinationGenerator *
|
|
|
|
generator_init(int n, int k)
|
|
|
|
{
|
|
|
|
CombinationGenerator *state;
|
|
|
|
|
|
|
|
Assert((n >= k) && (k > 0));
|
|
|
|
|
|
|
|
/* allocate the generator state as a single chunk of memory */
|
|
|
|
state = (CombinationGenerator *) palloc(sizeof(CombinationGenerator));
|
|
|
|
|
|
|
|
state->ncombinations = n_choose_k(n, k);
|
|
|
|
|
2017-05-17 22:31:56 +02:00
|
|
|
/* pre-allocate space for all combinations */
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
state->combinations = (int *) palloc(sizeof(int) * k * state->ncombinations);
|
|
|
|
|
|
|
|
state->current = 0;
|
|
|
|
state->k = k;
|
|
|
|
state->n = n;
|
|
|
|
|
|
|
|
/* now actually pre-generate all the combinations of K elements */
|
|
|
|
generate_combinations(state);
|
|
|
|
|
|
|
|
/* make sure we got the expected number of combinations */
|
|
|
|
Assert(state->current == state->ncombinations);
|
|
|
|
|
|
|
|
/* reset the number, so we start with the first one */
|
|
|
|
state->current = 0;
|
|
|
|
|
|
|
|
return state;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* generator_next
|
|
|
|
* returns the next combination from the prebuilt list
|
|
|
|
*
|
|
|
|
* Returns a combination of K array indexes (0 .. N), as specified to
|
|
|
|
* generator_init), or NULL when there are no more combination.
|
|
|
|
*/
|
|
|
|
static int *
|
|
|
|
generator_next(CombinationGenerator *state)
|
|
|
|
{
|
|
|
|
if (state->current == state->ncombinations)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return &state->combinations[state->k * state->current++];
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* generator_free
|
|
|
|
* free the internal state of the generator
|
|
|
|
*
|
|
|
|
* Releases the generator internal state (pre-built combinations).
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
generator_free(CombinationGenerator *state)
|
|
|
|
{
|
|
|
|
pfree(state->combinations);
|
|
|
|
pfree(state);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* generate_combinations_recurse
|
|
|
|
* given a prefix, generate all possible combinations
|
|
|
|
*
|
|
|
|
* Given a prefix (first few elements of the combination), generate following
|
|
|
|
* elements recursively. We generate the combinations in lexicographic order,
|
|
|
|
* which eliminates permutations of the same combination.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
generate_combinations_recurse(CombinationGenerator *state,
|
|
|
|
int index, int start, int *current)
|
|
|
|
{
|
|
|
|
/* If we haven't filled all the elements, simply recurse. */
|
|
|
|
if (index < state->k)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
int i;
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The values have to be in ascending order, so make sure we start
|
|
|
|
* with the value passed by parameter.
|
|
|
|
*/
|
|
|
|
|
|
|
|
for (i = start; i < state->n; i++)
|
|
|
|
{
|
|
|
|
current[index] = i;
|
|
|
|
generate_combinations_recurse(state, (index + 1), (i + 1), current);
|
|
|
|
}
|
|
|
|
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* we got a valid combination, add it to the array */
|
|
|
|
memcpy(&state->combinations[(state->k * state->current)],
|
|
|
|
current, state->k * sizeof(int));
|
|
|
|
state->current++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* generate_combinations
|
|
|
|
* generate all k-combinations of N elements
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
generate_combinations(CombinationGenerator *state)
|
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
int *current = (int *) palloc0(sizeof(int) * state->k);
|
Implement multivariate n-distinct coefficients
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations that individual table columns. Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too. All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table. This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve. A new
special pseudo-type pg_ndistinct is used.
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
Author: Tomas Vondra. Some code rework by Álvaro.
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,
Ideriha Takeshi
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz
https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
2017-03-24 18:06:10 +01:00
|
|
|
|
|
|
|
generate_combinations_recurse(state, 0, 0, current);
|
|
|
|
|
|
|
|
pfree(current);
|
|
|
|
}
|