Add support for multivariate MCV lists
Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.
Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.
Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com
2019-03-27 18:32:18 +01:00
|
|
|
MCV lists
|
|
|
|
=========
|
|
|
|
|
|
|
|
Multivariate MCV (most-common values) lists are a straightforward extension of
|
|
|
|
regular MCV list, tracking most frequent combinations of values for a group of
|
|
|
|
attributes.
|
|
|
|
|
|
|
|
This works particularly well for columns with a small number of distinct values,
|
|
|
|
as the list may include all the combinations and approximate the distribution
|
|
|
|
very accurately.
|
|
|
|
|
|
|
|
For columns with a large number of distinct values (e.g. those with continuous
|
|
|
|
domains), the list will only track the most frequent combinations. If the
|
|
|
|
distribution is mostly uniform (all combinations about equally frequent), the
|
|
|
|
MCV list will be empty.
|
|
|
|
|
|
|
|
Estimates of some clauses (e.g. equality) based on MCV lists are more accurate
|
|
|
|
than when using histograms.
|
|
|
|
|
|
|
|
Also, MCV lists don't necessarily require sorting of the values (the fact that
|
|
|
|
we use sorting when building them is implementation detail), but even more
|
|
|
|
importantly the ordering is not built into the approximation (while histograms
|
|
|
|
are built on ordering). So MCV lists work well even for attributes where the
|
|
|
|
ordering of the data type is disconnected from the meaning of the data. For
|
|
|
|
example we know how to sort strings, but it's unlikely to make much sense for
|
|
|
|
city names (or other label-like attributes).
|
|
|
|
|
|
|
|
|
|
|
|
Selectivity estimation
|
|
|
|
----------------------
|
|
|
|
|
|
|
|
The estimation, implemented in mcv_clauselist_selectivity(), is quite simple
|
|
|
|
in principle - we need to identify MCV items matching all the clauses and sum
|
|
|
|
frequencies of all those items.
|
|
|
|
|
|
|
|
Currently MCV lists support estimation of the following clause types:
|
|
|
|
|
|
|
|
(a) equality clauses WHERE (a = 1) AND (b = 2)
|
|
|
|
(b) inequality clauses WHERE (a < 1) AND (b >= 2)
|
|
|
|
(c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL)
|
|
|
|
(d) OR clauses WHERE (a < 1) OR (b >= 2)
|
|
|
|
|
|
|
|
It's possible to add support for additional clauses, for example:
|
|
|
|
|
|
|
|
(e) multi-var clauses WHERE (a > b)
|
|
|
|
|
|
|
|
and possibly others. These are tasks for the future, not yet implemented.
|
|
|
|
|
|
|
|
|
|
|
|
Hashed MCV (not yet implemented)
|
|
|
|
--------------------------------
|
|
|
|
|
|
|
|
Regular MCV lists have to include actual values for each item, so if those items
|
|
|
|
are large the list may be quite large. This is especially true for multivariate
|
|
|
|
MCV lists, although the current implementation partially mitigates this by
|
|
|
|
performing de-duplicating the values before storing them on disk.
|
|
|
|
|
|
|
|
It's possible to only store hashes (32-bit values) instead of the actual values,
|
|
|
|
significantly reducing the space requirements. Obviously, this would only make
|
|
|
|
the MCV lists useful for estimating equality conditions (assuming the 32-bit
|
|
|
|
hashes make the collisions rare enough).
|
|
|
|
|
|
|
|
This might also complicate matching the columns to available stats.
|
|
|
|
|
|
|
|
|
|
|
|
TODO Consider implementing hashed MCV list, storing just 32-bit hashes instead
|
|
|
|
of the actual values. This type of MCV list will be useful only for
|
|
|
|
estimating equality clauses, and will reduce space requirements for large
|
|
|
|
varlena types (in such cases we usually only want equality anyway).
|
|
|
|
|
|
|
|
|
|
|
|
Inspecting the MCV list
|
|
|
|
-----------------------
|
|
|
|
|
|
|
|
Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough
|
|
|
|
to select the columns from pg_stats. The data is encoded as anyarrays, and
|
|
|
|
all the items have the same data type, so anyarray provides a simple way to
|
|
|
|
get a text representation.
|
|
|
|
|
|
|
|
With multivariate MCV lists the columns may use different data types, making
|
|
|
|
it impossible to use anyarrays. It might be possible to produce a similar
|
|
|
|
array-like representation, but that would complicate further processing and
|
|
|
|
analysis of the MCV list.
|
|
|
|
|
|
|
|
So instead the MCV lists are stored in a custom data type (pg_mcv_list),
|
|
|
|
which however makes it more difficult to inspect the contents. To make that
|
|
|
|
easier, there's a SRF returning detailed information about the MCV lists.
|
|
|
|
|
Rework the pg_statistic_ext catalog
Since extended statistic got introduced in PostgreSQL 10, there was a
single catalog pg_statistic_ext storing both the definitions and built
statistic. That's however problematic when a user is supposed to have
access only to the definitions, but not to user data.
Consider for example pg_dump on a database with RLS enabled - if the
pg_statistic_ext catalog respects RLS (which it should, if it contains
user data), pg_dump would not see any records and the result would not
define any extended statistics. That would be a surprising behavior.
Until now this was not a pressing issue, because the existing types of
extended statistic (functional dependencies and ndistinct coefficients)
do not include any user data directly. This changed with introduction
of MCV lists, which do include most common combinations of values.
The easiest way to fix this is to split the pg_statistic_ext catalog
into two - one for definitions, one for the built statistic values.
The new catalog is called pg_statistic_ext_data, and we're maintaining
a 1:1 relationship with the old catalog - either there are matching
records in both catalogs, or neither of them.
Bumped CATVERSION due to changing system catalog definitions.
Author: Dean Rasheed, with improvements by me
Reviewed-by: Dean Rasheed, John Naylor
Discussion: https://postgr.es/m/CAEZATCUhT9rt7Ui%3DVdx4N%3D%3DVV5XOK5dsXfnGgVOz_JhAicB%3DZA%40mail.gmail.com
2019-06-13 17:19:21 +02:00
|
|
|
SELECT m.* FROM pg_statistic_ext s,
|
|
|
|
pg_statistic_ext_data d,
|
|
|
|
pg_mcv_list_items(stxdmcv) m
|
|
|
|
WHERE s.stxname = 'stts2'
|
|
|
|
AND d.stxoid = s.oid;
|
Add support for multivariate MCV lists
Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.
Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.
Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com
2019-03-27 18:32:18 +01:00
|
|
|
|
|
|
|
It accepts one parameter - a pg_mcv_list value (which can only be obtained
|
Rework the pg_statistic_ext catalog
Since extended statistic got introduced in PostgreSQL 10, there was a
single catalog pg_statistic_ext storing both the definitions and built
statistic. That's however problematic when a user is supposed to have
access only to the definitions, but not to user data.
Consider for example pg_dump on a database with RLS enabled - if the
pg_statistic_ext catalog respects RLS (which it should, if it contains
user data), pg_dump would not see any records and the result would not
define any extended statistics. That would be a surprising behavior.
Until now this was not a pressing issue, because the existing types of
extended statistic (functional dependencies and ndistinct coefficients)
do not include any user data directly. This changed with introduction
of MCV lists, which do include most common combinations of values.
The easiest way to fix this is to split the pg_statistic_ext catalog
into two - one for definitions, one for the built statistic values.
The new catalog is called pg_statistic_ext_data, and we're maintaining
a 1:1 relationship with the old catalog - either there are matching
records in both catalogs, or neither of them.
Bumped CATVERSION due to changing system catalog definitions.
Author: Dean Rasheed, with improvements by me
Reviewed-by: Dean Rasheed, John Naylor
Discussion: https://postgr.es/m/CAEZATCUhT9rt7Ui%3DVdx4N%3D%3DVV5XOK5dsXfnGgVOz_JhAicB%3DZA%40mail.gmail.com
2019-06-13 17:19:21 +02:00
|
|
|
from pg_statistic_ext_data catalog, to defend against malicious input), and
|
Add support for multivariate MCV lists
Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.
Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.
Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com
2019-03-27 18:32:18 +01:00
|
|
|
returns these columns:
|
|
|
|
|
|
|
|
- item index (0, ..., (nitems-1))
|
|
|
|
- values (string array)
|
|
|
|
- nulls only (boolean array)
|
|
|
|
- frequency (double precision)
|
|
|
|
- base_frequency (double precision)
|