15 Commits

Author SHA1 Message Date
Tom Lane ae4867ec74 Avoid divide-by-zero in regex_selectivity() with long fixed prefix.
Given a regex pattern with a very long fixed prefix (approaching 500
characters), the result of pow(FIXED_CHAR_SEL, fixed_prefix_len) can
underflow to zero.  Typically the preceding selectivity calculation
would have underflowed as well, so that we compute 0/0 and get NaN.
In released branches this leads to an assertion failure later on.
That doesn't happen in HEAD, for reasons I've not explored yet,
but it's surely still a bug.

To fix, just skip the division when the pow() result is zero, so
that we'll (most likely) return a zero selectivity estimate.  In
the edge cases where "sel" didn't yet underflow, perhaps this
isn't desirable, but I'm not sure that the case is worth spending
a lot of effort on.  The results of regex_selectivity_sub() are
barely worth the electrons they're written on anyway :-(
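
A minimal sketch of the guard described above, written in Python rather
than the C of the actual fix.  The function name and the 0.20 constant are
assumptions made for illustration, not a copy of the patched code.

```python
FIXED_CHAR_SEL = 0.20  # assumed per-character selectivity, for illustration only


def discount_fixed_prefix(sel: float, fixed_prefix_len: int) -> float:
    """Divide out the fixed prefix's selectivity, skipping the division on underflow."""
    if fixed_prefix_len > 0:
        prefix_sel = FIXED_CHAR_SEL ** fixed_prefix_len
        # For a very long prefix the power underflows to 0.0.  Dividing by it
        # would be 0/0 (NaN in C, ZeroDivisionError in Python), so skip it
        # and leave sel as it is.
        if prefix_sel > 0.0:
            sel /= prefix_sel
    return sel
```

With a 500-character prefix the power is exactly 0.0 in double precision,
so an already-underflowed sel of 0.0 is returned unchanged instead of
becoming NaN.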

Per report from Alexander Lakhin.  Back-patch to all supported versions.

Discussion: https://postgr.es/m/6de0a0c3-ada9-cd0c-3e4e-2fa9964b41e3@gmail.com
2021-02-12 16:26:47 -05:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Tom Lane 0c882e52a8 Improve ineq_histogram_selectivity's behavior for non-default orderings.
ineq_histogram_selectivity() can be invoked in situations where the
ordering we care about is not that of the column's histogram.  We could
be considering some other collation, or even more drastically, the
query operator might not agree at all with what was used to construct
the histogram.  (We'll get here for anything using scalarineqsel-based
estimators, so that's quite likely to happen for extension operators.)

Up to now we just ignored this issue and assumed we were dealing with
an operator/collation whose sort order exactly matches the histogram,
possibly resulting in junk estimates if the binary search gets confused.
It's past time to improve that, since the use of nondefault collations
is increasing.  What we can do is verify that the given operator and
collation match what's recorded in pg_statistic, and use the existing
code only if so.  When they don't match, instead execute the operator
against each histogram entry, and take the fraction of successes as our
selectivity estimate.  This gives an estimate that is probably good to
about 1/histogram_size, with no assumptions about ordering.  (The quality
of the estimate is likely to degrade near the ends of the value range,
since the two orderings probably don't agree on what is an extremal value;
but this is surely going to be more reliable than what we did before.)
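
The fallback described above can be sketched roughly as follows.  This is
an illustrative Python sketch, not the planner's C code; histogram_fraction
and its arguments are hypothetical names.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def histogram_fraction(histogram: Sequence[T], const: T,
                       op: Callable[[T, T], bool]) -> float:
    """Fraction of histogram entries for which `op(entry, const)` is true.

    No assumption is made about how the histogram is ordered, so the
    result is only good to about 1/len(histogram).
    """
    if not histogram:
        raise ValueError("empty histogram")
    hits = sum(1 for entry in histogram if op(entry, const))
    return hits / len(histogram)
```

For example, a histogram stored in byte order, ["A", "B", "a", "b"],
queried with a case-insensitive "less than" against the constant "b", has
two matching entries ("A" and "a"), giving an estimate of 0.5.  A binary
search over the same histogram would assume the wrong ordering.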

At some point we might further improve matters by storing more than one
histogram calculated according to different orderings.  But this code
would still be good fallback logic when no matches exist, so that is
no argument against doing this.

While here, also improve get_variable_range() to deal more honestly
with non-default collations.

This isn't back-patchable, because it requires adding another argument
to ineq_histogram_selectivity, and because it might have significant
impact on the estimation results for extension operators relying on
scalarineqsel --- mostly for the better, one hopes, but in any case
destabilizing plan choices in back branches is best avoided.

Per investigation of a report from James Lucas.

Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
2020-06-05 16:55:27 -04:00
Tom Lane 044c99bc56 Use query collation, not column's collation, while examining statistics.
Commit 5e0928005 changed the planner so that, instead of blindly using
DEFAULT_COLLATION_OID when invoking operators for selectivity estimation,
it would use the collation of the column whose statistics we're
considering.  This was recognized as still being not quite the right
thing, but it seemed like a good incremental improvement.  However,
shortly thereafter we introduced nondeterministic collations, and that
creates cases where operators can fail if they're passed the wrong
collation.  We don't want planning to fail in cases where the query itself
would work, so this means that we *must* use the query's collation when
invoking operators for estimation purposes.

The only real problem this creates is in ineq_histogram_selectivity, where
the binary search might produce a garbage answer if we perform comparisons
using a different collation than the column's histogram is ordered with.
However, when the query's collation is significantly different from the
column's default collation, the estimate we previously generated would be
pretty irrelevant anyway; so it's not clear that this will result in
noticeably worse estimates in practice.  (A follow-on patch will improve
this situation in HEAD, but it seems too invasive for back-patch.)

The patch requires changing the signatures of mcv_selectivity and allied
functions, which are exported and very possibly are used by extensions.
In HEAD, I just did that, but an API/ABI break of this sort isn't
acceptable in stable branches.  Therefore, in v12 the patch introduces
"mcv_selectivity_ext" and so on, with signatures matching HEAD, and makes
the old functions into wrappers that assume DEFAULT_COLLATION_OID should
be used.  That does not match the prior behavior, but it should avoid risk
of failure in most cases.  (In practice, I think most extension datatypes
aren't collation-aware, so the change probably doesn't matter to them.)

Per report from James Lucas.  Back-patch to v12 where the problem was
introduced.

Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
2020-06-05 16:18:50 -04:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Tom Lane 9ff5b699ed Sync patternsel_common's operator selection logic with pattern_prefix's.
Make patternsel_common() select the comparison operators to use with
hardwired logic that matches pattern_prefix()'s new logic, eliminating
its dependencies on particular index opfamilies.

This shouldn't change any behavior, as it's just replacing runtime
operator lookups with the same values hard-wired.  But it makes these
closely-related functions look more alike, and saving some runtime
syscache lookups is worth something.

Actually, it's not quite true that this is zero behavioral change:
when estimating for a column of type "name", the comparison constant
will be kept as "text" not coerced to "name".  But that's more correct
anyway, and it allows additional simplification of the coercion logic,
again syncing this more closely with pattern_prefix().

Per consideration of a report from Manuel Rigger.

Discussion: https://postgr.es/m/CA+u7OA7nnGYy8rY0vdTe811NuA+Frr9nbcBO9u2Z+JxqNaud+g@mail.gmail.com
2019-11-20 15:00:18 -05:00
Tom Lane 2ddedcafca Reduce match_pattern_prefix()'s dependencies on index opfamilies.
Historically, the planner's LIKE/regex index optimizations were only
carried out for specific index opfamilies.  That's never been a great
idea from the standpoint of extensibility, but it didn't matter so
much as long as we had no practical way to extend such behaviors anyway.
With the addition of planner support functions, and in view of ongoing
work to support additional table and index AMs, it seems like a good
time to relax this.

Hence, recast the decisions in match_pattern_prefix() so that rather
than decide which operators to generate by looking at what the index
opfamily contains, we decide which operators to generate a priori
and then see if the opfamily supports them.  This is much more
defensible from a semantic standpoint anyway, since we know the
semantics of the chosen operators precisely, and we only need to
assume that the opfamily correctly implements operators it claims
to support.
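
As a rough illustration of that ordering of decisions (pick the operators
first, then ask the opfamily), here is a small Python sketch.  The function
name, the string operator names and the set-based opfamily are all
hypothetical simplifications; the real code works with operator OIDs and
still special-cases the "pattern" opfamilies.

```python
from typing import List, Optional, Set


def choose_index_operators(pattern_is_exact: bool,
                           opfamily_ops: Set[str]) -> Optional[List[str]]:
    """Decide the wanted operators up front, then check opfamily support."""
    # Step 1: decided purely from the pattern, without looking at the index.
    wanted = ["="] if pattern_is_exact else [">=", "<"]
    # Step 2: only now ask whether the opfamily claims to support them.
    if all(op in opfamily_ops for op in wanted):
        return wanted
    return None  # no usable index condition for this opfamily
```

Under this sketch an opfamily offering only "=" can serve an exact-match
pattern but not a prefix range, while one offering the ordering operators
can serve both.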

The existing "pattern" opfamilies put a crimp in this approach, since
we need to select the pattern operators if we want those to work.
So we still have to special-case those opfamilies.  But that seems
all right, since in view of the addition of collations, the pattern
opfamilies seem like a legacy hack that nobody will be building on.

The only immediate effect of this change, so far as the core code is
concerned, is that anchored LIKE/regex patterns can be mapped onto
BRIN index searches, and exact-match patterns can be mapped onto hash
indexes, not only btree and spgist indexes as before.  That's not a
terribly exciting result, but it does fix an omission mentioned in
the ancient comments here.

Note: no catversion bump, even though this touches pg_operator.dat,
because it's only adding OID macros not changing the contents of
postgres.bki.

Per consideration of a report from Manuel Rigger.

Discussion: https://postgr.es/m/CA+u7OA7nnGYy8rY0vdTe811NuA+Frr9nbcBO9u2Z+JxqNaud+g@mail.gmail.com
2019-11-20 14:13:04 -05:00
Tom Lane b3c265d7be Fix corner-case failure in match_pattern_prefix().
The planner's optimization code for LIKE and regex operators could
error out with a complaint like "no = operator for opfamily NNN"
if someone created a binary-compatible index (for example, a
bpchar_ops index on a text column) on the LIKE's left argument.

This is a consequence of careless refactoring in commit 74dfe58a5.
The old code in match_special_index_operator only accepted specific
combinations of the pattern operator and the index opclass, thereby
indirectly guaranteeing that the opclass would have a comparison
operator with the same LHS input type as the pattern operator.
While moving the logic out to a planner support function, I simplified
that test in a way that no longer guarantees that.  Really though we'd
like an altogether weaker dependency on the opclass, so rather than
put back exactly the old code, just allow lookup failure.  I have in
mind now to rewrite this logic completely, but this is the minimum
change needed to fix the bug in v12.

Per report from Manuel Rigger.  Back-patch to v12 where the mistake
came in.

Discussion: https://postgr.es/m/CA+u7OA7nnGYy8rY0vdTe811NuA+Frr9nbcBO9u2Z+JxqNaud+g@mail.gmail.com
2019-11-19 17:03:34 -05:00
Tom Lane 03c811a483 Fix planner's test for case-foldable characters in ILIKE with ICU.
As coded, the ICU-collation path in pattern_char_isalpha() failed
to consider regular ASCII letters to be case-varying.  This led to
like_fixed_prefix treating too much of an ILIKE pattern as being a
fixed prefix, so that indexscans derived from an ILIKE clause might
miss entries that they should find.
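
To show why the letter test matters, here is a simplified Python sketch of
fixed-prefix extraction for a case-insensitive LIKE pattern.  It is not
the C implementation: ilike_fixed_prefix is a hypothetical name, and
"case-varying" is approximated by comparing lower() and upper().

```python
def ilike_fixed_prefix(pattern: str) -> str:
    """Return the leading part of an ILIKE pattern that every match must share."""
    prefix = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "%_":          # wildcard ends the fixed prefix
            break
        if ch == "\\":          # backslash quotes the next character
            i += 1
            if i >= len(pattern):
                break
            ch = pattern[i]
        # A case-varying character cannot be part of a fixed prefix,
        # because both of its cases must be allowed to match.
        if ch.lower() != ch.upper():
            break
        prefix.append(ch)
        i += 1
    return "".join(prefix)
```

For the pattern "123abc%" the fixed prefix is "123".  If ASCII letters
were wrongly treated as not case-varying, the prefix would be "123abc",
and an index scan built on it would miss rows such as "123ABCx".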

Per bug #15892 from James Inform.  This is an oversight in the original
ICU patch (commit eccfef81e), so back-patch to v10 where that came in.

Discussion: https://postgr.es/m/15892-e5d2bea3e8a04a1b@postgresql.org
2019-08-12 13:15:47 -04:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Peter Eisentraut abb9c63b2c Unbreak index optimization for LIKE on bytea
The same code is used to handle both text and bytea, but bytea is not
collation-aware, so we shouldn't call get_collation_isdeterministic()
in that case, since that will error out with an invalid collation.

Reported-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/CAM2%2B6%3DWaf3qJ1%3DyVTUH8_yG-SC0xcBMY%2BSFLhvKKNnWNXSUDBw%40mail.gmail.com
2019-04-15 09:29:17 +02:00
Peter Eisentraut 5e1963fb76 Collations with nondeterministic comparison
This adds a flag "deterministic" to collations.  If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal.  That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider.  At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.

The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support
  this new flag.

- Many executor nodes and auxiliary code are extended to track
  collations.  Previously, this code would just throw away collation
  information, because the eventually-called user-defined functions
  didn't use it since they only cared about equality, which didn't
  need collation information.

- String data type functions that do equality comparisons and hashing
  are changed to take the (non-)deterministic flag into account.  For
  comparison, this just means skipping various shortcuts and tie
  breakers that use byte-wise comparison.  For hashing, we first need
  to convert the input string to a canonical "sort key" using the ICU
  analogue of strxfrm().
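
The requirement that hashing agree with equality can be sketched in Python
as below.  This is only an illustration: the sort_key function uses
Unicode normalization plus case folding as a stand-in for a real ICU sort
key, and all three function names are hypothetical.

```python
import unicodedata


def sort_key(s: str) -> str:
    """Stand-in for a collation sort key: normalize, then fold case."""
    return unicodedata.normalize("NFC", s).casefold()


def nondeterministic_equal(a: str, b: str) -> bool:
    # Strings are equal if their canonical keys match, even when their
    # bytes differ, so a byte-wise shortcut must not be used here.
    return sort_key(a) == sort_key(b)


def nondeterministic_hash(s: str) -> int:
    # Hash the canonical key, not the raw string, so that equal strings
    # always land in the same hash bucket.
    return hash(sort_key(s))
```

With this stand-in, "RÉSUMÉ" written with precomposed characters and
"résumé" written with combining accents compare equal and hash alike,
although they are not byte-wise equal.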

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
2019-03-22 12:12:43 +01:00
Tom Lane 49fa99e54e Move pattern selectivity code from selfuncs.c to like_support.c.
While at it, refactor patternsel() a bit so that it can be used from
the LIKE/regex planner support functions as well.  This makes the
planner able to deal equally well with either operator or function
syntax for these operations.  I'm not excited about that as a feature
in itself, but it provides a nice model for extensions to follow if
they want such behavior for their operations.

This change localizes the use of pattern_fixed_prefix() and
make_greater_string() so that they no longer need be exported.
(We might get pushback from extensions about that, perhaps,
in which case I'd be inclined to re-export them in a new header
file like_support.h.)

This reduces the bulk of selfuncs.c a fair amount, removing ~1370
lines or about one-sixth of that file; it's still too big, but this
is progress.

Discussion: https://postgr.es/m/24537.1550093915@sss.pgh.pa.us
2019-02-14 10:51:59 -05:00
Tom Lane 74dfe58a59 Allow extensions to generate lossy index conditions.
For a long time, indxpath.c has had the ability to extract derived (lossy)
index conditions from certain operators such as LIKE.  For just as long,
it's been obvious that we really ought to make that capability available
to extensions.  This commit finally accomplishes that, by adding another
API for planner support functions that lets them create derived index
conditions for their functions.  As proof of concept, the hardwired
"special index operator" code formerly present in indxpath.c is pushed
out to planner support functions attached to LIKE and other relevant
operators.
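
As an illustration of what a derived (lossy) index condition looks like,
the following Python sketch turns a LIKE prefix into a range test.  It is
a simplification under stated assumptions: prefix_range is a hypothetical
name, and strings are compared by code point, ignoring collation.

```python
from typing import Optional, Tuple


def prefix_range(prefix: str) -> Tuple[str, Optional[str]]:
    """Return (lower, upper) with lower <= s < upper for every s starting with prefix.

    upper is None when no greater string can be formed, in which case only
    the lower bound can be used.
    """
    chars = list(prefix)
    while chars:
        last = ord(chars[-1])
        if last < 0x10FFFF:
            # Bump the last character to get the smallest string that sorts
            # after everything beginning with the prefix.
            chars[-1] = chr(last + 1)
            return prefix, "".join(chars)
        chars.pop()  # cannot increment; try the previous character
    return prefix, None
```

For the prefix "abc" this yields the range "abc" <= col < "abd".  The
condition is lossy: for a pattern like 'abc%z' the range also admits
"abcq", so the original operator still has to be rechecked on each row
the index returns.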

A weak spot in this design is that an extension needs to know OIDs for
the operators, datatypes, and opfamilies involved in the transformation
it wants to make.  The core-code prototypes use hard-wired OID references
but extensions don't have that option for their own operators etc.  It's
usually possible to look up the required info, but that may be slow and
inconvenient.  However, improving that situation is a separate task.

I want to do some additional refactoring around selfuncs.c, but
that also seems like a separate task.

Discussion: https://postgr.es/m/15193.1548028093@sss.pgh.pa.us
2019-02-11 21:26:14 -05:00