Commit Graph

23 Commits

Author SHA1 Message Date
Michael Paquier b8da37b3ad Rework pg_input_error_message(), now renamed pg_input_error_info()
pg_input_error_info() is now a SQL function able to return a row with
more than just the error message generated for incorrect data type
inputs when these are able to handle soft failures, returning more
contents of ErrorData, as of:
- The error message (same as before).
- The error detail, if set.
- The error hint, if set.
- SQL error code.

All the regression tests that relied on pg_input_error_message() are
updated to reflect the effects of the rename.

Per discussion with Tom Lane and Andrew Dunstan.

Author: Nathan Bossart
Discussion: https://postgr.es/m/139a68e1-bd1f-a9a7-b5fe-0be9845c6311@dunslane.net
2023-02-28 08:04:13 +09:00
Tom Lane 78212f2101 Convert tsqueryin and tsvectorin to report errors softly.
This is slightly tedious because the adjustments cascade through
a couple of levels of subroutines, but it's not very hard.
I chose to avoid changing function signatures more than absolutely
necessary, by passing the escontext pointer in existing structs
where possible.

tsquery's nuisance NOTICEs about empty queries are suppressed in
soft-error mode, since they're not errors and we surely don't want
them to be shown to the user anyway.  Maybe that whole behavior
should be reconsidered.

Discussion: https://postgr.es/m/3824377.1672076822@sss.pgh.pa.us
2022-12-27 12:00:31 -05:00
Tom Lane cbe25dcff7 Disallow making an empty lexeme via array_to_tsvector().
The tsvector data type has always forbidden lexemes to be empty.
However, array_to_tsvector() didn't get that memo, and would
allow an empty-string array element to become an empty lexeme.
This could result in dump/restore failures later, not to mention
whatever semantic issues might be behind the original prohibition.

However, other functions that take a plain text input directly as
a lexeme value do not need a similar restriction, because they only
match the string against existing tsvector entries.  In particular
it'd be a bad idea to make ts_delete() reject empty strings, since
that is the most convenient way to clean up any bad data that might
have gotten into a tsvector column via this bug.

Reflecting on that, let's also remove the prohibition against NULL
array elements in tsvector_delete_arr and tsvector_setweight_by_filter.
It seems more consistent to ignore them, as an empty-string element
would be ignored.

There's a case for back-patching this, since it's clearly a bug fix.
On balance though, it doesn't seem like something to change in a
minor release.

Jean-Christophe Arnu

Discussion: https://postgr.es/m/CAHZmTm1YVndPgUVRoag2WL0w900XcoiivDDj-gTTYBsG25c65A@mail.gmail.com
2021-11-06 13:28:53 -04:00
Tom Lane 2f2007fbb2 Fix assorted bugs by changing TS_execute's callback API to ternary logic.
Text search sometimes failed to find valid matches, for instance
'!crew:A'::tsquery might fail to locate 'crew:1B'::tsvector during
an index search.  The root of the issue is that TS_execute's callback
functions were not changed to use ternary (yes/no/maybe) reporting
when we made the search logic itself do so.  It's somewhat annoying
to break that API, but on the other hand we now see that any code
using plain boolean logic is almost certainly broken since the
addition of phrase search.  There seem to be very few outside callers
of this code anyway, so we'll just break them intentionally to get
them to adapt.

This allows removal of tsginidx.c's private re-implementation of
TS_execute, since that's now entirely duplicative.  It's also no
longer necessary to avoid use of CALC_NOT in tsgistidx.c, since
the underlying callbacks can now do something reasonable.

Back-patch into v13.  We can't change this in stable branches,
but it seems not quite too late to fix it in v13.

Tom Lane and Pavel Borisov

Discussion: https://postgr.es/m/CALT9ZEE-aLotzBg-pOp2GFTesGWVYzXA3=mZKzRDa_OKnLF7Mg@mail.gmail.com
2020-07-24 15:26:51 -04:00
Tom Lane db89f0e3a4 Fix YA text phrase search bug.
checkcondition_str() failed to report multiple matches for a prefix
pattern correctly: it would dutifully merge the match positions, but
then after exiting that loop, if the last prefix-matching word had
had no suitable positions, it would report there were no matches.
The upshot would be failing to recognize a match that the query
should match.

It looks like you need all of these conditions to see the bug:
* a phrase search (else we don't ask for match position details)
* a prefix search item (else we don't get to this code)
* a weight restriction (else checkclass_str won't fail)

Noted while investigating a problem report from Pavel Borisov,
though this is distinct from the issue he was on about.

Back-patch to 9.6 where phrase search was added.
2020-05-07 15:59:51 -04:00
Tom Lane e81e5741a6 Fix full text search to handle NOT above a phrase search correctly.
Queries such as '!(foo<->bar)' failed to find matching rows when
implemented as a GiST or GIN index search.  That's because of
failing to handle phrase searches as tri-valued when considering
a query without any position information for the target tsvector.
We can only say that the phrase operator might match, not that it
does match; and therefore its NOT also might match.  The previous
coding incorrectly inverted the approximate phrase result to
decide that there was certainly no match.

To fix, we need to make TS_phrase_execute return a real ternary result,
and then bubble that up accurately in TS_execute.  As long as we have
to do that anyway, we can simplify the baroque things TS_phrase_execute
was doing internally to manage tri-valued searching with only a bool
as explicit result.

For now, I left the externally-visible result of TS_execute as a plain
bool.  There do not appear to be any outside callers that need to
distinguish a three-way result, given that they passed in a flag
saying what to do in the absence of position data.  This might need
to change someday, but we wouldn't want to back-patch such a change.

Although tsginidx.c has its own TS_execute_ternary implementation for
use at upper index levels, that sadly managed to get this case wrong
as well :-(.  Fixing it is a lot easier fortunately.

Per bug #16388 from Charles Offenbacher.  Back-patch to 9.6 where
phrase search was introduced.

Discussion: https://postgr.es/m/16388-98cffba38d0b7e6e@postgresql.org
2020-04-27 12:21:04 -04:00
Andrew Gierth da6520be7f More float test and portability fixes.
Avoid assuming exact results in tstypes test; some platforms vary.
(per buildfarm members eulachon, danio, lapwing)

Avoid dubious usage (inherited from upstream) of bool parameters to
copy_special_str, to see if this fixes the mac/ppc failures (per
buildfarm members prariedog and locust). (Isolated test programs on a
ppc mac don't seem to show any other cause that would explain them.)
2019-02-13 19:35:50 +00:00
Andrew Gierth 02ddd49932 Change floating-point output format for improved performance.
Previously, floating-point output was done by rounding to a specific
decimal precision; by default, to 6 or 15 decimal digits (losing
information) or as requested using extra_float_digits. Drivers that
wanted exact float values, and applications like pg_dump that must
preserve values exactly, set extra_float_digits=3 (or sometimes 2 for
historical reasons, though this isn't enough for float4).

Unfortunately, decimal rounded output is slow enough to become a
noticable bottleneck when dealing with large result sets or COPY of
large tables when many floating-point values are involved.

Floating-point output can be done much faster when the output is not
rounded to a specific decimal length, but rather is chosen as the
shortest decimal representation that is closer to the original float
value than to any other value representable in the same precision. The
recently published Ryu algorithm by Ulf Adams is both relatively
simple and remarkably fast.

Accordingly, change float4out/float8out to output shortest decimal
representations if extra_float_digits is greater than 0, and make that
the new default. Applications that need rounded output can set
extra_float_digits back to 0 or below, and take the resulting
performance hit.

We make one concession to portability for systems with buggy
floating-point input: we do not output decimal values that fall
exactly halfway between adjacent representable binary values (which
would rely on the reader doing round-to-nearest-even correctly). This
is known to be a problem at least for VS2013 on Windows.

Our version of the Ryu code originates from
https://github.com/ulfjack/ryu/ at commit c9c3fb1979, but with the
following (significant) modifications:

 - Output format is changed to use fixed-point notation for small
   exponents, as printf would, and also to use lowercase 'e', a
   minimum of 2 exponent digits, and a mandatory sign on the exponent,
   to keep the formatting as close as possible to previous output.

 - The output of exact midpoint values is disabled as noted above.

 - The integer fast-path code is changed somewhat (since we have
   fixed-point output and the upstream did not).

 - Our project style has been largely applied to the code with the
   exception of C99 declaration-after-statement, which has been
   retained as an exception to our present policy.

 - Most of upstream's debugging and conditionals are removed, and we
   use our own configure tests to determine things like uint128
   availability.

Changing the float output format obviously affects a number of
regression tests. This patch uses an explicit setting of
extra_float_digits=0 for test output that is not expected to be
exactly reproducible (e.g. due to numerical instability or differing
algorithms for transcendental functions).

Conversions from floats to numeric are unchanged by this patch. These
may appear in index expressions and it is not yet clear whether any
change should be made, so that can be left for another day.

This patch assumes that the only supported floating point format is
now IEEE format, and the documentation is updated to reflect that.

Code by me, adapting the work of Ulf Adams and other contributors.

References:
https://dl.acm.org/citation.cfm?id=3192369

Reviewed-by: Tom Lane, Andres Freund, Donald Dong
Discussion: https://postgr.es/m/87r2el1bx6.fsf@news-spur.riddles.org.uk
2019-02-13 15:20:33 +00:00
Tom Lane 9d4ca01314 Ensure that a tsquery like '!foo' matches empty tsvectors.
!foo means "the tsvector does not contain foo", and therefore it should
match an empty tsvector.  ts_match_vq() overenthusiastically supposed
that an empty tsvector could never match any query, so it forcibly
returned FALSE, the wrong answer.  Remove the premature optimization.

Our behavior on this point was inconsistent, because while seqscans and
GIST index searches both failed to match empty tsvectors, GIN index
searches would find them, since GIN scans don't rely on ts_match_vq().
That makes this certainly a bug, not a debatable definition disagreement,
so back-patch to all supported branches.

Report and diagnosis by Tom Dunstan (bug #14515); added test cases by me.

Discussion: https://postgr.es/m/20170126025524.1434.97828@wrigleys.postgresql.org
2017-01-26 12:18:07 -05:00
Tom Lane 89fcea1ace Fix strange behavior (and possible crashes) in full text phrase search.
In an attempt to simplify the tsquery matching engine, the original
phrase search patch invented rewrite rules that would rearrange a
tsquery so that no AND/OR/NOT operator appeared below a PHRASE operator.
But this approach had numerous problems.  The rearrangement step was
missed by ts_rewrite (and perhaps other places), allowing tsqueries
to be created that would cause Assert failures or perhaps crashes at
execution, as reported by Andreas Seltenreich.  The rewrite rules
effectively defined semantics for operators underneath PHRASE that were
buggy, or at least unintuitive.  And because rewriting was done in
tsqueryin() rather than at execution, the rearrangement was user-visible,
which is not very desirable --- for example, it might cause unexpected
matches or failures to match in ts_rewrite.

As a somewhat independent problem, the behavior of nested PHRASE operators
was only sane for left-deep trees; queries like "x <-> (y <-> z)" did not
behave intuitively at all.

To fix, get rid of the rewrite logic altogether, and instead teach the
tsquery execution engine to manage AND/OR/NOT below a PHRASE operator
by explicitly computing the match location(s) and match widths for these
operators.

This requires introducing some additional fields into the publicly visible
ExecPhraseData struct; but since there's no way for third-party code to
pass such a struct to TS_phrase_execute, it shouldn't create an ABI problem
as long as we don't move the offsets of the existing fields.

Another related problem was that index searches supposed that "!x <-> y"
could be lossily approximated as "!x & y", which isn't correct because
the latter will reject, say, "x q y" which the query itself accepts.
This required some tweaking in TS_execute_ternary along with the main
tsquery engine.

Back-patch to 9.6 where phrase operators were introduced.  While this
could be argued to change behavior more than we'd like in a stable branch,
we have to do something about the crash hazards and index-vs-seqscan
inconsistency, and it doesn't seem desirable to let the unintuitive
behaviors induced by the rewriting implementation stand as precedent.

Discussion: https://postgr.es/m/28215.1481999808@sss.pgh.pa.us
Discussion: https://postgr.es/m/26706.1482087250@sss.pgh.pa.us
2016-12-21 15:18:39 -05:00
Tom Lane f10eab73df Make array_to_tsvector() sort and de-duplicate the given strings.
This is required for the result to be a legal tsvector value.
Noted while fooling with Andreas Seltenreich's ts_delete() crash.

Discussion: <87invhoj6e.fsf@credativ.de>
2016-08-05 16:09:06 -04:00
Tom Lane c50d192ce3 Fix ts_delete(tsvector, text[]) to cope with duplicate array entries.
Such cases either failed an Assert, or produced a corrupt tsvector in
non-Assert builds, as reported by Andreas Seltenreich.  The reason is
that tsvector_delete_by_indices() just assumed that its input array had
no duplicates.  Fix by explicitly de-duping.

In passing, improve some comments, and fix a number of tests for null
values to use ERRCODE_NULL_VALUE_NOT_ALLOWED not
ERRCODE_INVALID_PARAMETER_VALUE.

Discussion: <87invhoj6e.fsf@credativ.de>
2016-08-05 15:14:19 -04:00
Teodor Sigaev 00f304ce2d Fix parsing NOT sequence in tsquery
Digging around bug #14245 I found that commit
6734a1cacd missed that NOT operation is
right associative in opposite to all other. This miss is resposible for
tsquery parser fail on sequence of NOT operations
2016-07-15 20:01:41 +03:00
Teodor Sigaev 6734a1cacd Change predecence of phrase operator.
<-> operator now have higher predecence than & (AND) operator. This change
was motivated by unexpected difference of similar queries:
'a & b <-> c'::tsquery and 'b <-> c & a'. Before first query means
(a & b) <-> c and second one - '(b <-> c) & a', now phrase operator evaluates
first.

Per suggestion from Tom Lane 32260.1465402409@sss.pgh.pa.us
2016-06-27 20:55:24 +03:00
Teodor Sigaev 028350f619 Make exact distance match for FTS phrase operator
Phrase operator now requires exact distance betweens lexems instead of
less-or-equal.

Per discussion c19fcfec308e6ccd952cdde9e648b505@mail.gmail.com
2016-06-27 20:41:00 +03:00
Tom Lane 0b9a234432 Rename tsvector delete() to ts_delete(), and filter() to ts_filter().
The similarity of the original names to SQL keywords seems like a bad
idea.  Rename them before we're stuck with 'em forever.

In passing, minor code and docs cleanup.

Discussion: <4875.1462210058@sss.pgh.pa.us>
2016-05-05 19:43:32 -04:00
Teodor Sigaev 1ec4c7c055 Restore original tsquery operation numbering.
As noticed by Tom Lane changing operation's number in commit
bb140506df causes on-disk format incompatibility.
Revert to previous numbering, that is reason to add special array to store
priorities of operation. Also it reverts order of tsquery to previous.

Author: Dmitry Ivanov
2016-04-08 20:11:30 +03:00
Teodor Sigaev bb140506df Phrase full text search.
Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery.
On-disk and binary in/out format of tsquery are backward compatible.
It has two side effect:
- change order for tsquery, so, users, who has a btree index over tsquery,
  should reindex it
- less number of parenthesis in tsquery output, and tsquery becomes more
  readable

Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
2016-04-07 18:44:18 +03:00
Teodor Sigaev 6943a946c7 Tsvector editing functions
Adds several tsvector editting function: convert tsvector to/from text array,
set weight for given lexemes, delete lexeme(s), unnest, filter lexemes
with given weights

Author: Stas Kelvich with some editorization by me
Reviewers: Tomas Vondram, Teodor Sigaev
2016-03-11 19:22:36 +03:00
Tom Lane e6dbcb72fa Extend GIN to support partial-match searches, and extend tsquery to support
prefix matching using this facility.

Teodor Sigaev and Oleg Bartunov
2008-05-16 16:31:02 +00:00
Teodor Sigaev a867b40cf4 Fix tsvectorout() and tsqueryout() to escape backslesh, add test of that.
Patch by Bruce Momjian <bruce@momjian.us>

Backpatch is needed, but it's impossible to apply it directly
2007-11-16 15:05:59 +00:00
Bruce Momjian 1c36de33b0 Uppercase keywords in regression tsearch test scripts. 2007-08-21 15:41:13 +00:00
Tom Lane 140d4ebcb4 Tsearch2 functionality migrates to core. The bulk of this work is by
Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing,
so anything that's broken is probably my fault.

Documentation is nonexistent as yet, but let's land the patch so we can
get some portability testing done.
2007-08-21 01:11:32 +00:00