Commit Graph

112 Commits

Author SHA1 Message Date
Tom Lane 78212f2101 Convert tsqueryin and tsvectorin to report errors softly.
This is slightly tedious because the adjustments cascade through
a couple of levels of subroutines, but it's not very hard.
I chose to avoid changing function signatures more than absolutely
necessary, by passing the escontext pointer in existing structs
where possible.

tsquery's nuisance NOTICEs about empty queries are suppressed in
soft-error mode, since they're not errors and we surely don't want
them to be shown to the user anyway.  Maybe that whole behavior
should be reconsidered.

Discussion: https://postgr.es/m/3824377.1672076822@sss.pgh.pa.us
2022-12-27 12:00:31 -05:00
Daniel Gustafsson f1d042b21d Replace link to Hunspell with the current homepage
The Hunspell project moved from Sourceforge to Github sometime
in 2016, so update our links to match the new URL.  Backpatch
the doc changes to all supported versions.

Discussion: https://postgr.es/m/DC9A662A-360D-4125-A453-5A6CB9C6C4B4@yesql.se
Backpatch-through: v11
2022-11-21 23:25:48 +01:00
Tom Lane 5644d6f909 Add comments and a missing CHECK_FOR_INTERRUPTS in ts_headline.
I just spent an annoying amount of time reverse-engineering the
100%-undocumented API between ts_headline and the text search
parser's prsheadline function.  Add some commentary about that
while it's fresh in mind.  Also remove some unused macros in
wparser_def.c.

While at it, I noticed that when commit 78e73e875 added a
CHECK_FOR_INTERRUPTS call in TS_execute_recurse, it missed
doing so in the parallel function TS_phrase_execute, which
surely needs one just as much.

Back-patch because of the missing CHECK_FOR_INTERRUPTS.
Might as well back-patch the rest of this too.
2022-11-21 17:07:29 -05:00
Tom Lane ca71131eeb Introduce t_isalnum() to replace t_isalpha() || t_isdigit() tests.
ts_locale.c omitted support for "isalnum" tests, perhaps on the
grounds that there were initially no use-cases for that.  However,
both ltree and pg_trgm need such tests, and we do also have one
use-case now in the core backend.  The workaround of testing
isalpha and isdigit separately seems quite inefficient, especially
when dealing with multibyte characters; so let's fill in the
missing support.

Discussion: https://postgr.es/m/2548310.1664999615@sss.pgh.pa.us
2022-10-06 11:08:56 -04:00
Peter Eisentraut c8b2ef05f4 Convert *GetDatum() and DatumGet*() macros to inline functions
The previous macro implementations just cast the argument to a target
type but did not check whether the input type was appropriate.  The
function implementation can do better type checking of the input type.

For the *GetDatumFast() macros, converting to an inline function
doesn't work in the !USE_FLOAT8_BYVAL case, but we can use
AssertVariableIsOfTypeMacro() to get a similar level of type checking.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
2022-09-27 20:50:21 +02:00
Peter Geoghegan a601366a46 Harmonize more parameter names in bulk.
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in optimizer, parser,
utility, libpq, and "commands" code, as well as in remaining library
code.  Do the same for all code related to frontend programs (with the
exception of pg_dump/pg_dumpall related code).

Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy.  Later commits will handle
ecpg and pg_dump/pg_dumpall.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
2022-09-20 13:09:30 -07:00
Tom Lane 0a20ff54f5 Split up guc.c for better build speed and ease of maintenance.
guc.c has grown to be one of our largest .c files, making it
a bottleneck for compilation.  It's also acquired a bunch of
knowledge that'd be better kept elsewhere, because of our not
very good habit of putting variable-specific check hooks here.
Hence, split it up along these lines:

* guc.c itself retains just the core GUC housekeeping mechanisms.
* New file guc_funcs.c contains the SET/SHOW interfaces and some
  SQL-accessible functions for GUC manipulation.
* New file guc_tables.c contains the data arrays that define the
  built-in GUC variables, along with some already-exported constant
  tables.
* GUC check/assign/show hook functions are moved to the variable's
  home module, whenever that's clearly identifiable.  A few hard-
  to-classify hooks ended up in commands/variable.c, which was
  already a home for miscellaneous GUC hook functions.

To avoid cluttering a lot more header files with #include "guc.h",
I also invented a new header file utils/guc_hooks.h and put all
the GUC hook functions' declarations there, regardless of their
originating module.  That allowed removal of #include "guc.h"
from some existing headers.  The fallout from that (hopefully
all caught here) demonstrates clearly why such inclusions are
best minimized: there are a lot of files that, for example,
were getting array.h at two or more levels of remove, despite
not having any connection at all to GUCs in themselves.

There is some very minor code beautification here, such as
renaming a couple of inconsistently-named hook functions
and improving some comments.  But mostly this just moves
code from point A to point B and deals with the ensuing
needs for #include adjustments and exporting a few functions
that previously weren't exported.

Patch by me, per a suggestion from Andres Freund; thanks also
to Michael Paquier for the idea to invent guc_funcs.c.

Discussion: https://postgr.es/m/587607.1662836699@sss.pgh.pa.us
2022-09-13 11:11:45 -04:00
Peter Eisentraut e8d78581bb Revert "Convert *GetDatum() and DatumGet*() macros to inline functions"
This reverts commit 595836e99b.

It has problems when USE_FLOAT8_BYVAL is off.
2022-09-12 19:57:07 +02:00
Peter Eisentraut 595836e99b Convert *GetDatum() and DatumGet*() macros to inline functions
The previous macro implementations just cast the argument to a target
type but did not check whether the input type was appropriate.  The
function implementation can do better type checking of the input type.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
2022-09-12 17:36:26 +02:00
Thomas Munro 634a89c708 Remove configure probe for wctype.h.
This header is present in SUSv2 and Windows.

Also remove the inclusion of <wchar.h>, following clues that it was only
included for the benefit of historical systems that didn't have
<wctype.h>.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA%2BhUKGKAmTgbg_hMiGG5T7pkpzOnY1cWFAHYtZXHCpqeC_hCkA%40mail.gmail.com
2022-07-23 16:54:00 +12:00
Andres Freund 905c020bef Add missing 'extern' to function prototypes.
Postgres style is to spell out extern. Noticed while scripting adding
PGDLLIMPORT markers to functions.

Discussion: https://postgr.es/m/20220512164513.vaheofqp2q24l65r@alap3.anarazel.de
2022-05-12 12:39:33 -07:00
David Rowley a00fd066b1 Add missing spaces after single-line comments
Only 1 of 3 of these changes appear to be handled by pgindent. That change
is new to v15.  The remaining two appear to be left alone by pgindent. The
exact reason for that is not 100% clear to me.  It seems related to the
fact that it's a line that contains *only* a single line comment and no
actual code.  It does not seem worth investigating this in too much
detail.  In any case, these do not conform to our usual practices, so fix
them.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com
2022-04-14 09:28:56 +12:00
Robert Haas 8ec569479f Apply PGDLLIMPORT markings broadly.
Up until now, we've had a policy of only marking certain variables
in the PostgreSQL header files with PGDLLIMPORT, but now we've
decided to mark them all. This means that extensions running on
Windows should no longer operate at a disadvantage as compared to
extensions running on Linux: if the variable is present in a header
file, it should be accessible.

Discussion: http://postgr.es/m/CA+TgmoYanc1_FSfimhgiWSqVyP5KKmh5NP2BWNwDhO8Pg2vGYQ@mail.gmail.com
2022-04-08 08:16:38 -04:00
Bruce Momjian 27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Tom Lane d303849b05 Don't leak compiled regex(es) when an ispell cache entry is dropped.
The text search cache mechanisms assume that we can clean up
an invalidated dictionary cache entry simply by resetting the
associated long-lived memory context.  However, that does not work
for ispell affixes that make use of regular expressions, because
the regex library deals in plain old malloc.  Hence, we leaked
compiled regex(es) any time we dropped such a cache entry.  That
could quickly add up, since even a fairly trivial regex can use up
tens of kB, and a large one can eat megabytes.  Add a memory context
callback to ensure that a regex gets freed when its owning cache
entry is cleared.

Found via valgrind testing.
This problem is ancient, so back-patch to all supported branches.

Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us
2021-03-18 22:22:47 -04:00
Tom Lane 38bb3aef35 Convert tsginidx.c's GIN indexing logic to fully ternary operation.
Commit 2f2007fbb did this partially, but there were two remaining
warts.  checkcondition_gin handled some uncertain cases by setting
the out-of-band recheck flag, some by returning TS_MAYBE, and some
by doing both.  Meanwhile, TS_execute arbitrarily converted a
TS_MAYBE result to TS_YES.  Thus, if checkcondition_gin chose to
only return TS_MAYBE, the outcome would be TS_YES with no recheck
flag, potentially resulting in wrong query outputs.

The case where this'd happen is if there were GIN_MAYBE entries
in the indexscan results passed to gin_tsquery_[tri]consistent,
which so far as I can see would only happen if the tidbitmap used
to accumulate indexscan results grew large enough to become lossy.

I initially thought of fixing this by ensuring we always set the
recheck flag as well as returning TS_MAYBE in uncertain cases.
But that errs in the other direction, potentially forcing rechecks
of rows that provably match the query (since the recheck flag
remains set even if TS_execute later finds that the answer must be
TS_YES).  Instead, let's get rid of the out-of-band recheck flag
altogether and rely on returning TS_MAYBE.  This requires exporting
a version of TS_execute that will actually return the full ternary
result of the evaluation ... but we likely should have done that
to start with.

Unfortunately it doesn't seem practical to add a regression test case
that covers this: the amount of data needed to cause the GIN bitmap to
become lossy results in a longer runtime than I think we want to have
in the tests.  (I'm wondering about allowing smaller work_mem settings
to ameliorate that, but it'd be a matter for a separate patch.)

Per bug #16865 from Dimitri Nüscheler.  Back-patch to v13 where
the faulty commit came in.

Discussion: https://postgr.es/m/16865-4ffdc3e682e6d75b@postgresql.org
2021-02-16 12:07:14 -05:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Tom Lane 83b61319a1 Improve behavior of tsearch_readline(), and remove t_readline().
Commit fbeb9da22, which added the tsearch_readline APIs, left
t_readline() in place as a compatibility measure.  But that function
has been unused and deprecated for twelve years now, so that seems
like enough time to remove it.  Doing so, and merging t_readline's
code into tsearch_readline, aids in making several useful
improvements:

* The hard-wired 4K limit on line length in tsearch data files is
removed, by using a StringInfo buffer instead of a fixed-size buffer.

* We can buy back the per-line palloc/pfree added by 3ea7e9550
in the common case where encoding conversion is not required.

* We no longer need a separate pg_verify_mbstr call, as that
functionality was folded into encoding conversion some time ago.

(We could have done some of this stuff while keeping t_readline as a
separate API, but there seems little point, since there's no reason
for anyone to still be using t_readline directly.)

Discussion: https://postgr.es/m/48A4FA71-524E-41B9-953A-FD04EF36E2E7@yesql.se
2020-09-23 20:26:58 -04:00
Tom Lane 79d6d1a277 Replace TS_execute's TS_EXEC_CALC_NOT flag with TS_EXEC_SKIP_NOT.
It's fairly silly that ignoring NOT subexpressions is TS_execute's
default behavior.  It's wrong on its face and it encourages errors
of omission.  Moreover, the only two remaining callers that aren't
specifying CALC_NOT are in ts_headline calculations, and it's very
arguable that those are bugs: if you've specified "!foo" in your
query, why would you want to get a headline that includes "foo"?

Hence, rip that out and change the default behavior to be to calculate
NOT accurately.  As a concession to the slim chance that there is still
somebody somewhere who needs the incorrect behavior, provide a new
SKIP_NOT flag to explicitly request that.

Back-patch into v13, mainly because it seems better to change this
at the same time as the previous commit's rejiggering of TS_execute
related APIs.  Any outside callers affected by this change are
probably also affected by that one.

Discussion: https://postgr.es/m/CALT9ZEE-aLotzBg-pOp2GFTesGWVYzXA3=mZKzRDa_OKnLF7Mg@mail.gmail.com
2020-07-24 15:43:56 -04:00
Tom Lane 2f2007fbb2 Fix assorted bugs by changing TS_execute's callback API to ternary logic.
Text search sometimes failed to find valid matches, for instance
'!crew:A'::tsquery might fail to locate 'crew:1B'::tsvector during
an index search.  The root of the issue is that TS_execute's callback
functions were not changed to use ternary (yes/no/maybe) reporting
when we made the search logic itself do so.  It's somewhat annoying
to break that API, but on the other hand we now see that any code
using plain boolean logic is almost certainly broken since the
addition of phrase search.  There seem to be very few outside callers
of this code anyway, so we'll just break them intentionally to get
them to adapt.

This allows removal of tsginidx.c's private re-implementation of
TS_execute, since that's now entirely duplicative.  It's also no
longer necessary to avoid use of CALC_NOT in tsgistidx.c, since
the underlying callbacks can now do something reasonable.

Back-patch into v13.  We can't change this in stable branches,
but it seems not quite too late to fix it in v13.

Tom Lane and Pavel Borisov

Discussion: https://postgr.es/m/CALT9ZEE-aLotzBg-pOp2GFTesGWVYzXA3=mZKzRDa_OKnLF7Mg@mail.gmail.com
2020-07-24 15:26:51 -04:00
Tom Lane 97cf1fa4ed Assume that we have <wchar.h>.
Windows has this, and so do all other live platforms according to the
buildfarm; it's been required by POSIX since SUSv2.  So remove the
configure probe and tests of HAVE_WCHAR_H.

This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code.  I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.

Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
2020-02-21 14:30:47 -05:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Amit Kapila e0487223ec Make the order of the header file includes consistent.
Similar to commits 14aec03502, 7e735035f2 and dddf4cdc33, this commit
makes the order of header file inclusion consistent in more places.

Author: Vignesh C
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
2019-11-25 08:08:57 +05:30
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Tom Lane bdf46af748 Post-feature-freeze pgindent run.
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-26 14:47:16 -04:00
Teodor Sigaev 1664ae1978 Add websearch_to_tsquery
Error-tolerant conversion function with web-like syntax for search query,
it simplifies  constraining search engine with close to habitual interface for
users.

Bump catalog version

Authors: Victor Drobny, Dmitry Ivanov with editorization by me
Reviewed by: Aleksander Alekseev, Tomas Vondra, Thomas Munro, Aleksandr Parfenov
Discussion: https://www.postgresql.org/message-id/flat/fe931111ff7e9ad79196486ada79e268@postgrespro.ru
2018-04-05 19:55:11 +03:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Peter Eisentraut 2eb4a831e5 Change TRUE/FALSE to true/false
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources.  The upper case spellings are only used
in some files/modules.  So standardize on the standard spellings.

The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.

In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
2017-11-08 11:37:28 -05:00
Tom Lane 85feb77aa0 Assume wcstombs(), towlower(), and sibling functions are always present.
These functions are required by SUS v2, which is our minimum baseline
for Unix platforms, and are present on all interesting Windows versions
as well.  Even our oldest buildfarm members have them.  Thus, we were not
testing the "!USE_WIDE_UPPER_LOWER" code paths, which explains why the bug
fixed in commit e6023ee7f escaped detection.  Per discussion, there seems
to be no more real-world value in maintaining this option.  Hence, remove
the configure-time tests for wcstombs() and towlower(), remove the
USE_WIDE_UPPER_LOWER symbol, and remove all the !USE_WIDE_UPPER_LOWER code.
There's not actually all that much of the latter, but simplifying the #if
nests is a win in itself.

Discussion: https://postgr.es/m/20170921052928.GA188913@rfd.leadboat.com
2017-09-22 11:00:58 -04:00
Tom Lane b4c6d31c0b Fix serious performance problems in json(b) to_tsvector().
In an off-list followup to bug #14745, Bob Jones complained that
to_tsvector() on a 2MB jsonb value took an unreasonable amount of
time and space --- enough to draw the wrath of the OOM killer on
his machine.  On my machine, his example proved to require upwards
of 18 seconds and 4GB, which seemed pretty bogus considering that
to_tsvector() on the same data treated as text took just a couple
hundred msec and 10 or so MB.

On investigation, the problem is that the implementation scans each
string element of the json(b) and converts it to tsvector separately,
then applies tsvector_concat() to join those separate tsvectors.
The unreasonable memory usage came from leaking every single one of
the transient tsvectors --- but even without that mistake, this is an
O(N^2) or worse algorithm, because tsvector_concat() has to repeatedly
process the words coming from earlier elements.

We can fix it by accumulating all the lexeme data and applying
make_tsvector() just once.  As a side benefit, that also makes the
desired adjustment of lexeme positions far cheaper, because we can
just tweak the running "pos" counter between JSON elements.

In passing, try to make the explanation of that tweak more intelligible.
(I didn't think that a barely-readable comment far removed from the
actual code was helpful.)  And do some minor other code beautification.
2017-07-18 12:45:51 -04:00
Tom Lane c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
Tom Lane e3860ffa4d Initial pgindent run with pg_bsd_indent version 2.0.
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:

* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
  sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
  well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
  with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
  than the expected column 33.

On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list.  This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.

There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses.  I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 14:39:04 -04:00
Andrew Dunstan e306df7f9c Full Text Search support for json and jsonb
The new functions are ts_headline() and to_tsvector.

Dmitry Dolgov, edited and documented by me.
2017-03-31 14:26:03 -04:00
Heikki Linnakangas 181bdb90ba Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:33:58 +02:00
Peter Eisentraut 352a24a1f9 Generate fmgr prototypes automatically
Gen_fmgrtab.pl creates a new file fmgrprotos.h, which contains
prototypes for all functions registered in pg_proc.h.  This avoids
having to manually maintain these prototypes across a random variety of
header files.  It also automatically enforces a correct function
signature, and since there are warnings about missing prototypes, it
will detect functions that are defined but not registered in
pg_proc.h (or otherwise used).

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2017-01-17 14:06:07 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane 89fcea1ace Fix strange behavior (and possible crashes) in full text phrase search.
In an attempt to simplify the tsquery matching engine, the original
phrase search patch invented rewrite rules that would rearrange a
tsquery so that no AND/OR/NOT operator appeared below a PHRASE operator.
But this approach had numerous problems.  The rearrangement step was
missed by ts_rewrite (and perhaps other places), allowing tsqueries
to be created that would cause Assert failures or perhaps crashes at
execution, as reported by Andreas Seltenreich.  The rewrite rules
effectively defined semantics for operators underneath PHRASE that were
buggy, or at least unintuitive.  And because rewriting was done in
tsqueryin() rather than at execution, the rearrangement was user-visible,
which is not very desirable --- for example, it might cause unexpected
matches or failures to match in ts_rewrite.

As a somewhat independent problem, the behavior of nested PHRASE operators
was only sane for left-deep trees; queries like "x <-> (y <-> z)" did not
behave intuitively at all.

To fix, get rid of the rewrite logic altogether, and instead teach the
tsquery execution engine to manage AND/OR/NOT below a PHRASE operator
by explicitly computing the match location(s) and match widths for these
operators.

This requires introducing some additional fields into the publicly visible
ExecPhraseData struct; but since there's no way for third-party code to
pass such a struct to TS_phrase_execute, it shouldn't create an ABI problem
as long as we don't move the offsets of the existing fields.

Another related problem was that index searches supposed that "!x <-> y"
could be lossily approximated as "!x & y", which isn't correct because
the latter will reject, say, "x q y" which the query itself accepts.
This required some tweaking in TS_execute_ternary along with the main
tsquery engine.

Back-patch to 9.6 where phrase operators were introduced.  While this
could be argued to change behavior more than we'd like in a stable branch,
we have to do something about the crash hazards and index-vs-seqscan
inconsistency, and it doesn't seem desirable to let the unintuitive
behaviors induced by the rewriting implementation stand as precedent.

Discussion: https://postgr.es/m/28215.1481999808@sss.pgh.pa.us
Discussion: https://postgr.es/m/26706.1482087250@sss.pgh.pa.us
2016-12-21 15:18:39 -05:00
Tom Lane 23c75b55aa Improve documentation around TS_execute().
I got frustrated by the lack of commentary in this area, so here is some
reverse-engineered documentation, along with minor stylistic cleanup.
No code changes more significant than removal of unused variables.

Back-patch to 9.6, not because that's useful in itself, but because
we have some bugs to fix in phrase search and this would cause merge
failures if it's only in HEAD.
2016-12-16 11:50:32 -05:00
Tom Lane b5bce6c1ec Final pgindent + perltidy run for 9.6. 2016-08-15 13:42:51 -04:00
Teodor Sigaev 6734a1cacd Change predecence of phrase operator.
<-> operator now have higher predecence than & (AND) operator. This change
was motivated by unexpected difference of similar queries:
'a & b <-> c'::tsquery and 'b <-> c & a'. Before first query means
(a & b) <-> c and second one - '(b <-> c) & a', now phrase operator evaluates
first.

Per suggestion from Tom Lane 32260.1465402409@sss.pgh.pa.us
2016-06-27 20:55:24 +03:00
Teodor Sigaev 3dbbd0f02a Do not fallback to AND for FTS phrase operator.
If there is no positional information of lexemes then phrase operator will not
fallback to AND operator. This change makes needing to modify TS_execute()
interface, because somewhere (in indexes, for example) positional information
is unaccesible and in this cases we need to force fallback to AND.

Per discussion c19fcfec308e6ccd952cdde9e648b505@mail.gmail.com
2016-06-27 20:47:32 +03:00
Robert Haas 4bc424b968 pgindent run for 9.6 2016-06-09 18:02:36 -04:00
Teodor Sigaev 1ec4c7c055 Restore original tsquery operation numbering.
As noticed by Tom Lane changing operation's number in commit
bb140506df causes on-disk format incompatibility.
Revert to previous numbering, that is reason to add special array to store
priorities of operation. Also it reverts order of tsquery to previous.

Author: Dmitry Ivanov
2016-04-08 20:11:30 +03:00
Teodor Sigaev 4e55b3f033 Rename comparePos() to compareWordEntryPos()
Rename comparePos() to compareWordEntryPos() to prevent export of too
generic name.

Per gripe from Tom Lane.
2016-04-08 12:04:15 +03:00
Teodor Sigaev bb140506df Phrase full text search.
Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery.
On-disk and binary in/out format of tsquery are backward compatible.
It has two side effect:
- change order for tsquery, so, users, who has a btree index over tsquery,
  should reindex it
- less number of parenthesis in tsquery output, and tsquery becomes more
  readable

Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
2016-04-07 18:44:18 +03:00
Teodor Sigaev f4ceed6ceb Improve support of Hunspell
- allow to use non-ascii characters as affix flag. Non-numeric affix flags now
  are stored as string instead of numeric value of character.
- allow to use 0 as affix flag in numeric encoded affixes

That adds support for arabian, hungarian, turkish and
brazilian portuguese languages.

Author: Artur Zakirov with heavy editorization by me
2016-03-17 17:23:38 +03:00
Teodor Sigaev 6943a946c7 Tsvector editing functions
Adds several tsvector editting function: convert tsvector to/from text array,
set weight for given lexemes, delete lexeme(s), unnest, filter lexemes
with given weights

Author: Stas Kelvich with some editorization by me
Reviewers: Tomas Vondram, Teodor Sigaev
2016-03-11 19:22:36 +03:00
Teodor Sigaev d78a7d9c7f Improve support of Hunspell in ispell dictionary.
Now it's possible to load recent version of Hunspell for several languages.
To handle these dictionaries Hunspell patch adds support for:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65535)
* AF parameter - alias for flag's set

Also it moves test dictionaries into separate directory.

Author: Artur Zakirov with editorization by me
2016-03-04 20:08:47 +03:00
Tom Lane eb43e851d6 Create stub functions to support pg_upgrade of old contrib/tsearch2.
Commits 9ff60273e3 and dbe2328959 adjusted the declarations
of some core functions referenced by contrib/tsearch2's install script,
forgetting that in a pg_upgrade situation, we'll be trying to restore
operator class definitions that reference the old signatures.  We've
hit this problem before; solve it in the same way as before, namely by
installing stub functions that have the expected signature and just
invoke the correct function.  Per report from Jeff Janes.

(Someday we ought to stop supporting contrib/tsearch2, but I'm not
sure today is that day.)
2016-03-02 17:37:54 -05:00