mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-07-17 05:11:09 +02:00
Rework word_similarity documentation, make it close to actual algorithm.
The documentation previously claimed that word_similarity returns the similarity of the closest word in a string, but it actually returns the similarity of a substring. Also fix mistyped comments.

Author: Alexander Korotkov
Review by: David Steele, Liudmila Mantrova
Discussion: https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com
https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru
parent eb63b72388
commit 4c7feb1611
@@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
 lastpos[trgindex] = i;
 }
 
-/* Adjust lower bound if this trigram is present in required substring */
+/* Adjust upper bound if this trigram is present in required substring */
 if (found[trgindex])
 {
 int prev_lower,
@@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
 
 smlr_cur = CALCSML(count, ulen1, ulen2);
 
-/* Also try to adjust upper bound for greater similarity */
+/* Also try to adjust lower bound for greater similarity */
 tmp_count = count;
 tmp_ulen2 = ulen2;
 prev_lower = lower;
@@ -99,12 +99,10 @@
 </entry>
 <entry><type>real</type></entry>
 <entry>
-Returns a number that indicates how similar the first string
-to the most similar word of the second string. The function searches in
-the second string a most similar word not a most similar substring. The
-range of the result is zero (indicating that the two strings are
-completely dissimilar) to one (indicating that the first string is
-identical to one of the words of the second string).
+Returns a number that indicates the greatest similarity between
+the set of trigrams in the first string and any continuous extent
+of an ordered set of trigrams in the second string. For details, see
+the explanation below.
 </entry>
 </row>
 <row>
@@ -131,6 +129,34 @@
 </tgroup>
 </table>
 
+<para>
+Consider the following example:
+
+<programlisting>
+# SELECT word_similarity('word', 'two words');
+ word_similarity
+-----------------
+             0.8
+(1 row)
+</programlisting>
+
+In the first string, the set of trigrams is
+<literal>{"  w"," wo","ord","wor","rd "}</literal>.
+In the second string, the ordered set of trigrams is
+<literal>{"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds "}</literal>.
+The most similar extent of an ordered set of trigrams in the second string
+is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
+<literal>0.8</literal>.
+</para>
+
+<para>
+This function returns a value that can be approximately understood as the
+greatest similarity between the first string and any substring of the second
+string. However, this function does not add padding to the boundaries of
+the extent. Thus, a whole word match gets a higher score than a match with
+a part of the word.
+</para>
+
 <table id="pgtrgm-op-table">
 <title><filename>pg_trgm</filename> Operators</title>
 <tgroup cols="3">
@@ -156,10 +182,11 @@
 <entry><type>text</> <literal><%</literal> <type>text</></entry>
 <entry><type>boolean</type></entry>
 <entry>
-Returns <literal>true</> if its first argument has the similar word in
-the second argument and they have a similarity that is greater than the
-current word similarity threshold set by
-<varname>pg_trgm.word_similarity_threshold</> parameter.
+Returns <literal>true</literal> if the similarity between the trigram
+set in the first argument and a continuous extent of an ordered trigram
+set in the second argument is greater than the current word similarity
+threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
+parameter.
 </entry>
 </row>
 <row>
@@ -302,10 +329,11 @@ SELECT t, word_similarity('<replaceable>word</>', t) AS sml
 WHERE '<replaceable>word</>' <% t
 ORDER BY sml DESC, t;
 </programlisting>
-This will return all values in the text column that have a word
-which sufficiently similar to <replaceable>word</>, sorted from best
-match to worst. The index will be used to make this a fast operation
-even over very large data sets.
+This will return all values in the text column for which there is a
+continuous extent in the corresponding ordered trigram set that is
+sufficiently similar to the trigram set of <replaceable>word</replaceable>,
+sorted from best match to worst. The index will be used to make this
+a fast operation even over very large data sets.
 </para>
 
 <para>
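The definition introduced by this commit can be checked with a small brute-force sketch. This is not the code in iterate_word_similarity, which scans the trigram array once and adjusts a lower and an upper bound; it is only an illustration of the documented definition. The names trigrams and word_similarity_sketch are hypothetical. The sketch assumes that each word is lowercased and padded with two leading spaces and one trailing space before trigrams are taken, and that the similarity formula is count / (ulen1 + ulen2 - count), which is what CALCSML(count, ulen1, ulen2) is assumed to compute.

```python
import re


def trigrams(text):
    """Return the ordered list of trigrams of `text`.

    Assumption: words are runs of letters and digits, lowercased, and each
    word is padded with two leading spaces and one trailing space.
    """
    result = []
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def word_similarity_sketch(first, second):
    """Greatest similarity between the trigram set of `first` and the
    trigram set of any continuous extent of the trigrams of `second`.

    Brute force over every extent; hypothetical helper, not the real
    single-pass algorithm.
    """
    set1 = set(trigrams(first))
    ordered2 = trigrams(second)
    best = 0.0
    for lower in range(len(ordered2)):
        extent = set()
        for upper in range(lower, len(ordered2)):
            extent.add(ordered2[upper])
            count = len(set1 & extent)
            # Assumed formula: count / (ulen1 + ulen2 - count)
            best = max(best, count / (len(set1) + len(extent) - count))
    return best


print(trigrams("word"))
print(word_similarity_sketch("word", "two words"))
print(word_similarity_sketch("word", "two word"))
```

For 'word' against 'two words' the best extent is the four trigrams "  w", " wo", "wor", "ord", giving 4 / (5 + 4 - 4) = 0.8, the value shown in the documentation example. Against 'two word' the trailing-space trigram "rd " also matches, so the whole-word match scores higher than the partial one, as the new paragraph on padding describes.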