Rework word_similarity documentation, make it close to actual algorithm.

The documentation previously claimed that word_similarity returns the
similarity of the closest word in the string, but it actually returns the
similarity of the closest substring. Also fix mistyped comments.

Author: Alexander Korotkov
Review by: David Steele, Liudmila Mantrova
Discussion:
https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com
https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru
Teodor Sigaev 2018-03-21 14:37:51 +03:00
parent eb63b72388
commit 4c7feb1611
2 changed files with 44 additions and 16 deletions
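
The behavior this commit documents can be illustrated with a small brute-force sketch. It is not pg_trgm's implementation: the function names are hypothetical, word splitting is assumed to be a plain ASCII alphanumeric match, and every continuous extent is tried instead of using the bound adjustments in iterate_word_similarity. It only mirrors the documented definition: the greatest similarity between the trigram set of the first string and any continuous extent of the ordered trigrams of the second, without padding the extent boundaries.

```python
import re


def trigrams(text):
    """Return the ordered trigrams of text. Each word is padded with two
    leading spaces and one trailing space (ASCII-only splitting is an
    assumption of this sketch)."""
    result = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def word_similarity_sketch(first, second):
    """Brute-force the best similarity between the trigram set of `first`
    and any continuous extent of the ordered trigrams of `second`."""
    set1 = set(trigrams(first))
    ordered2 = trigrams(second)
    best = 0.0
    for lower in range(len(ordered2)):
        extent = set()
        for upper in range(lower, len(ordered2)):
            extent.add(ordered2[upper])
            count = len(set1 & extent)
            # Shared trigrams divided by the size of the union.
            best = max(best, count / (len(set1) + len(extent) - count))
    return best


print(word_similarity_sketch("word", "two words"))  # prints 0.8
```

For the `word_similarity('word', 'two words')` example in the documentation change, the best extent shares four trigrams with the five trigrams of the first string, giving 4 / (5 + 4 - 4) = 0.8.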

@@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
lastpos[trgindex] = i;
}
/* Adjust lower bound if this trigram is present in required substring */
/* Adjust upper bound if this trigram is present in required substring */
if (found[trgindex])
{
int prev_lower,
@@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
smlr_cur = CALCSML(count, ulen1, ulen2);
/* Also try to adjust upper bound for greater similarity */
/* Also try to adjust lower bound for greater similarity */
tmp_count = count;
tmp_ulen2 = ulen2;
prev_lower = lower;

@@ -99,12 +99,10 @@
</entry>
<entry><type>real</type></entry>
<entry>
Returns a number that indicates how similar the first string
to the most similar word of the second string. The function searches in
the second string a most similar word not a most similar substring. The
range of the result is zero (indicating that the two strings are
completely dissimilar) to one (indicating that the first string is
identical to one of the words of the second string).
Returns a number that indicates the greatest similarity between
the set of trigrams in the first string and any continuous extent
of an ordered set of trigrams in the second string. For details, see
the explanation below.
</entry>
</row>
<row>
@@ -131,6 +129,34 @@
</tgroup>
</table>
<para>
Consider the following example:
<programlisting>
# SELECT word_similarity('word', 'two words');
word_similarity
-----------------
0.8
(1 row)
</programlisting>
In the first string, the set of trigrams is
<literal>{"  w"," wo","ord","wor","rd "}</literal>.
In the second string, the ordered set of trigrams is
<literal>{"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}</literal>.
The most similar extent of an ordered set of trigrams in the second string
is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
<literal>0.8</literal>.
</para>
<para>
This function returns a value that can be approximately understood as the
greatest similarity between the first string and any substring of the second
string. However, this function does not add padding to the boundaries of
the extent. Thus, a whole word match gets a higher score than a match with
a part of the word.
</para>
<table id="pgtrgm-op-table">
<title><filename>pg_trgm</filename> Operators</title>
<tgroup cols="3">
@@ -156,10 +182,11 @@
<entry><type>text</> <literal>&lt;%</literal> <type>text</></entry>
<entry><type>boolean</type></entry>
<entry>
Returns <literal>true</> if its first argument has the similar word in
the second argument and they have a similarity that is greater than the
current word similarity threshold set by
<varname>pg_trgm.word_similarity_threshold</> parameter.
Returns <literal>true</literal> if the similarity between the trigram
set in the first argument and a continuous extent of an ordered trigram
set in the second argument is greater than the current word similarity
threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
parameter.
</entry>
</row>
<row>
@@ -302,10 +329,11 @@ SELECT t, word_similarity('<replaceable>word</>', t) AS sml
WHERE '<replaceable>word</>' &lt;% t
ORDER BY sml DESC, t;
</programlisting>
This will return all values in the text column that have a word
which sufficiently similar to <replaceable>word</>, sorted from best
match to worst. The index will be used to make this a fast operation
even over very large data sets.
This will return all values in the text column for which there is a
continuous extent in the corresponding ordered trigram set that is
sufficiently similar to the trigram set of <replaceable>word</replaceable>,
sorted from best match to worst. The index will be used to make this
a fast operation even over very large data sets.
</para>
<para>