Rework word_similarity documentation to match the actual algorithm.

The documentation previously claimed that word_similarity returns the
similarity to the closest word in the string, but it actually returns the
similarity to a substring. Also fix mistyped comments.

Author: Alexander Korotkov
Review by: David Steele, Liudmila Mantrova
Discussion:
https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com
https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru
Teodor Sigaev 2018-03-21 14:37:51 +03:00
parent eb63b72388
commit 4c7feb1611
2 changed files with 44 additions and 16 deletions

@@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
 lastpos[trgindex] = i;
 }
-/* Adjust lower bound if this trigram is present in required substring */
+/* Adjust upper bound if this trigram is present in required substring */
 if (found[trgindex])
 {
 int prev_lower,
@@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
 smlr_cur = CALCSML(count, ulen1, ulen2);
-/* Also try to adjust upper bound for greater similarity */
+/* Also try to adjust lower bound for greater similarity */
 tmp_count = count;
 tmp_ulen2 = ulen2;
 prev_lower = lower;

@@ -99,12 +99,10 @@
 </entry>
 <entry><type>real</type></entry>
 <entry>
-Returns a number that indicates how similar the first string
-to the most similar word of the second string. The function searches in
-the second string a most similar word not a most similar substring. The
-range of the result is zero (indicating that the two strings are
-completely dissimilar) to one (indicating that the first string is
-identical to one of the words of the second string).
+Returns a number that indicates the greatest similarity between
+the set of trigrams in the first string and any continuous extent
+of an ordered set of trigrams in the second string. For details, see
+the explanation below.
 </entry>
 </row>
 <row>
@@ -131,6 +129,34 @@
 </tgroup>
 </table>
+<para>
+Consider the following example:
+<programlisting>
+# SELECT word_similarity('word', 'two words');
+word_similarity
+-----------------
+0.8
+(1 row)
+</programlisting>
+In the first string, the set of trigrams is
+<literal>{"  w"," wo","ord","wor","rd "}</literal>.
+In the second string, the ordered set of trigrams is
+<literal>{"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}</literal>.
+The most similar extent of an ordered set of trigrams in the second string
+is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
+<literal>0.8</literal>.
+</para>
+<para>
+This function returns a value that can be approximately understood as the
+greatest similarity between the first string and any substring of the second
+string. However, this function does not add padding to the boundaries of
+the extent. Thus, a whole word match gets a higher score than a match with
+a part of the word.
+</para>
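The 0.8 in the documentation example above can be checked by hand. What follows is a minimal Python sketch, not the C code in iterate_word_similarity. It assumes ASCII alphanumeric words, pg_trgm-style padding of two spaces before and one space after each word, and a similarity of shared / (n1 + n2 - shared) over distinct trigrams. It simply tries every continuous extent of the second string's ordered trigrams. The names trigrams and word_similarity_sketch are hypothetical and exist only for this illustration.

```python
import re


def trigrams(text):
    """Return the ordered trigram list of text.

    Assumption: words are runs of ASCII letters and digits, lowercased,
    each padded with two leading spaces and one trailing space.
    """
    result = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def word_similarity_sketch(first, second):
    """Brute-force the greatest similarity between the trigram set of
    first and any continuous extent of the ordered trigrams of second.

    Assumption: similarity = shared / (n1 + n2 - shared), counting
    distinct trigrams only.
    """
    target = set(trigrams(first))
    ordered = trigrams(second)
    best = 0.0
    for start in range(len(ordered)):
        extent = set()
        for end in range(start, len(ordered)):
            extent.add(ordered[end])  # grow the extent one trigram at a time
            shared = len(target & extent)
            union = len(target) + len(extent) - shared
            best = max(best, shared / union)
    return best


if __name__ == "__main__":
    print(word_similarity_sketch("word", "two words"))
```

For 'word' against 'two words', the best extent is {"  w"," wo","wor","ord"}: four shared trigrams out of a union of five, which gives 0.8.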
 <table id="pgtrgm-op-table">
 <title><filename>pg_trgm</filename> Operators</title>
 <tgroup cols="3">
@@ -156,10 +182,11 @@
 <entry><type>text</> <literal>&lt;%</literal> <type>text</></entry>
 <entry><type>boolean</type></entry>
 <entry>
-Returns <literal>true</> if its first argument has the similar word in
-the second argument and they have a similarity that is greater than the
-current word similarity threshold set by
-<varname>pg_trgm.word_similarity_threshold</> parameter.
+Returns <literal>true</literal> if the similarity between the trigram
+set in the first argument and a continuous extent of an ordered trigram
+set in the second argument is greater than the current word similarity
+threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
+parameter.
 </entry>
 </row>
 <row>
@@ -302,10 +329,11 @@ SELECT t, word_similarity('<replaceable>word</>', t) AS sml
 WHERE '<replaceable>word</>' &lt;% t
 ORDER BY sml DESC, t;
 </programlisting>
-This will return all values in the text column that have a word
-which sufficiently similar to <replaceable>word</>, sorted from best
-match to worst. The index will be used to make this a fast operation
-even over very large data sets.
+This will return all values in the text column for which there is a
+continuous extent in the corresponding ordered trigram set that is
+sufficiently similar to the trigram set of <replaceable>word</replaceable>,
+sorted from best match to worst. The index will be used to make this
+a fast operation even over very large data sets.
 </para>
 <para>