The tsearch2 Reference
Brandon Craig Rhodes
30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
Massive update for 8.2 release by Oleg Bartunov, October 2006
This Reference documents the user types and functions
of the tsearch2 module for PostgreSQL.
An introduction to the module is provided
by the tsearch2 Guide,
a companion document to this one.
Table of Contents
Vectors and Queries
Vector Operations
Query Operations
Full Text Search Operator
Configurations
Testing
Parsers
Dictionaries
Ranking
Headlines
Indexes
Thesaurus dictionary
Vectors and queries both store lexemes,
but for different purposes.
A tsvector stores the lexemes
of the words that are parsed out of a document,
and can also remember the position of each word.
A tsquery specifies a boolean condition among lexemes.
Any of the following functions with a configuration argument
can use either an integer id or textual ts_name
to select a configuration;
if the option is omitted, then the current configuration is used.
For more information on the current configuration,
read the next section on Configurations.
-
to_tsvector( [configuration,]
document TEXT) RETURNS TSVECTOR
-
Parses a document into tokens,
reduces the tokens to lexemes,
and returns a tsvector which lists the lexemes
together with their positions in the document.
For the best description of this process,
see the section on Parsing and Stemming
in the accompanying tsearch2 Guide.
-
strip(vector TSVECTOR) RETURNS TSVECTOR
-
Return a vector which lists the same lexemes
as the given vector,
but which lacks any information
about where in the document each lexeme appeared.
While the returned vector is thus useless for relevance ranking,
it will usually be much smaller.
-
setweight(vector TSVECTOR, letter) RETURNS TSVECTOR
-
This function returns a copy of the input vector
in which every location has been labeled
with either the letter
'A', 'B', or 'C',
or the default label 'D'
(which is the default with which new vectors are created,
and as such is usually not displayed).
These labels are retained when vectors are concatenated,
allowing words from different parts of a document
to be weighted differently by ranking functions.
-
vector1 || vector2
concat(vector1 TSVECTOR, vector2 TSVECTOR)
RETURNS TSVECTOR
-
Returns a vector which combines the lexemes and position information
in the two vectors given as arguments.
Position weight labels (described in the previous paragraph)
are retained intact during the concatenation.
This has at least two uses.
First,
if some sections of your document
need to be parsed with different configurations than others,
you can parse them separately
and concatenate the resulting vectors into one.
Second,
you can weight words from some sections of your document
more heavily than those from others by:
parsing the sections into separate vectors;
assigning the vectors different position labels
with the setweight() function;
concatenating them into a single vector;
and then providing a weights argument
to the rank() function
that assigns different weights to positions with different labels.
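The label-then-concatenate recipe above can be sketched with a toy model in Python. This is not the module's implementation: the dictionary-based vector, the helper names, and the choice to shift the second vector's positions past the end of the first are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy tsvector: lexeme -> list of (position, weight label).
Vector = Dict[str, List[Tuple[int, str]]]


def make_vector(lexemes: List[str]) -> Vector:
    """Build a toy vector; positions start at 1 and every label defaults to 'D'."""
    vec: Vector = {}
    for pos, lex in enumerate(lexemes, start=1):
        vec.setdefault(lex, []).append((pos, "D"))
    return vec


def setweight(vec: Vector, letter: str) -> Vector:
    """Return a copy in which every position carries the given label."""
    return {lex: [(pos, letter) for pos, _ in entries] for lex, entries in vec.items()}


def concat(v1: Vector, v2: Vector) -> Vector:
    """Merge two vectors and keep their labels.

    Assumption: positions from v2 are shifted past the last position of v1.
    """
    offset = max((pos for entries in v1.values() for pos, _ in entries), default=0)
    out: Vector = {lex: list(entries) for lex, entries in v1.items()}
    for lex, entries in v2.items():
        out.setdefault(lex, []).extend((pos + offset, w) for pos, w in entries)
    return out


# Words from the title get label 'A'; words from the body keep the default 'D'.
title = setweight(make_vector(["fat", "cat"]), "A")
body = make_vector(["cat", "sat", "mat"])
doc = concat(title, body)
print(doc)
```

A ranking function that receives such a vector can then treat the 'A' occurrence of "cat" differently from the 'D' occurrence.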
-
length(vector TSVECTOR) RETURNS INT4
-
Returns the number of lexemes stored in the vector.
-
text::TSVECTOR RETURNS TSVECTOR
-
Directly casting text to a tsvector
allows you to directly inject lexemes into a vector,
with whatever positions and position weights you choose to specify.
The text should be formatted
like the vector would be printed by the output of a SELECT.
See the Casting
section in the Guide for details.
-
tsearch2(vector_column_name[, (my_filter_name | text_column_name1) [...] ], text_column_nameN)
-
The tsearch2() trigger automatically updates vector_column_name.
my_filter_name is the name of a function used to preprocess text_column_name.
Any number of functions and text columns can be specified in the tsearch2() trigger.
The following rule is used:
a function is applied to all subsequent text columns until the next function occurs.
For example, the function dropatsymbol replaces every occurrence of the @
sign with a space.
CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT
ON tblMessages FOR EACH ROW EXECUTE PROCEDURE
tsearch2(tsvector_column,dropatsymbol, strMessage);
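The rule that a filter function applies to every following text column until the next function appears can be sketched in Python. This is only a model of the rule as stated above: the FILTERS registry, the plan_filters helper, and the extra column names are hypothetical, and how the real trigger tells a function name from a column name is not modelled.

```python
from typing import Callable, Dict, List, Optional, Tuple


def drop_at_symbol(text: str) -> str:
    """Python counterpart of the dropatsymbol SQL function above."""
    return text.replace("@", " ")


# Hypothetical registry of known filter functions.
FILTERS: Dict[str, Callable[[str], str]] = {"dropatsymbol": drop_at_symbol}


def plan_filters(args: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each text column with the filter in effect when it is reached.

    `args` are the trigger arguments after the vector column name.
    """
    current: Optional[str] = None
    plan: List[Tuple[str, Optional[str]]] = []
    for name in args:
        if name in FILTERS:
            current = name  # applies to all following columns
        else:
            plan.append((name, current))
    return plan


# 'strTitle' and 'strSubject' are made-up columns used only for this sketch.
plan = plan_filters(["strTitle", "dropatsymbol", "strMessage", "strSubject"])
print(plan)
```

In this sketch strTitle is indexed unfiltered, while strMessage and strSubject both pass through dropatsymbol.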
-
stat(sqlquery text [, weight text]) RETURNS SETOF statinfo
-
Here statinfo is a type, defined as
CREATE TYPE statinfo as (word text, ndoc int4, nentry int4)
and sqlquery is a query that returns a tsvector column.
This returns statistics about that tsvector column: for each word,
the number of documents ndoc in which it occurs and the total number nentry of its occurrences in the collection.
It is useful for checking how good your configuration is and
for finding stop-word candidates. For example, to find the 10 most frequent words:
=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
Optionally, one can specify a weight to obtain statistics only about words with that weight.
=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
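What ndoc and nentry count can be shown with a small Python model. It is a sketch only: each document is represented as a plain mapping from word to positions, the sample data is invented, and the sort mirrors the ORDER BY clause used in the queries above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def stat(vectors: List[Dict[str, List[int]]]) -> List[Tuple[str, int, int]]:
    """Return (word, ndoc, nentry) rows for a collection of toy vectors."""
    ndoc: Counter = Counter()
    nentry: Counter = Counter()
    for vec in vectors:
        for word, positions in vec.items():
            ndoc[word] += 1                 # one more document contains the word
            nentry[word] += len(positions)  # every occurrence counts
    rows = [(word, ndoc[word], nentry[word]) for word in ndoc]
    # Same ordering as: order by ndoc desc, nentry desc, word
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows


# Invented sample collection of three documents.
collection = [
    {"star": [1, 5], "galaxi": [2]},
    {"star": [3]},
    {"moon": [1, 2, 3]},
]
rows = stat(collection)
for row in rows:
    print(row)
```

A word with a high ndoc relative to the size of the collection is a natural stop-word candidate.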
-
TSVECTOR < TSVECTOR
TSVECTOR <= TSVECTOR
TSVECTOR = TSVECTOR
TSVECTOR >= TSVECTOR
TSVECTOR > TSVECTOR
-
All btree operations are defined for the tsvector type. tsvectors are compared
with each other using lexicographical order.
-
to_tsquery( [configuration,]
querytext text) RETURNS TSQUERY
-
Parses a query,
which should be single words separated by the boolean operators
"&" and,
"|" or,
and "!" not,
which can be grouped using parentheses.
Each word is reduced to a lexeme using the current
or specified configuration.
A weight class can be assigned to each lexeme entry
to restrict the search region
(see setweight for explanation), for example
"fat:a & rats".
-
-
plainto_tsquery( [configuration,]
querytext text) RETURNS TSQUERY
-
Transforms unformatted text to a tsquery. It is the same as to_tsquery,
but assumes the "&" boolean operator between words and does not
recognize weight classes.
-
querytree(query TSQUERY) RETURNS text
-
Returns the query that is actually used when searching a GiST index.
-
text::TSQUERY RETURNS TSQUERY
-
Directly casting text to a tsquery
allows you to directly inject lexemes into a query,
with whatever positions and position weight flags you choose to specify.
The text should be formatted
like the query would be printed by the output of a SELECT.
See the Casting
section in the Guide for details.
-
numnode(query TSQUERY) RETURNS INTEGER
-
Returns the number of nodes in the query tree.
-
TSQUERY && TSQUERY RETURNS TSQUERY
-
AND-ed TSQUERY
-
TSQUERY || TSQUERY RETURNS TSQUERY
-
OR-ed TSQUERY
-
!! TSQUERY RETURNS TSQUERY
-
negation of TSQUERY
-
TSQUERY < TSQUERY
TSQUERY <= TSQUERY
TSQUERY = TSQUERY
TSQUERY >= TSQUERY
TSQUERY > TSQUERY
-
All btree operations are defined for the tsquery type. tsqueries are compared
with each other using lexicographical order.
Query rewriting
Query rewriting is a set of functions and operators for the tsquery type.
It allows you to control a search at query time without reindexing (unlike the thesaurus), for example,
to expand a search using synonyms (new york, big apple, nyc, gotham).
The rewrite() function changes the original query by replacing target with sample.
There are three ways to use the rewrite() function. Note that the arguments of rewrite()
can be column names of type tsquery.
create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
insert into rw values('a & b','a', 'c');
- rewrite (query TSQUERY, target TSQUERY, sample TSQUERY) RETURNS TSQUERY
-
=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
rewrite
-----------
'c' & 'b'
- rewrite (ARRAY[query TSQUERY, target TSQUERY, sample TSQUERY]) RETURNS TSQUERY
-
=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
rewrite
-----------
'c' & 'b'
- rewrite (query TSQUERY,'select target ,sample from test'::text) RETURNS TSQUERY
-
=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
rewrite
-----------
'c' & 'b'
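The effect of replacing target with sample can be sketched on a toy query tree in Python. This is a deliberately simplified model, not the module's algorithm: queries are nested tuples, a target matches only a structurally identical subtree, and the helper names are made up for the sketch.

```python
from typing import Iterable, Tuple, Union

# Toy query: a lexeme string, or a tuple ("&" | "|" | "!", operand, ...).
Query = Union[str, tuple]


def rewrite(query: Query, target: Query, sample: Query) -> Query:
    """Replace every subtree equal to `target` with `sample`."""
    if query == target:
        return sample
    if isinstance(query, tuple):
        op, *operands = query
        return (op, *(rewrite(o, target, sample) for o in operands))
    return query


def rewrite_from_table(query: Query, rules: Iterable[Tuple[Query, Query]]) -> Query:
    """Apply (target, sample) pairs in turn, like rows selected from a rules table."""
    for target, sample in rules:
        query = rewrite(query, target, sample)
    return query


# Mirrors the examples above: 'a & b' with target 'a' and sample 'c'.
print(rewrite(("&", "a", "b"), "a", "c"))
# Synonym expansion: the sample may itself be a compound query.
print(rewrite(("&", "gotham", "hotel"), "gotham", ("|", "gotham", "nyc")))
```

Keeping the rules in a table, as in the third form of rewrite(), means synonyms can be changed at query time without touching the indexed vectors.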
Two operators are defined for the tsquery type:
- TSQUERY @ TSQUERY
-
Returns TRUE if the right argument might be contained in the left argument.
- TSQUERY ~ TSQUERY
-
Returns TRUE if the left argument might be contained in the right argument.
To speed up these operators one can use a GiST index with the gist_tp_tsquery_ops opclass.
create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
-
TSQUERY @@ TSVECTOR
TSVECTOR @@ TSQUERY
-
Returns TRUE if the TSQUERY is contained in the TSVECTOR, that is,
if the vector satisfies the boolean condition of the query, and FALSE otherwise.
=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
?column?
----------
t
=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
?column?
----------
f
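How a boolean query is checked against the lexemes of a vector can be sketched with a small evaluator in Python. It is an illustration under stated assumptions, not the module's matcher: the vector is reduced to a set of lexemes, weights and positions are ignored, and "!" is assumed to bind tighter than "&", which in turn binds tighter than "|".

```python
import re
from typing import List, Optional, Set


def tokenize(query: str) -> List[str]:
    """Split a query into operators, parentheses, and words."""
    return re.findall(r"[&|!()]|[^\s&|!()]+", query)


def matches(query: str, lexemes: Set[str]) -> bool:
    """Evaluate a boolean query against a set of lexemes (recursive descent)."""
    tokens = tokenize(query)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or() -> bool:
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side to consume its tokens
            value = value or rhs
        return value

    def parse_and() -> bool:
        value = parse_unary()
        while peek() == "&":
            take()
            rhs = parse_unary()
            value = value and rhs
        return value

    def parse_unary() -> bool:
        tok = peek()
        if tok == "!":
            take()
            return not parse_unary()
        if tok == "(":
            take()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        if tok is None or tok in ("&", "|", ")"):
            raise ValueError("expected a word")
        return take() in lexemes

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


# Lexemes of the vector used in the examples above.
doc = set("a fat cat sat on a mat and ate a fat rat".split())
print(matches("cat & rat", doc))
print(matches("fat & cow", doc))
```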
A configuration specifies all of the equipment necessary
to transform a document into a tsvector:
the parser that breaks its text into tokens,
and the dictionaries which then transform each token into a lexeme.
Every call to to_tsvector() or to_tsquery() (described above)
uses a configuration to perform its processing.
The following configurations come with tsearch2:
- default -- Indexes words and numbers,
using the en_stem English Snowball stemmer for Latin-alphabet words
and the simple dictionary for all others.
- default_russian -- Indexes words and numbers,
using the en_stem English Snowball stemmer for Latin-alphabet words
and the ru_stem Russian Snowball dictionary for all others. It is the default
for the ru_RU.KOI8-R locale.
- utf8_russian -- the same as default_russian but
for ru_RU.UTF-8 locale.
- simple -- Processes both words and numbers
with the simple dictionary,
which neither discards any stop words nor alters them.
The tsearch2 module initially chooses your current configuration
by looking for your current locale in the locale field
of the pg_ts_cfg table described below.
You can manipulate the current configuration yourself with these functions:
-
set_curcfg( id INT | ts_name TEXT
) RETURNS VOID
-
Set the current configuration used by to_tsvector
and to_tsquery.
-
show_curcfg() RETURNS INT4
-
Returns the integer id of the current configuration.
Each configuration is defined by a record in the pg_ts_cfg table:
create table pg_ts_cfg (
id int not null primary key,
ts_name text not null,
prs_name text not null,
locale text
);
The id and ts_name are unique values
which identify the configuration;
the prs_name specifies which parser the configuration uses.
Once this parser has split document text into tokens,
the type of each resulting token --
or, more specifically, the type's tok_alias
as specified in the parser's lexem_type() table --
is searched for together with the configuration's ts_name
in the pg_ts_cfgmap table:
create table pg_ts_cfgmap (
ts_name text not null,
tok_alias text not null,
dict_name text[],
primary key (ts_name,tok_alias)
);
Those tokens whose types are not listed are discarded.
The remaining tokens are assigned integer positions,
starting with 1 for the first token in the document,
and turned into lexemes with the help of the dictionaries
whose names are given in the dict_name array for their type.
These dictionaries are tried in order,
stopping either with the first one to return a lexeme for the token,
or discarding the token if no dictionary returns a lexeme for it.
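The lookup described above (drop unlisted token types, number the rest from 1, then try each dictionary in order) can be sketched in Python. Everything here is a stand-in: the toy_stem and simple functions, the stop-word set, and the CFGMAP contents are hypothetical, and no real stemming is performed.

```python
from typing import Callable, Dict, List, Optional, Tuple

STOP_WORDS = {"for", "the", "a"}  # hypothetical stop-word list


def toy_stem(token: str) -> Optional[List[str]]:
    """Hypothetical stand-in for a stemming dictionary (does not really stem)."""
    word = token.lower()
    if word in STOP_WORDS:
        return []    # recognized stop word: nothing is indexed
    if not word.isalpha():
        return None  # not recognized: the next dictionary gets a try
    return [word]


def simple(token: str) -> Optional[List[str]]:
    """Folds to lowercase and accepts everything."""
    return [token.lower()]


DICTIONARIES: Dict[str, Callable[[str], Optional[List[str]]]] = {
    "toy_stem": toy_stem,
    "simple": simple,
}

# Plays the role of pg_ts_cfgmap for one configuration: tok_alias -> dict_name[].
CFGMAP: Dict[str, List[str]] = {"lword": ["toy_stem", "simple"], "version": ["simple"]}


def to_vector(tokens: List[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Turn (tok_alias, token) pairs into a lexeme -> positions mapping."""
    vector: Dict[str, List[int]] = {}
    position = 0
    for tok_alias, token in tokens:
        dict_names = CFGMAP.get(tok_alias)
        if dict_names is None:
            continue  # token type not listed: discarded, gets no position
        position += 1
        for name in dict_names:
            lexemes = DICTIONARIES[name](token)
            if lexemes is not None:  # first dictionary that recognizes the token wins
                for lex in lexemes:
                    vector.setdefault(lex, []).append(position)
                break
    return vector


tokens = [
    ("lword", "Tsearch"), ("blank", " "), ("lword", "module"), ("blank", " "),
    ("lword", "for"), ("blank", " "), ("lword", "PostgreSQL"), ("blank", " "),
    ("version", "7.3.3"),
]
print(to_vector(tokens))
```

In this sketch the stop word "for" still consumes position 3, so "postgresql" lands at position 4, while the blank tokens never receive a position at all.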
The ts_debug function allows easy testing of your current configuration.
You can always test another configuration by selecting it with the set_curcfg function.
Example:
apod=# select * from ts_debug('Tsearch module for PostgreSQL 7.3.3');
 ts_name | tok_type | description |   token    | dict_name |   tsvector
---------+----------+-------------+------------+-----------+--------------
 default | lword    | Latin word  | Tsearch    | {en_stem} | 'tsearch'
 default | lword    | Latin word  | module     | {en_stem} | 'modul'
 default | lword    | Latin word  | for        | {en_stem} |
 default | lword    | Latin word  | PostgreSQL | {en_stem} | 'postgresql'
 default | version  | VERSION     | 7.3.3      | {simple}  | '7.3.3'
Here:
- ts_name - configuration name
- tok_type - token type
- description - human readable name of tok_type
- token - parser's token
- dict_name - dictionary used for the token
- tsvector - final result
Each parser is defined by a record in the pg_ts_parser table:
create table pg_ts_parser (
prs_name text not null,
prs_start regprocedure not null,
prs_nexttoken regprocedure not null,
prs_end regprocedure not null,
prs_headline regprocedure not null,
prs_lextype regprocedure not null,
prs_comment text
);
The prs_name uniquely identifies the parser,
while prs_comment usually describes its name and version
for the reference of users.
The other items identify the low-level functions
which make the parser operate,
and are only of interest to someone writing a parser of their own.
The tsearch2 module comes with one parser named default
which is suitable for parsing most plain text and HTML documents.
Each parser argument below
must designate a parser with prs_name;
the current parser is used when this argument is omitted.
-
CREATE FUNCTION set_curprs(parser) RETURNS VOID
-
Selects a current parser
which will be used when any of the following functions
are called without a parser as an argument.
-
CREATE FUNCTION token_type(
[ parser ]
) RETURNS SETOF tokentype
-
Returns a table which defines and describes
each kind of token the parser may produce as output.
For each token type the table gives the tokid
with which the parser will label each token of that type,
the alias which names the token type,
and a short description descr for the user to read.
-
CREATE FUNCTION parse(
[ parser, ] document TEXT
) RETURNS SETOF tokenout
-
Parses the given document and returns a series of records,
one for each token produced by parsing.
Each token includes a tokid giving its type
and a lexem which gives its content.
A dictionary is a program that accepts lexeme(s), usually those produced by a parser,
as input and returns:
- an array of lexeme(s) if the input lexeme is known to the dictionary
- an empty array if the dictionary knows the lexeme, but it is a stop word
- NULL if the dictionary does not recognize the input lexeme
Usually, dictionaries are used for normalization of words (ispell and stemmer dictionaries),
but see, for example, the intdict dictionary (available from the
Tsearch2 home page),
which controls indexing of integers.
Among the dictionaries which come installed with tsearch2 are:
- simple simply folds uppercase letters to lowercase
before returning the word.
- ispell_template - template for ispell dictionaries.
- en_stem runs an English Snowball stemmer on each word
that attempts to reduce the various forms of a verb or noun
to a single recognizable form.
- ru_stem_koi8, ru_stem_utf8 run a Russian Snowball stemmer on each word.
- synonym - simple lexeme-to-lexeme replacement
- thesaurus_template - template for the thesaurus dictionary, which performs
phrase-to-phrase replacement
Each dictionary is defined by an entry in the pg_ts_dict table:
CREATE TABLE pg_ts_dict (
dict_name text not null,
dict_init regprocedure,
dict_initoption text,
dict_lexize regprocedure not null,
dict_comment text
);
The dict_name
serves as the unique identifier for the dictionary.
The meaning of the dict_initoption varies among dictionaries,
but for the built-in Snowball dictionaries
it specifies a file from which stop words should be read.
The dict_comment is a human-readable description of the dictionary.
The other fields are internal function identifiers
useful only to developers trying to implement their own dictionaries.
WARNING: Data files used by dictionaries should be in server_encoding to
avoid possible problems!
The argument named dictionary
in each of the following functions
should be dict_name
identifying which dictionary should be used for the operation;
if omitted then the current dictionary is used.
-
CREATE FUNCTION set_curdict(dictionary) RETURNS VOID
-
Selects a current dictionary for use by functions
that do not select a dictionary explicitly.
-
CREATE FUNCTION lexize(
[ dictionary, ] word text)
RETURNS TEXT[]
-
Reduces a single word to a lexeme.
Note that lexemes are arrays of zero or more strings,
since in some languages there might be several base words
from which an inflected form could arise.
Using dictionary templates
Templates are used to define new dictionaries, for example:
INSERT INTO pg_ts_dict
(SELECT 'en_ispell', dict_init,
'DictFile="/usr/local/share/dicts/ispell/english.dict",'
'AffFile="/usr/local/share/dicts/ispell/english.aff",'
'StopFile="/usr/local/share/dicts/english.stop"',
dict_lexize
FROM pg_ts_dict
WHERE dict_name = 'ispell_template');
Working with stop words
Ispell and snowball stemmers treat stop words differently:
- ispell - normalizes the word and then looks up the normalized form in the stop-word file
- snowball stemmer - first looks up the word in the stop-word file and then does its job.
The reason is to minimize possible 'noise'.
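The difference in ordering can be seen in a small Python sketch. Both functions below are toy stand-ins that share one made-up normalizer and one made-up stop-word list; they only illustrate the order of the two steps, not how ispell or Snowball actually normalize words.

```python
from typing import List

STOP_WORDS = {"be"}  # hypothetical stop-word file contents


def toy_normalize(word: str) -> str:
    """Hypothetical normalizer mapping a few inflected forms to a base form."""
    return {"is": "be", "was": "be", "cats": "cat"}.get(word, word)


def ispell_style(word: str) -> List[str]:
    """Normalize first, then look up the normalized form in the stop-word list."""
    base = toy_normalize(word)
    return [] if base in STOP_WORDS else [base]


def snowball_style(word: str) -> List[str]:
    """Look up the raw word in the stop-word list first, then normalize."""
    if word in STOP_WORDS:
        return []
    return [toy_normalize(word)]


for word in ("was", "be", "cats"):
    print(word, ispell_style(word), snowball_style(word))
```

With these toy rules "was" is dropped by the ispell-style order, because its normalized form is a stop word, but survives the snowball-style order, because only the raw form is checked.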
Ranking attempts to measure how relevant documents are to particular queries
by inspecting the number of times each search word appears in the document,
and whether different search terms occur near each other.
Note that this information is only available in unstripped vectors --
ranking functions will only return a useful result
for a tsvector which still has position information!
Note that the supplied ranking functions are just examples and
do not belong to the tsearch2 core; you can
write your own ranking function and/or combine additional
factors to fit your specific needs.
The two ranking functions currently available are:
-
CREATE FUNCTION rank(
[ weights float4[], ]
vector TSVECTOR, query TSQUERY,
[ normalization int4 ]
) RETURNS float4
-
This is the ranking function from the old version of OpenFTS,
and offers the ability to weight word instances more heavily
depending on how you have classified them.
The weights specify how heavily to weight each category of word:
{D-weight, C-weight, B-weight, A-weight}
If no weights are provided, then these defaults are used:
{0.1, 0.2, 0.4, 1.0}
Often weights are used to mark words from special areas of the document,
like the title or an initial abstract,
and make them more or less important than words in the document body.
-
CREATE FUNCTION rank_cd(
[ weights float4[], ]
vector TSVECTOR, query TSQUERY,
[ normalization int4 ]
) RETURNS float4
-
This function computes the cover density ranking
for the given document vector and query,
as described in Clarke, Cormack, and Tudhope's
"Relevance Ranking for One to Three Term Queries"
in the 1999 Information Processing and Management.
-
CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text
-
Returns extents, which are the shortest non-nested sequences of words that satisfy a query.
Extents (covers) are used in the rank_cd algorithm for fast calculation of proximity ranking.
In the example below there are two extents: {1...}1 and {2 ...}2.
=# select get_covers('1:1,2,10 2:4'::tsvector,'1& 2');
get_covers
----------------------
1 {1 1 {2 2 }1 1 }2
Both of these (rank(), rank_cd()) ranking functions
take an integer normalization option
that specifies whether a document's length should impact its rank.
This is often desirable,
since a hundred-word document with five instances of a search word
is probably more relevant than a thousand-word document with five instances.
The option can have the following values, which can be combined using "|" (for example, 2|4) to
take several factors into account:
- 0 (the default) ignores document length.
- 1 divides the rank by 1 + the logarithm of the length
- 2 divides the rank by the length itself.
- 4 divides the rank by the mean harmonic distance between extents
- 8 divides the rank by the number of unique words in document
- 16 divides the rank by 1 + the logarithm of the number of unique words in document
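How the option bits combine can be sketched in Python. This models only the list above, not the code inside rank() or rank_cd(): the function and parameter names are made up, the natural logarithm is an assumption (the base is not stated here), and the mean harmonic distance between extents is passed in rather than computed.

```python
import math


def normalize(rank: float, option: int, length: int, unique_words: int,
              mean_harmonic_distance: float = 1.0) -> float:
    """Apply the divisors selected by the bits set in `option`."""
    if option & 1:
        rank /= 1 + math.log(length)        # assumption: natural logarithm
    if option & 2:
        rank /= length
    if option & 4:
        rank /= mean_harmonic_distance      # supplied by the caller in this sketch
    if option & 8:
        rank /= unique_words
    if option & 16:
        rank /= 1 + math.log(unique_words)  # assumption: natural logarithm
    return rank


# Same raw rank, but the shorter document wins once length is taken into account.
raw = 0.5
short_doc = normalize(raw, 2, length=100, unique_words=60)
long_doc = normalize(raw, 2, length=1000, unique_words=400)
print(short_doc, long_doc)

# Combining factors with "|": divide by the length and by the unique word count.
print(normalize(1.0, 2 | 8, length=100, unique_words=50))
```

Because every selected factor is a division, the order in which the bits are tested does not change the result.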
-
CREATE FUNCTION headline(
[ id int4, | ts_name text, ]
document text, query TSQUERY,
[ options text ]
) RETURNS text
-
Every form of the headline() function
accepts a document along with a query,
and returns one or more ellipsis-separated excerpts from the document
in which terms from the query are highlighted.
The configuration with which to parse the document
can be specified by either its id or ts_name;
if none is specified then the current configuration is used instead.
An options string if provided should be a comma-separated list
of one or more 'option=value' pairs.
The available options are:
- StartSel, StopSel --
the strings with which query words appearing in the document
should be delimited to distinguish them from other excerpted words.
- MaxWords, MinWords --
limits on the longest and shortest headlines you will accept.
- ShortWord --
this prevents your headline from beginning or ending
with a word which has this many characters or less.
The default value of 3 should eliminate most English
conjunctions and articles.
- HighlightAll --
boolean flag; if TRUE, then the whole document will be highlighted.
Any unspecified options receive these defaults:
StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
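The format of the options string and the way unspecified options fall back to their defaults can be sketched in Python. The parse_options helper is hypothetical and only models the comma-separated 'option=value' format described above; quoting rules and the exact spellings accepted for boolean values are assumptions.

```python
from typing import Dict, Union

OptionValue = Union[str, int, bool]

# Defaults as listed above.
DEFAULTS: Dict[str, OptionValue] = {
    "StartSel": "<b>", "StopSel": "</b>",
    "MaxWords": 35, "MinWords": 15, "ShortWord": 3,
    "HighlightAll": False,
}


def parse_options(options: str = "") -> Dict[str, OptionValue]:
    """Parse 'name=value, name=value' pairs on top of the defaults."""
    result = dict(DEFAULTS)
    for pair in options.split(","):
        pair = pair.strip()
        if not pair:
            continue
        name, sep, value = pair.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or name not in DEFAULTS:
            raise ValueError("bad option: " + repr(pair))
        default = DEFAULTS[name]
        if isinstance(default, bool):   # test bool before int: bool is an int subclass
            result[name] = value.upper() == "TRUE"  # assumption: only TRUE enables it
        elif isinstance(default, int):
            result[name] = int(value)
        else:
            result[name] = value
    return result


opts = parse_options("StartSel=<<, StopSel=>>, MaxWords=10")
print(opts)
```

Here MinWords, ShortWord, and HighlightAll keep their default values because the options string does not mention them.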
Tsearch2 supports indexed access to tsvector in order to further speed up FTS. Note that indexes are not mandatory for FTS!
A GiST index is very good for online updates, but is not as scalable as a GIN index,
which, in turn, is not good for updates. Both indexes support concurrency and recovery.
A thesaurus is a collection of words that includes information about the relationships between words and phrases,
i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.
Basically, a thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally,
preserves them for indexing. The thesaurus is used when indexing, so any change to the thesaurus requires reindexing.
Tsearch2's thesaurus dictionary (TZ) is an extension of the synonym dictionary
with phrase support. A thesaurus is a plain file of the following format:
# this is a comment
sample word(s) : indexed word(s)
...............................
- The colon (:) symbol is used as a delimiter.
- Use an asterisk (*) at the beginning of an indexed word to skip the subdictionary.
It is still required that the sample words be known.
- The thesaurus dictionary looks for the longest match.
TZ uses a subdictionary (which should be defined in the tsearch2 configuration)
to normalize the thesaurus text. Only one subdictionary can be defined.
Note that the subdictionary produces an error if it cannot recognize a word.
In that case, you should remove the definition line containing this word or teach the subdictionary to know it.
Stop words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e.,
only their position is important.
To break possible ties, the thesaurus applies the last definition. For example, consider
thesaurus rules (with a simple subdictionary) matching the pattern 'swsw'
('s' designates a stop word and 'w' a known word):
a one the two : swsw
the one a two : swsw2
The words 'a' and 'the' are stop words defined in the configuration of the subdictionary.
The thesaurus considers the texts 'the one the two' and 'that one then two' to be equal and will use the definition
'swsw2'.
Like a normal dictionary, it should be assigned to specific lexeme types.
Since TZ is able to recognize phrases, it must remember its state and interact with the parser.
TZ uses these assignments to check whether it should handle the next word or stop accumulation.
The compiler of a TZ should take care to configure it properly to avoid confusion.
For example, if TZ is assigned to handle only the lword lexeme type, then a TZ definition like
' one 1:11' will not work, since the lexeme type digit is not assigned to the TZ.
Configuration
- tsearch2
tsearch2 comes with a thesaurus template, which can be used to define a new dictionary:
INSERT INTO pg_ts_dict
(SELECT 'tz_simple', dict_init,
'DictFile="/path/to/tz_simple.txt",'
'Dictionary="en_stem"',
dict_lexize
FROM pg_ts_dict
WHERE dict_name = 'thesaurus_template');
Here:
- tz_simple - is the dictionary name
- DictFile="/path/to/tz_simple.txt" - is the location of thesaurus file
- Dictionary="en_stem" defines the dictionary (the Snowball English stemmer) to use for thesaurus normalization. Note that the en_stem dictionary has its own configuration (stop words, for example).
Now, it's possible to use tz_simple in pg_ts_cfgmap, for example:
update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and
tok_alias in ('lhword', 'lword', 'lpart_hword');
Examples
tz_simple:
one : 1
two : 2
one two : 12
the one : 1
one 1 : 11
To see how the thesaurus works, one can use the to_tsvector, to_tsquery or plainto_tsquery functions:
=# select plainto_tsquery('default_russian',' one day is oneday');
plainto_tsquery
------------------------
'1' & 'day' & 'oneday'
=# select plainto_tsquery('default_russian','one two day is oneday');
plainto_tsquery
-------------------------
'12' & 'day' & 'oneday'
=# select plainto_tsquery('default_russian','the one');
NOTICE: Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
plainto_tsquery
-----------------
'1'
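The longest-match behaviour seen in these examples can be sketched in Python using the tz_simple rules listed above. This is a toy model only: the subdictionary, the stop-word placeholder, and tie-breaking are not modelled, and the small STOP_WORDS set is an assumption standing in for the stop words of en_stem.

```python
from typing import Dict, List, Tuple

# The tz_simple rules from above: sample phrase -> indexed word.
RULES: Dict[Tuple[str, ...], str] = {
    ("one",): "1",
    ("two",): "2",
    ("one", "two"): "12",
    ("the", "one"): "1",
    ("one", "1"): "11",
}

STOP_WORDS = {"is", "the"}  # assumption: stand-in for the subdictionary's stop words


def apply_thesaurus(text: str) -> List[str]:
    """Replace the longest matching phrase at each position; drop leftover stop words."""
    words = text.lower().split()
    longest = max(len(phrase) for phrase in RULES)
    out: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in RULES:
                out.append(RULES[phrase])
                i += n
                break
        else:
            if words[i] not in STOP_WORDS:
                out.append(words[i])
            i += 1
    return out


for text in (" one day is oneday", "one two day is oneday", "the one"):
    print(apply_thesaurus(text))
```

Because the two-word rule 'one two' is tried before the one-word rule 'one', the second text yields '12' rather than '1' followed by '2'.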
Additional information about the thesaurus dictionary is available from
the Wiki page.