postgresql/contrib/tsearch
Bruce Momjian 8358302ded This patch fixes minor bugs in dictionary generator in contrib/tsearch
(contrib/tsearch/makedict/makedict.pl)

[ Backpatched to 7.3.]

Teodor Sigaev
2002-12-13 05:50:18 +00:00
..
data txtidx datatype for full text indexing with GiST. 2001-10-12 23:19:09 +00:00
dict Please, apply attached patch for contrib/tsearch to 7.2.1 and current 2002-03-05 06:10:28 +00:00
expected Ooops, forgot to fix contrib regression tests for WARNING/NOTICE change. 2002-11-01 22:46:04 +00:00
makedict This patch fixes minor bugs in dictionary generator in contrib/tsearch 2002-12-13 05:50:18 +00:00
sql SET autocommit no longer needed in /contrib because pg_regress.sh does 2002-10-21 01:42:14 +00:00
Makefile Need -I. to build outside source tree. 2002-11-01 23:23:29 +00:00
README.tsearch please apply small patch for README.tsearch. 2002-08-29 19:55:26 +00:00
crc32.c pgindent run on all C files. Java run to follow. initdb/regression 2001-10-25 05:50:21 +00:00
crc32.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
deflex.h pgindent run. 2002-09-04 20:31:48 +00:00
dict.h Remove all traces of multibyte and locale options. Clean up comments 2002-09-03 21:45:44 +00:00
gistidx.c pgindent run. 2002-09-04 20:31:48 +00:00
gistidx.h New pgindent run with fixes suggested by Tom. Patch manually reviewed, 2001-11-05 17:46:40 +00:00
morph.c Thank you very much, you catch it :). This bug had a long life, because it 2002-12-06 05:15:02 +00:00
morph.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
parser.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
parser.l Be careful to include postgres.h *before* any system headers, to ensure 2002-09-05 00:43:07 +00:00
query.c Arrange to compile flex output files as inclusions into other files 2002-11-01 22:52:34 +00:00
query.h New pgindent run with fixes suggested by Tom. Patch manually reviewed, 2001-11-05 17:46:40 +00:00
rewrite.c Patch resolve ERROR problem for non-goog query_txt. 2002-08-23 02:56:36 +00:00
rewrite.h Another pgindent run. Fixes enum indenting, and improves #endif 2001-10-28 06:26:15 +00:00
tsearch.sql.in Update /contrib for "autocommit TO 'on'". 2002-10-18 18:41:22 +00:00
txtidx.c Improve warning messages from tsearch trigger function; clean up some 2002-11-14 20:10:00 +00:00
txtidx.h New pgindent run with fixes suggested by Tom. Patch manually reviewed, 2001-11-05 17:46:40 +00:00

README.tsearch

Tsearch contrib module contains implementation of new data type txtidx - 
a searchable data type (textual) with indexed access.

All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
(oleg@sai.msu.su).

CHANGES:

August 29, 2002
        Space usage and using CLUSTER command documented
August 22, 2002
	Fix works with 'bad' queries
August 13, 2002
	Use parser of OpenFTS v0.33.

IMPORTANT NOTICE:

This is a first step of our work on integration of OpenFTS
full text search engine (http://openfts.sourceforge.net/) into 
PostgreSQL. It's based on our recent development of GiST 
(Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page
at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST)
and will works only for PostgreSQL version 7.2 and later.

We didn't try to implement a full-featured search engine with 
stable interfaces but rather experiment with various approaches.
There are many issues remains (most of them just not documented or
implemented) but we'd like to present a working prototype
of full text search engine fully integrated into PostgreSQL to 
collect user's feedback and recommendations. 

INSTALLATION:

   cd contrib/tsearch
   gmake
   gmake install

REGRESSION TEST:

   gmake installcheck

USAGE:

 psql DATABASE < tsearch.sql (from contrib/tsearch)
  

INTRODUCTION:

This module provides an implementation of a new data type 'txtidx' which is
a string of a space separated "words". "Words" with spaces 
should be enclosed in apostrophes and apostrophes inside a "word" should be 
escaped by backslash.

This is quite different from OpenFTS approach which uses array
of integers (ID of lexems) and requires storing of lexem-id pairs in database. 
One of the prominent benefit of this new approach is that  it's possible now
to perform full text search in a 'natural' way. 

Some examples:

  create table foo (
	titleidx  txtidx
  );

 2 regular words:
   insert into foo values ( 'the are' );
 Word with space:
   insert into foo values ( 'the\\ are' );
 Words with apostrophe:
   insert into foo values ( 'value\'s this' );
 Complex word with apostrophe:
   insert into foo values ( 'value\'s this we \'PostgreSQL site\'' );

   select * from foo where titleidx @@ '\'PostgreSQL site\' | this';
   select * from foo where titleidx @@ 'value\'s | this';
   select * from foo where titleidx @@ '(the|this)&!we';

test=# select 'two words'::txtidx;
    txtidx     
---------------
 'two' 'words'
(1 row)

test=# select 'single\\ word'::txtidx;
    txtidx     
---------------
 'single word'
(1 row)


FULL TEXT SEARCH:

The basic idea of this data type is to use it for full text search inside
database. If you have a 'text' column title and corresponding column 
titleidx of type 'txtidx', which contains the same information from 
text column, then search on title could be replaced by
searching on titleidx which would be fast because of indexed access.

As a real life example consider database with table 'titles' containing
titles of mailing list postings in column 'title':

  create table titles (
        title  text
 
 );

Suppose, you already have a lot of titles and want to do full text search
on them.

First, you need to install contrib/tsearch module (see INSTALLATION and USAGE).
Add column 'titleidx' of type txtidx, containing space separated words from 
title. It's possible to use function txt2txtidx(title) to fill 'titleidx' 
column (see notice 1):
  
  -- add titleidx column of type txtidx 
  alter table titles add titleidx txtidx;
  update titles set titleidx=txt2txtidx(title);

Create index on titleidx:
  create index t_idx on titles using gist(titleidx);

and now you can search all titles with words 'patch' and 'gist':
  select title from titles where titleidx ## 'patch&gist';

Here, ## is a new operation defined for type 'txtidx' which could use index 
(if exists) built on titleidx. This operator uses morphology to
expand query, i.e. 
  ## 'patches&gist' will find titles with 'patch' and 'gist' also.
If you want to provide query as is, use operator @@ instead:
  select title from titles where titleidx @@ 'patch&gist';
but remember, that function txt2txtidx does uses morphology, so you need
to fill column 'titleidx' using some another way. We hope in future releases
provide more consistent and convenient interfaces.

Query could contains boolean operators &,|,!,() with their usual meaning,
for example: 'patch&gist&!cvs', 'patch|cvs'.
Each operation ( ##, @@ ) requires appropriate query type - 
 txtidx ## mquery_txt
 txtidx @@ query_txt

To see what query actually will be used :

test=# select 'patches&gist'::mquery_txt;
    mquery_txt    
------------------
 'patch' & 'gist'
(1 row)

test=# select 'patches&gist'::query_txt;
     query_txt      
--------------------
 'patches' & 'gist'
(1 row)

Notice the difference !

You could use trigger to be sure column 'titleidx' is consistent
with any changes in column 'title':

  create trigger txtidxupdate before update or insert on titles
  for each row execute procedure tsearch(titleidx, title);

This trigger uses the same parser, dictionaries as function
txt2txtidx (see notice 1).
Current syntax allows creating trigger for several columns
you want to be searchable:

  create trigger txtidxupdate before update or insert on titles
  for each row execute procedure tsearch(titleidx, title1, title2,... );

Use function txtidxsize(titleidx) to get the number of "words" in column
titleidx. To get total number of words in table titles:

test=# select sum(txtidxsize(titleidx)) from titles;
   sum   
---------
 1917182
(1 row)

NOTICES:

1.
function txt2txtidx and trigger use parser, dictionaries coming with 
this contrib module on default. Parser is mostly the same as in OpenFTS and 
dictionaries are simple stemmers (sort of Lovin's stemmer which uses a 
longest match algorithm.) for english and russian languages. There is a perl
script makedict/makedict.pl, which could be used to create specific
dictionaries from files with endings and stop-words. 
Example files for english and russian languages are available 
from http://www.sai.msu.su/~megera/postgres/gist/tsearch/.
Run script without parameters to see information about arguments and options. 

Example:
cd makedict
./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \
              -o ../dict/YOURDICT.dct

Another options of makedict.pl:
-f do not execute tolower for any char
-a function of checking stopword will be work after lemmatize, 
   default is before

You need to edit dict.h to use your dictionary and, probably, 
morph.c to change mapdict array.

Don't forget to do 
  make clean; make; make install

2.
txtidx doesn't preserve words ordering (this is not critical for searching)
for performance reason, for example:

test=# select 'page two'::txtidx;
    txtidx    
--------------
 'two' 'page'
(1 row)

3. 
Indexed access provided by txtidx data type isn't always good
because of internal data structure we use (RD-Tree). Particularly,
queries like '!gist' will be  slower than just a sequential scan,
because for such queries RD-Tree doesn't provides selectivity on internal 
nodes and all checks should be processed at leaf nodes, i.t. scan of
full index. You may play with function query_tree to see how effective
will be index usage:

test=# select querytree( 'patch&gist'::query_txt );
    querytree     
------------------
 'patch' & 'gist'
(1 row)

This is an example of "good" query - index will effective for both words.

test=# select querytree( 'patch&!gist'::query_txt );
 querytree 
-----------
 'patch'
(1 row)

This means that index is effective only to search word 'patch' and resulted
rows will be checked against '!gist'.

test=# select querytree( 'patch|!gist'::query_txt );
 querytree 
-----------
 T
(1 row)

test=# select querytree( '!gist'::query_txt );
 querytree 
-----------
 T
(1 row)

These two queries will be processed by scanning of full index !
Very slow !

4.
Following selects produce the same result

  select title from titles where titleidx @@ 'patch&gist';
  select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist';

but the former will be more effective, because of internal optimization
of query executor.


TODO:

Better configurability (as in OpenFTS)
User's interfaces to parser, dictionaries ...
Write documentation


BENCHMARKS:

We use test collection in our experiments which contains 377905
titles from various mailing lists stored in our mailware
project. 

All runs were performed on IBM ThinkPad T21 notebook with 
PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev
We didn't do extensive benchmarking and all
numbers provide for illustration. Actual performance
is strongly depends on many factors (query, collection, dictionaries 
and hardware).

Collection is available for download from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/mw_titles.gz 
(377905 titles from postgresql mailing lists, about 3Mb).

0. install contrib/tsearch module
1. createdb test
2. psql test < tsearch.sql (from contrib/tsearch)
3. zcat mw_titles.gz | psql test 
   (it will creates table, copy test data and creates index)

Database contains one table:

test=# \d titles
                Table "titles"
  Column  |          Type          | Modifiers 
----------+------------------------+-----------
 title    | character varying(256) | 
 titleidx | txtidx                 | 
Indexes: t_idx

Index was created as:
   create index t_idx on titles using gist(titleidx);
   (notice: this operation takes about 14 minutes on my notebook)

Typical select looks like:
   select title from titles where titleidx @@ 'patch&gist';

Total number of lexems in collection : 1917182

1. We trust index  - we consider index is exact and no
   checking against tuples is necessary.

   update pg_amop set amopreqcheck = false where amopclaid = 
   (select oid from pg_opclass where opcname = 'gist_txtidx_ops');

using gist indices
1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00%
2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82%
3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00%
using gist indices (morph)
1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62%
2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47%
3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
disable gist index
1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62%
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00%
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62%
traditional like
1: title ~* 'gist' and title ~* 'patch'  0.010u 0.000s 0m9.206s 0.10%
2: title ~* 'gist' and title ~* 'patch'  0.000u 0.010s 0m9.205s 0.10%
3: title ~* 'gist' and title ~* 'patch'  0.010u 0.000s 0m9.208s 0.10%

2. Need to check results against tuples to avoid possible hash collision.

   update pg_amop set amopreqcheck = true where amopclaid = 
   (select oid from pg_opclass where opcname = 'gist_txtidx_ops');

using gist indices
1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26%
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00%
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39%
using gist indices (morph)
1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75%
3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13%

There are no visible difference between these 2 cases but your
mileage may vary.


NOTES:

1. The size of txtidx column should be lesser than size of corresponding column.
   Below some real numbers from test database (link above).

   a) After loading data
   
-rw-------    1 postgres users    23191552 Aug 29 14:08 53016937
-rw-------    1 postgres users    81059840 Aug 29 14:08 52639027

Table titles (52639027) occupies 80Mb, index on txtidx column (53016937)
occupies 22Mb. Use contrib/oid2name to get mappings from oid to names.
After doing

test=# select title  into titles_tmp from titles;
SELECT

I got size of table 'titles' without txtidx field

-rw-------    1 postgres users    30105600 Aug 29 14:14 53016938

So, txtidx column itself occupies about 50Mb. 

     b) after running 'vacuum full analyze' I got:

-rw-------    1 postgres users    30105600 Aug 29 14:26 53016938
-rw-------    1 postgres users    36880384 Aug 29 14:26 53016937
-rw-------    1 postgres users    51494912 Aug 29 14:26 52639027

53016938 = titles_tmp

So, actual size of 'txtidx' field is 20 Mb !  "quod erat demonstrandum"

2. CLUSTER command is highly recommended if you need fast searching.
   For example:

  test=# cluster t_idx on titles;

  BUT ! In 7.2 CLUSTER command forgets about other indices and permissions,
  so you need be carefull and rebuild these indices and restore permissions
  after clustering. Also, clustering isn't dynamic, so you'd need to 
  use CLUSTER from time to time. In 7.3 CLUSTER command should works
  fine.

  after clustering:

-rw-------    1 postgres users    23404544 Aug 29 14:59 53394850
-rw-------    1 postgres users    30105600 Aug 29 14:26 53016938
-rw-------    1 postgres users    50995200 Aug 29 14:45 53394845
pg@zen:/usr/local/pgsql/data/base/52638986$ oid2name -d test                 
All tables from database "test":
---------------------------------
53394850 = t_idx
53394845 = titles
53016938 = titles_tmp