postgresql/contrib/tsearch2/docs/tsearch-V2-intro.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>tsearch-v2-intro</title>

<link type="text/css" rel="stylesheet" href="tsearch-V2-intro_files/tsearch.txt"></head>


<body>
  <div class="content">
    <h2>Tsearch2 - Introduction</h2>

    <p><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
    [Online version]</a> of this document is available.</p>

    <p>The tsearch2 module is available to add as an extension to
    the PostgreSQL database to allow for Full Text Indexing. This
    document is an introduction to installing, configuring, using
    and maintaining the database with the tsearch2 module
    activated.</p>

    <p>Please, note, tsearch2 module is fully incompatible with old
    tsearch, which is deprecated in 7.4 and will be obsoleted in
    7.5.</p>

    <h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
    ENGINE</h3>

    <p>This documentation is provided as a short guide on how to
    quickly get up and running with tsearch2 and PostgreSQL, for
    those who want to implement a full text indexed based search
    engine. It is not meant to be a complete in-depth guide into
    the full ins and outs of the contrib/tsearch2 module, and is
    primarily aimed at beginners who want to speed up searching of
    large text fields, or those migrating from other database
    systems such as MS-SQL.</p>

    <p>The README.tsearch2 file included in the contrib/tsearch2
    directory contains a brief overview and history behind tsearch.
    This can also be found online <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/">[right
    here]</a>.</p>

    <p>Further in depth documentation such as a full function
    reference, and user guide can be found online at the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/">[tsearch
    documentation home]</a>.</p>

    <h3>ACKNOWLEDGEMENTS</h3>

    <p>Robert John Shepherd originally wrote this documentation for
    the previous version of tsearch module (v1) included with the
    postgres release. I took his documentation and updated it to
    comply with the tsearch2 modifications.</p>

    <p>Robert's original acknowledgements:</p>

    <p>"Thanks to Oleg Bartunov for taking the time to answer many
    of my questions regarding this module, and also to Teodor
    Sigaev for clearing up the process of making your own
    dictionaries. Plus of course a big thanks to the pair of them
    for writing this module in the first place!"</p>

    <p>I would also like to extend my thanks to the developers, and
    Oleg Bartunov for all of his direction and help with the new
    features of tsearch2.</p>

    <h3>OVERVIEW</h3>

    <p>MS-SQL provides a full text indexing (FTI) system which
    enables the fast searching of text based fields, very useful
    for websites (and other applications) that require a results
    set based on key words. PostgreSQL ships with a contributed
    module called tsearch2, which implements a special type of
    index that can also be used for full text indexing. Further
    more, unlike MS' offering which requires regular incremental
    rebuilds of the text indexes themselves, tsearch2 indexes are
    always up-to-date and keeping them so induces very little
    overhead.</p>

    <p>Before we get into the details, it is recommended that you
    have installed and tested PostgreSQL, are reasonably familiar
    with databases, the SQL query language and also understand the
    basics of connecting to PostgreSQL from the local shell. This
    document isn't intended for the complete PostgreSQL newbie, but
    anyone with a reasonable grasp of the basics should be able to
    follow it.</p>

    <h3>INSTALLATION</h3>

    <p>Starting with PostgreSQL version 7.4 tsearch2 is now
    included in the contrib directory with the PostgreSQL sources.
    contrib/tsearch2 is where you will find everything needed to
    install and use tsearch2. Please note that tsearch2 will also
    work with PostgreSQL version 7.3.x, but it is not the module
    included with the source distribution. You will have to
    download the module separately and install it in the same
    fashion.</p>

    <p>I installed the tsearch2 module to a PostgreSQL 7.3 database
    from the contrib directory without squashing the original (old)
    tsearch module. What I did was move the modules tsearch src
    driectory into the contrib tree under the name tsearchV2.</p>

    <p>Step one is to download the tsearch V2 module :</p>

    <p><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
    (check Development History for latest stable version !)</p>
    <pre>        tar -zxvf tsearch-v2.tar.gz
        mv tsearch2 PGSQL_SRC/contrib/
        cd PGSQL_SRC/contrib/tsearch2
</pre>

    <p>If you are installing from PostgreSQL version 7.4 or higher,
    you can skip those steps and just change to the
    contrib/tsearch2 directory in the source tree and continue from
    there.</p>

    <p>Then continue with the regular building and installation
    process</p>
    <pre>        gmake
        gmake install
        gmake installcheck
</pre>

    <p>That is pretty much all you have to do, unless of course you
    get errors. However if you get those, you better go check with
    the mailing lists over at <a href="http://www.postgresql.org/">http://www.postgresql.org</a> or
    <a href="http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
    since its never failed for me.</p>

    <p>The directory in the contib/ and the directory from the
    archive is called tsearch2. Tsearch2 is completely incompatible
    with the previous version of tsearch. This means that both
    versions can be installed into a single database, and migration
    the new version may be much easier.</p>

    <p>NOTE: the previous version of tsearch found in the
    contrib/tsearch directory is depricated. ALthough it is still
    available and included within PostgreSQL version 7.4. It will
    be removed in version 7.5.</p>

    <h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>

    <p>We should create a database to use as an example for the
    remainder of this file. We can call the database "ftstest". You
    can create it from the command line like this:</p>
    <pre>        #createdb ftstest
</pre>

    <p>If you thought installation was easy, this next bit is even
    easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
    type:</p>
    <pre>        psql ftstest &lt; tsearch2.sql
</pre>

    <p>The file "tsearch2.sql" holds all the wonderful little
    goodies you need to do full text indexing. It defines numerous
    functions and operators, and creates the needed tables in the
    database. There will be 4 new tables created after running the
    tsearch2.sql file : pg_ts_dict, pg_ts_parser, pg_ts_cfg,
    pg_ts_cfgmap are added.</p>

    <p>You can check out the tables if you like:</p>
    <pre>        #psql ftstest
        ftstest=# \d
                    List of relations
         Schema |     Name     | Type  |  Owner
        --------+--------------+-------+----------
         public | pg_ts_cfg    | table | kopciuch
         public | pg_ts_cfgmap | table | kopciuch
         public | pg_ts_dict   | table | kopciuch
         public | pg_ts_parser | table | kopciuch
        (4 rows)
</pre>

    <h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>

    <p>The first thing we can do is try out some of the types that
    are provided for us. Lets look at the tsvector type provided
    for us:</p>
    <pre>        SELECT 'Our first string used today'::tsvector;
                        tsvector
        ---------------------------------------
         'Our' 'used' 'first' 'today' 'string'
        (1 row)
</pre>

    <p>The results are the words used within our string. Notice
    they are not in any particular order. The tsvector type returns
    a string of space separated words.</p>
    <pre>        SELECT 'Our first string used today first string'::tsvector;
                            tsvector
        -----------------------------------------------
         'Our' 'used' 'again' 'first' 'today' 'string'
        (1 row)
</pre>

    <p>Notice the results string has each unique word ('first' and
    'string' only appear once in the tsvector value). Which of
    course makes sense if you are searching the full text ... you
    only need to know each unique word in the text.</p>

    <p>Those examples were just casting a text field to that of
    type tsvector. Lets check out one of the new functions created
    by the tsearch2 module.</p>

    <p>The function to_tsvector has 3 possible signatures:</p>
    <pre>        to_tsvector(oid, text);
        to_tsvector(text, text);
        to_tsvector(text);
</pre>

    <p>We will use the second method using two text fields. The
    overloaded methods provide us with a way to specifiy the way
    the searchable text is broken up into words (Stemming process).
    Right now we will specify the 'default' configuration. See the
    section on TSEARCH2 CONFIGURATION to learn more about this.</p>
    <pre>        SELECT to_tsvector('default',
                           'Our first string used today first string');
                        to_tsvector
        --------------------------------------------
         'use':4 'first':2,6 'today':5 'string':3,7
        (1 row)
</pre>

    <p>The result returned from this function is of type tsvector.
    The results came about by this reasoning: All of the words in
    the text passed in are stemmed, or not used because they are
    stop words defined in our configuration. Each lower case
    morphed word is returned with all of the positons in the
    text.</p>

    <p>In this case the word "Our" is a stop word in the default
    configuration. That means it will not be included in the
    result. The word "first" is found at positions 2 and 6
    (although "Our" is a stop word, it's position is maintained).
    The word(s) positioning is maintained exactly as in the
    original string. The word "used" is morphed to the word "use"
    based on the default configuration for word stemming, and is
    found at position 4. The rest of the results follow the same
    logic. Just a reminder again ... the order of the 'word'
    position in the output is not in any kind of order. (ie 'use':4
    appears first)</p>

    <p>If you want to view the output of the tsvector fields
    without their positions, you can do so with the function
    "strip(tsvector)".</p>
    <pre>        SELECT strip(to_tsvector('default',
                     'Our first string used today first string'));
                    strip
        --------------------------------
         'use' 'first' 'today' 'string'
</pre>

    <p>If you wish to know the number of unique words returned in
    the tsvector you can do so by using the function
    "length(tsvector)"</p>
    <pre>        SELECT length(to_tsvector('default',
                      'Our first string used today first string'));
         length
        --------
              4
        (1 row)
</pre>

    <p>Lets take a look at the function to_tsquery. It also has 3
    signatures which follow the same rational as the to_tsvector
    function:</p>
    <pre>        to_tsquery(oid, text);
        to_tsquery(text, text);
        to_tsquery(text);
</pre>

    <p>Lets try using the function with a single word :</p>
    <pre>        SELECT to_tsquery('default', 'word');
         to_tsquery
        -----------
         'word'
         (1 row)
</pre>

    <p>I call the function the same way I would a to_tsvector
    function, specifying the 'default' configuration for morphing,
    and the result is the stemmed output 'word'.</p>

    <p>Lets attempt to use the function with a string of multiple
    words:</p>
    <pre>        SELECT to_tsquery('default', 'this is many words');
        ERROR:  Syntax error
</pre>

    <p>The function can not accept a space separated string. The
    intention of the to_tsquery function is to return a type of
    "tsquery" used for searching a tsvector field. What we need to
    do is search for one to many words with some kind of logic (for
    now simple boolean).</p>
    <pre>        SELECT to_tsquery('default', 'searching|sentence');
              to_tsquery
        ----------------------
         'search' | 'sentenc'
        (1 row)
</pre>

    <p>Notice that the words are separated by the boolean logic
    "OR", the text could contain boolean operators &amp;,|,!,()
    with their usual meaning.</p>

    <p>You can not use words defined as being a stop word in your
    configuration. The function will not fail ... you will just get
    no result, and a NOTICE like this:</p>
    <pre>        SELECT to_tsquery('default', 'a|is&amp;not|!the');
        NOTICE:  Query contains only stopword(s)
                 or doesn't contain lexem(s), ignored
         to_tsquery
        -----------
        (1 row)
</pre>

    <p>That is a beginning to using the types, and functions
    defined in the tsearch2 module. There are numerous more
    functions that I have not touched on. You can read through the
    tsearch2.sql file built when compiling to get more familiar
    with what is included.</p>

    <h3>INDEXING FIELDS IN A TABLE</h3>

    <p>The next stage is to add a full text index to an existing
    table. In this example we already have a table defined as
    follows:</p>
    <pre>        CREATE TABLE tblMessages
        (
                intIndex        int4,
                strTopic        varchar(100),
                strMessage      text
        );
</pre>

    <p>We are assuming there are several rows with some kind of
    data in them. Any data will do, just do several inserts with
    test strings for a topic, and a message. here is some test data
    I inserted. (yes I know it's completely useless stuff ;-) but
    it will serve our purpose right now).</p>
    <pre>        INSERT INTO tblMessages
               VALUES ('1', 'Testing Topic', 'Testing message data input');
        INSERT INTO tblMessages
               VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
        INSERT INTO tblMessages
               VALUES ('3', 'Famous Author', 'Stephen King');
        INSERT INTO tblMessages
               VALUES ('4', 'Political Topic',
                            'Nelson Mandella is released from prison');
        INSERT INTO tblMessages
               VALUES ('5', 'Nursery rhyme phrase',
                            'Little jack horner sat in a corner');
        INSERT INTO tblMessages
               VALUES ('6', 'Gettysburg address quotation',
                            'Four score and seven years ago'
                            ' our fathers brought forth on this'
                            ' continent a new nation, conceived in'
                            ' liberty and dedicated to the proposition'
                            ' that all men are created equal');
        INSERT INTO tblMessages
               VALUES ('7', 'Classic Rock Bands',
                            'Led Zeppelin Grateful Dead and The Sex Pistols');
        INSERT INTO tblMessages
               VALUES ('8', 'My birth address',
                            '18 Sommervile road, Regina, Saskatchewan');
        INSERT INTO tblMessages
               VALUES ('9', 'Joke', 'knock knock : who\'s there?'
                                    ' I will not finish this joke');
        INSERT INTO tblMessages
               VALUES ('10', 'Computer information',
                             'My computer is a pentium III 400 mHz'
                             ' with 192 megabytes of RAM');
</pre>

    <p>The next stage is to create a special text index which we
    will use for FTI, so we can search our table of messages for
    words or a phrase. We do this using the SQL command:</p>
    <pre>        ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
</pre>

    <p>Note that unlike traditional indexes, this is actually a new
    field in the same table, which is then used (through the magic
    of the tsearch2 operators and functions) by a special index we
    will create in a moment.</p>

    <p>The general rule for the initial insertion of data will
    follow four steps:</p>
    <pre>    1. update table
    2. vacuum full analyze
    3. create index
    4. vacuum full analyze
</pre>

    <p>The data can be updated into the table, the vacuum full
    analyze will reclaim unused space. The index can be created on
    the table after the data has been inserted. Having the index
    created prior to the update will slow down the process. It can
    be done in that manner, this way is just more efficient. After
    the index has been created on the table, vacuum full analyze is
    run again to update postgres's statistics (ie having the index
    take effect).</p>
    <pre>        UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
        VACUUM FULL ANALYZE;
</pre>

    <p>Note that this only inserts the field strMessage as a
    tsvector, so if you want to also add strTopic to the
    information stored, you should instead do the following, which
    effectively concatenates the two fields into one before being
    inserted into the table:</p>
    <pre>        UPDATE tblMessages
            SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
        VACUUM FULL ANALYZE;
</pre>

    <p><strong>Using the coalesce function makes sure this
    concatenation also works with NULL fields.</strong></p>

    <p>We need to create the index on the column idxFTI. Keep in
    mind that the database will update the index when some action
    is taken. In this case we _need_ the index (The whole point of
    Full Text INDEXINGi ;-)), so don't worry about any indexing
    overhead. We will create an index based on the gist function.
    GiST is an index structure for Generalized Search Tree.</p>
    <pre>        CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
        VACUUM FULL ANALYZE;
</pre>

    <p>After you have converted all of your data and indexed the
    column, you can select some rows to see what actually happened.
    I will not display output here but you can play around
    yourselves and see what happened.</p>

    <p>The last thing to do is set up a trigger so every time a row
    in this table is changed, the text index is automatically
    updated. This is easily done using:</p>
    <pre>        CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
            FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
</pre>

    <p>Or if you are indexing both strMessage and strTopic you
    should instead do:</p>
    <pre>        CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
            FOR EACH ROW EXECUTE PROCEDURE
                tsearch2(idxFTI, strTopic, strMessage);
</pre>

    <p>Before you ask, the tsearch2 function accepts multiple
    fields as arguments so there is no need to concatenate the two
    into one like we did before.</p>

    <p>If you want to do something specific with columns, you may
    write your very own trigger function using plpgsql or other
    procedural languages (but not SQL, unfortunately) and use it
    instead of <em>tsearch2</em> trigger.</p>

    <p>You could however call other stored procedures from within
    the tsearch2 function. Lets say we want to create a function to
    remove certain characters (like the @ symbol from all
    text).</p>
    <pre>       CREATE FUNCTION dropatsymbol(text)
                     RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
</pre>

    <p>Now we can use this function within the tsearch2 function on
    the trigger.</p>
    <pre>      DROP TRIGGER tsvectorupdate ON tblmessages;
        CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
            FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
        INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
</pre>

    <p>If at this point you receive an error stating: ERROR: Can't
    find tsearch config by locale</p>

    <p>Do not worry. You have done nothing wrong. And tsearch2 is
    not broken. All that has happened here is that the
    configuration is setup to use a configuration based on the
    locale of the server. All you have to do is change your default
    configuration, or add a new one for your specific locale. See
    the section on TSEARCH2 CONFIGURATION.</p>
    <pre class="real">   SELECT * FROM tblmessages WHERE intindex = 69;

         intindex |         strtopic         |  strmessage   |        idxfti
        ----------+--------------------------+---------------+-----------------------
                69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2
        (1 row)
</pre>Notice that the string content was passed throught the stored
procedure dropatsymbol. The '@' character was replaced with a
single space ... and the output from the procedure was then stored
in the tsvector column.

    <p>This could be useful for removing other characters from
    indexed text, or any kind of preprocessing needed to be done on
    the text prior to insertion into the index.</p>

    <h3>QUERYING A TABLE</h3>

    <p>There are some examples in the README.tsearch2 file for
    querying a table. One major difference between tsearch and
    tsearch2 is the operator ## is no longer available. Only the
    operator @@ is defined, using the types tsvector on one side
    and tsquery on the other side.</p>

    <p>Lets search the indexed data for the word "Test". I indexed
    based on the the concatenation of the strTopic, and the
    strMessage:</p>
    <pre>        SELECT intindex, strtopic FROM tblmessages
                                  WHERE idxfti @@ 'test'::tsquery;
         intindex |   strtopic
        ----------+---------------
                1 | Testing Topic
        (1 row)
</pre>

    <p>The only result that matched was the row with a topic
    "Testing Topic". Notice that the word I search for was all
    lowercase. Let's see what happens when I query for uppercase
    "Test".</p>
    <pre>        SELECT intindex, strtopic FROM tblmessages
                                  WHERE idxfti @@ 'Test'::tsquery;
         intindex | strtopic
        ----------+----------
        (0 rows)
</pre>

    <p>We get zero rows returned. The reason is because when the
    text was inserted, it was morphed to my default configuration
    (because of the call to to_tsvector in the UPDATE statement).
    If there was no morphing done, and the tsvector field(s)
    contained the word 'Text', a match would have been found.</p>

    <p>Most likely the best way to query the field is to use the
    to_tsquery function on the right hand side of the @@ operator
    like this:</p>
    <pre>        SELECT intindex, strtopic FROM tblmessages
               WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
         intindex |      strtopic
        ----------+--------------------
                1 | Testing Topic
                7 | Classic Rock Bands
        (2 rows)
</pre>

    <p>That query searched for all instances of "Test" OR
    "Zeppelin". It returned two rows: the "Testing Topic" row, and
    the "Classic Rock Bands" row. The to_tsquery function performed
    the correct morphology upon the parameters, and searched the
    tsvector field appropriately.</p>

    <p>The last example here relates to searching for a phrase, for
    example "minority report". This poses a problem with regard to
    tsearch2, as it doesn't index phrases, only words. But there is
    a way around which doesn't appear to have a significant impact
    on query time, and that is to use a query such as the
    following:</p>
    <pre>        SELECT intindex, strTopic FROM tblmessages
                WHERE idxfti @@ to_tsquery('default', 'gettysburg &amp; address')
                AND strMessage ~* '.*men are created equal.*';
         intindex |           strtopic
        ----------+------------------------------
                6 | Gettysburg address quotation
        (1 row)
        SELECT intindex, strTopic FROM tblmessages
                WHERE idxfti @@ to_tsquery('default', 'gettysburg &amp; address')
                AND strMessage ~* '.*something that does not exist.*';
         intindex | strtopic
        ----------+----------
        (0 rows)
</pre>

    <p>Of course if your indexing both strTopic and strMessage, and
    want to search for this phrase on both, then you will have to
    get out the brackets and extend this query a little more.</p>

    <h3>TSEARCH2 CONFIGURATION</h3>

    <p>Some words such as "and", "the", and "who" are automatically
    not indexed, since they belong to a pre-existing dictionary of
    "Stop Words" which tsearch2 does not perform indexing on. If
    someone needs to search for "The Who" in your database, they
    are going to have a tough time coming up with any results,
    since both are ignored in the indexes. But there is a
    solution.</p>

    <p>Lets say we want to add a word into the stop word list for
    english stemming. We could edit the file
    :'/usr/local/pgsql/share/english.stop' and add a word to the
    list. I edited mine to exclude my name from indexing:</p>
    <pre>    - Edit /usr/local/pgsql/share/english.stop
    - Add 'andy' to the list
    - Save the file.
</pre>

    <p>When you connect to the database, the dict_init procedure is
    run during initialization. And in my configuration it will read
    the stop words from the file I just edited. If you were
    connected to the DB while editing the stop words, you will need
    to end the current session and re-connect. When you re-connect
    to the database, 'andy' is no longer indexed:</p>
    <pre>        SELECT to_tsvector('default', 'Andy');
         to_tsvector
        ------------
        (1 row)
</pre>

    <p>Originally I would get the result :</p>
    <pre>        SELECT to_tsvector('default', 'Andy');
         to_tsvector
        ------------
         'andi':1
        (1 row)
</pre>

    <p>But since I added it as a stop word, it would be ingnored on
    the indexing. The stop word added was used in the dictionary
    "en_stem". If I were to use a different configuration such as
    'simple', the results would be different. There are no stop
    words for the simple dictionary. It will just convert to lower
    case, and index every unique word.</p>
    <pre>        SELECT to_tsvector('simple', 'Andy andy The the in out');
                     to_tsvector
        -------------------------------------
         'in':5 'out':6 'the':3,4 'andy':1,2
        (1 row)
</pre>

    <p>All this talk about which configuration to use is leading us
    into the actual configuration of tsearch2. In the examples in
    this document the configuration has always been specified when
    using the tsearch2 functions:</p>
    <pre>        SELECT to_tsvector('default', 'Testing the default config');
        SELECT to_tsvector('simple', 'Example of simple Config');
</pre>

    <p>The pg_ts_cfg table holds each configuration you can use
    with the tsearch2 functions. As you can see the ts_name column
    contains both the 'default' configurations based on the 'C'
    locale. And the 'simple' configuration which is not based on
    any locale.</p>
    <pre>        SELECT * from pg_ts_cfg;
             ts_name     | prs_name |    locale
        -----------------+----------+--------------
         default         | default  | C
         default_russian | default  | ru_RU.KOI8-R
         simple          | default  |
        (3 rows)
</pre>

    <p>Each row in the pg_ts_cfg table contains the name of the
    tsearch2 configuration, the name of the parser to use, and the
    locale mapped to the configuration. There is only one parser to
    choose from the table pg_ts_parser called 'default'. More
    parsers could be written, but for our needs we will use the
    default.</p>

    <p>There are 3 configurations installed by tsearch2 initially.
    If your locale is set to 'en_US' for example (like my laptop),
    then as you can see there is currently no dictionary configured
    to use with that locale. You can either set up a new
    configuration or just use one that already exists. If I do not
    specify which configuration to use in the to_tsvector function,
    I receive the following error.</p>
    <pre>        SELECT to_tsvector('learning tsearch is like going to school');
        ERROR:  Can't find tsearch config by locale
</pre>

    <p>We will create a new configuration for use with the server
    encoding 'en_US'. The first step is to add a new configuration
    into the pg_ts_cfg table. We will call the configuration
    'default_english', with the default parser and use the locale
    'en_US'.</p>
    <pre>        INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
               VALUES ('default_english', 'default', 'en_US');
</pre>

    <p>We have only declared that there is a configuration called
    'default_english'. We need to set the configuration of how
    'default_english' will work. The next step is creating a new
    dictionary to use. The configuration of the dictionary is
    completlely different in tsearch2. In the prior versions to
    make changes, you would have to re-compile your changes into
    the tsearch.so. All of the configuration has now been moved
    into the system tables created by executing the SQL code from
    tsearch2.sql</p>

    <p>Lets take a first look at the pg_ts_dict table</p>
    <pre>        ftstest=# \d pg_ts_dict
                Table "public.pg_ts_dict"
         Column      |  Type   | Modifiers
        -----------------+---------+-----------
         dict_name       | text    | not null
         dict_init       | oid     |
         dict_initoption | text    |
         dict_lexize     | oid     | not null
         dict_comment    | text    |
        Indexes: pg_ts_dict_idx unique btree (dict_name)
</pre>

    <p>The dict_name column is the name of the dictionary, for
    example 'simple', 'en_stem' or 'ru_stem'. The dict_init column
    is an OID of a stored procedure to run for initialization of
    that dictionary, for example 'snb_en_init' or 'snb_ru_init'.
    The dict_init option is used for options passed to the init
    function for the stored procedure. In the cases of 'en_stem' or
    'ru_stem' it is a path to a stopword file for that dictionary,
    for example '/usr/local/pgsql/share/english.stop'. This is
    however dictated by the dictionary. ISpell dictionaries may
    require different options. The dict_lemmatize column is another
    OID of a stored procedure to the function used to lemmitize,
    for example 'snb_lemmatize'. The dict_comment column is just a
    comment.</p>

    <p>Next we will configure the use of a new dictionary based on
    ISpell. We will assume you have ISpell installed on you
    machine. (in /usr/local/lib)</p>

    <p>There has been some confusion in the past as to which files
    are used from ISpell. ISpell operates using a hash file. This
    is a binary file created by the ISpell command line utility
    "buildhash". This utility accepts a file containing the words
    from the dictionary, and the affixes file and the output is the
    hash file. The default installation of ISPell installs the
    english hash file english.hash, which is the exact same file as
    american.hash. ISpell uses this as the fallback dictionary to
    use.</p>

    <p>This hash file is not what tsearch2 requires as the ISpell
    interface. The file(s) needed are those used to create the
    hash. Tsearch uses the dictionary words for morphology, so the
    listing is needed not spellchecking. Regardless, these files
    are included in the ISpell sources, and you can use them to
    integrate into tsearch2. This is not complicated, but is not
    very obvious to begin with. The tsearch2 ISpell interface needs
    only the listing of dictionary words, it will parse and load
    those words, and use the ISpell dictionary for lexem
    processing.</p>

    <p>I found the ISPell make system to be very finicky. Their
    documentation actually states this to be the case. So I just
    did things the command line way. In the ISpell source tree
    under langauges/english there are several files in this
    directory. For a complete description, please read the ISpell
    README. Basically for the english dictionary there is the
    option to create the small, medium, large and extra large
    dictionaries. The medium dictionary is recommended. If the make
    system is configured correctly, it would build and install the
    english.has file from the medium size dictionary. Since we are
    only concerned with the dictionary word listing ... it can be
    created from the /languages/english directory with the
    following command:</p>
    <pre>   sort -u -t/ +0f -1 +0 -T /usr/tmp -o english.med english.0 english.1
</pre>

    <p>This will create a file called english.med. You can copy
    this file to whever you like. I place mine in /usr/local/lib so
    it coincides with the ISpell hash files. You can now add the
    tsearch2 configuration entry for the ISpell english dictionary.
    We will also continue to use the english word stop file that
    was installed for the en_stem dictionary. You could use a
    different one if you like. The ISpell configuration is based on
    the "ispell_template" dictionary installed by default with
    tsearch2. We will use the OIDs to the stored procedures from
    the row where the dict_name = 'ispell_template'.</p>
    <pre>        INSERT INTO pg_ts_dict
               (SELECT 'en_ispell',
                       dict_init,
                       'DictFile="/usr/local/lib/english.med",'
                       'AffFile="/usr/local/lib/english.aff",'
                       'StopFile="/usr/local/pgsql/share/english.stop"',
                       dict_lexize
                FROM pg_ts_dict
                WHERE dict_name = 'ispell_template');
</pre>

    <p>Now that we have a dictionary we can specify it's use in a
    query to get a lexem. For this we will use the lexize function.
    The lexize function takes the name of the dictionary to use as
    an argument. Just as the other tsearch2 functions operate.</p>
    <pre>   SELECT lexize('en_ispell', 'program');
          lexize
        -----------
         {program}
        (1 row)
</pre>

    <p>If you wanted to always use the ISpell english dictionary
    you have installed, you can configure tsearch2 to always use a
    specific dictionary.</p>
    <pre>  SELCECT set_curdict('en_ispell');
</pre>

    <p>Lexize is meant to turn a word into a lexem. It is possible
    to receive more than one lexem returned for a single word.</p>
    <pre> SELECT lexize('en_ispell', 'conditionally');
           lexize
        -----------------------------
         {conditionally,conditional}
        (1 row)
</pre>

    <p>The lexize function is not meant to take a full string as an
    argument to return lexems for. If you passed in an entire
    sentence, it attempts to find that entire sentence in the
    dictionary. SInce the dictionary contains only words, you will
    receive an empty result set back.</p>
    <pre>      SELECT lexize('en_ispell', 'This is a senctece to lexize');
         lexize
        --------

        (1 row)

If you parse a lexem from a word not in the dictionary, then you will receive an empty result. This makes sense because the word "tsearch" is not int the english dictionary. You can create your own additions to the dictionary if you like. This may be useful for scientific or technical glossaries that need to be indexed. SELECT lexize('en_ispell', 'tsearch'); lexize -------- (1 row)
</pre>

    <p>This is not to say that tsearch will be ignored when adding
    text information to the the tsvector index column. This will be
    explained in greater detail with the table pg_ts_cfgmap.</p>

    <p>Next we need to set up the configuration for mapping the
    dictionay use to the lexxem parsings. This will be done by
    altering the pg_ts_cfgmap table. We will insert several rows,
    specifying to using the new dictionary we installed and
    configured for use within tsearch2. There are several type of
    lexims we would be concerned with forcing the use of the ISpell
    dictionary.</p>
    <pre>        INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
               VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
        INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
               VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
        INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
               VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
</pre>

    <p>We have just inserted 3 records to the configuration
    mapping, specifying that the lexem types for "lhword,
    lpart_hword and lword" are to be stemmed using the 'en_ispell'
    dictionary we added into pg_ts_dict, when using the
    configuration ' default_english' which we added to
    pg_ts_cfg.</p>

    <p>There are several other lexem types used that we do not need
    to specify as using the ISpell dictionary. We can simply insert
    values using the 'simple' stemming process dictionary.</p>
    <pre>        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'url', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'host', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'sfloat', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'uri', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'int', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'float', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'email', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'word', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'hword', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'nlword', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'nlpart_hword', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'part_hword', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'nlhword', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'file', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'uint', '{simple}');
        INSERT INTO pg_ts_cfgmap
               VALUES ('default_english', 'version', '{simple}');
</pre>

    <p>Our addition of a configuration for 'default_english' is now
    complete. We have successfully created a new tsearch2
    configuration. At the same time we have also set the new
    configuration to be our default for en_US locale.</p>
    <pre>        SELECT to_tsvector('default_english',
                           'learning tsearch is like going to school');
                           to_tsvector
        --------------------------------------------------
         'go':5 'like':4 'learn':1 'school':7 'tsearch':2
        SELECT to_tsvector('learning tsearch is like going to school');
                            to_tsvector
        --------------------------------------------------
         'go':5 'like':4 'learn':1 'school':7 'tsearch':2
        (1 row)
</pre>

    <p>Notice here that words like "tsearch" are still parsed and
    indexed in the tsvector column. There is a lexem returned for
    the word becuase in the configuration mapping table, we specify
    words to be used from the 'en_ispell' dictionary first, but as
    a fallback to use the 'en_stem' dictionary. Therefore a lexem
    is not returned from en_ispell, but is returned from en_stem,
    and added to the tsvector.</p>
    <pre> SELECT to_tsvector('learning tsearch is like going to computer school');
                                to_tsvector
        ---------------------------------------------------------------------------
         'go':5 'like':4 'learn':1 'school':8 'compute':7 'tsearch':2 'computer':7
        (1 row)
</pre>

    <p>Notice in this last example I added the word "computer" to
    the text to be converted into a tsvector. Because we have setup
    our default configuration to use the ISpell english dictionary,
    the words are lexized, and computer returns 2 lexems at the
    same position. 'compute':7 and 'computer':7 are now both
    indexed for the word computer.</p>

    <p>You can create additional dictionarynlists, or use the extra
    large dictionary from ISpell. You can read through the ISpell
    documents, and source tree to make modifications as you see
    fit.</p>

    <p>In the case that you already have a configuration set for
    the locale, and you are changing it to your new dictionary
    configuration. You will have to set the old locale to NULL. If
    we are using the 'C' locale then we would do this:</p>
    <pre>        UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
</pre>

    <p>That about wraps up the configuration of tsearch2. There is
    much more you can do with the tables provided. This was just an
    introduction to get things working rather quickly.</p>

    <h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>

    <p>To aid in the addition of new dictionaries to the tsearch2
    module you can use another additional module in combination
    with tsearch2. The gendict module is included into tsearch2
    distribution and is available from gendict/ subdirectory.</p>

    <p>I will not go into detail about installation and
    instructions on how to use gendict to it's fullest extent right
    now. You can read the README.gendict ... it has all of the
    instructions and information you will need.</p>

    <h3>BACKING UP AND RESTORING DATABASES THAT FEATURE
    TSEARCH2</h3>

    <p>Believe it or not, this isn't as straight forward as it
    should be, and you will have problems trying to backup and
    restore any database which uses tsearch2 unless you take the
    steps shown below. And before you ask using pg_dumpall will
    result in failure every time. These took a lot of trial and
    error to get working, but the process as laid down below has
    been used a dozen times now in live production environments so
    it should work fine.</p>

    <p>HOWEVER never rely on anyone elses instructions to backup
    and restore a database system, always develop and understand
    your own methodology, and test it numerous times before you
    need to do it for real.</p>

    <p>To Backup a PostgreSQL database that uses the tsearch2
    module:</p>

    <p>1) Backup any global database objects such as users and
    groups (this step is usually only necessary when you will be
    restoring to a virgin system)</p>
    <pre>        pg_dumpall -g &gt; GLOBALobjects.sql
</pre>

    <p>2) Backup the full database schema using pg_dump</p>
    <pre>        pg_dump -s DATABASE &gt; DATABASEschema.sql
</pre>

    <p>3) Backup the full database using pg_dump</p>
    <pre>        pg_dump -Fc DATABASE &gt; DATABASEdata.tar
</pre>

    <p>To Restore a PostgreSQL database that uses the tsearch2
    module:</p>

    <p>1) Create the blank database</p>
    <pre>        createdb DATABASE
</pre>

    <p>2) Restore any global database objects such as users and
    groups (this step is usually only necessary when you will be
    restoring to a virgin system)</p>
    <pre>        psql DATABASE &lt; GLOBALobjects.sql
</pre>

    <p>3) Create the tsearch2 objects, functions and operators</p>
    <pre>        psql DATABASE &lt; tsearch2.sql
</pre>

    <p>4) Edit the backed up database schema and delete all SQL
    commands which create tsearch2 related functions, operators and
    data types, BUT NOT fields in table definitions that specify
    tsvector types. If your not sure what these are, they are the
    ones listed in tsearch2.sql. Then restore the edited schema to
    the database</p>
    <pre>        psql DATABASE &lt; DATABASEschema.sql
</pre>

    <p>5) Restore the data for the database</p>
    <pre>        pg_restore -N -a -d DATABASE DATABASEdata.tar
</pre>

    <p>If you get any errors in step 4, it will most likely be
    because you forgot to remove an object that was created in
    tsearch2.sql. Any errors in step 5 will mean the database
    schema was probably restored wrongly.</p>
  </div>
</body></html>