src/backend/snowball/README

Snowball-Based Stemming
=======================

This module uses the word stemming code developed by the Snowball project,
http://snowballstem.org (formerly http://snowball.tartarus.org)
which is released by them under a BSD-style license.

Sync our Snowball stemmer dictionaries with current upstream.
We haven't touched these since text search functionality landed in core
in 2007 :-(. While the upstream project isn't a beehive of activity,
they do make additions and bug fixes from time to time. Update our
copies of these files.
Also update our documentation about how to keep things in sync, since
they're not making distribution tarballs these days. Fortunately,
their source code turns out to be a breeze to build.
Notable changes:
* The non-UTF8 version of the hungarian stemmer now works in LATIN2
not LATIN1.
* New stemmers have appeared for arabic, indonesian, irish, lithuanian,
nepali, and tamil. These all work in UTF8, and the indonesian and
irish ones also work in LATIN1.
(There are some new stemmers that I did not incorporate, mainly because
their names don't match the underlying languages, suggesting that they're
not to be considered mainstream.)
Worth noting: the upstream Nepali dictionary was contributed by
Arthur Zakirov.
initdb forced because the contents of snowball_create.sql have
changed.
Still TODO: see about updating the stopword lists.
Arthur Zakirov, minor mods and doc work by me
Discussion: https://postgr.es/m/20180626122025.GA12647@zakirov.localdomain
Discussion: https://postgr.es/m/20180219140849.GA9050@zakirov.localdomain
2018-09-24 23:29:08 +02:00

The Snowball project is not currently making formal releases; it's best
to pull from their git repository

    git clone https://github.com/snowballstem/snowball.git

and then building the derived files is as simple as

    cd snowball
    make

At least on Linux, no platform-specific adjustment is needed.

Postgres' files under src/backend/snowball/libstemmer/ and
src/include/snowball/libstemmer/ are taken directly from the Snowball
files, with only some minor adjustments of file inclusions. Note
that most of these files are in fact derived files, not master source.
The master sources are in the Snowball language, and are built using
the Snowball-to-C compiler that is also part of the Snowball project.
We choose to include the derived files in the PostgreSQL distribution
because most installations will not have the Snowball compiler available.

We are currently synced with the Snowball git commit
4456b82c26c02493e8807a66f30593a98c5d2888
of 2019-06-24.
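When re-syncing, the hash and date recorded above need to be refreshed. The following is only a sketch: the function name snowball_sync_line is made up for illustration, and it assumes the clone made with "git clone" is checked out at the commit being imported. It prints the author date; whether the date above is meant as author or commit date is not stated here.

```shell
# Hypothetical helper: print "<commit hash> of <YYYY-MM-DD>" for the
# commit currently checked out in the given Snowball clone.
snowball_sync_line() {
    snowball_dir=$1
    # %H is the full commit hash; %ad is the author date, which is
    # formatted according to --date=short.
    git -C "$snowball_dir" log -1 --date=short --format='%H of %ad'
}

# Example call (the path is illustrative):
#   snowball_sync_line ./snowball
```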

To update the PostgreSQL sources from a new Snowball version:

0. If you haven't already done so, run "make -C snowball".

1. Copy the *.c files in snowball/src_c/ to src/backend/snowball/libstemmer
with replacement of "../runtime/header.h" by "header.h", for example

for f in .../snowball/src_c/*.c
do
    sed 's|\.\./runtime/header\.h|header.h|' "$f" >"libstemmer/$(basename "$f")"
done

2. Copy the *.c files in snowball/runtime/ to
src/backend/snowball/libstemmer, and edit them to remove direct inclusions
of system headers such as <stdio.h> --- they should only include "header.h".
(This removal avoids portability problems on some platforms where <stdio.h>
is sensitive to largefile compilation options.)
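
As with the loop shown for step 1, this edit can be scripted. What follows is only a sketch: strip_system_includes is a made-up name, and the sed expression assumes each system header is pulled in by an #include <...> directive on a line of its own. It drops every such line, so review the result by hand before committing it.

```shell
# Hypothetical helper: copy the runtime *.c files, dropping lines that
# include system headers (assumed to look like: #include <name.h>).
strip_system_includes() {
    src_dir=$1      # e.g. .../snowball/runtime
    dest_dir=$2     # e.g. libstemmer
    for f in "$src_dir"/*.c
    do
        # Only angle-bracket includes are deleted; #include "header.h"
        # uses double quotes and is therefore kept.
        sed '/^[[:space:]]*#[[:space:]]*include[[:space:]]*<.*>/d' "$f" \
            >"$dest_dir/$(basename "$f")"
    done
}
```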

3. Copy the *.h files in snowball/src_c/ and snowball/runtime/
to src/include/snowball/libstemmer. At this writing the header files
do not require any changes.
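
Since no editing is needed, this step is a plain copy. A minimal sketch, with copy_snowball_headers as a made-up name and both directories passed in by the caller:

```shell
# Hypothetical helper: copy the generated and runtime headers unchanged.
copy_snowball_headers() {
    snowball_dir=$1     # top of the Snowball checkout
    dest_dir=$2         # e.g. src/include/snowball/libstemmer
    cp "$snowball_dir"/src_c/*.h "$snowball_dir"/runtime/*.h "$dest_dir"/
}
```

If a later Snowball version does require header changes, a sed filter like the one in step 1 would have to replace the plain cp.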

4. Check whether any stemmer modules have been added or removed. If so, edit
the OBJS list in Makefile, the list of #include's in dict_snowball.c, and the
stemmer_modules[] table in dict_snowball.c. You might also need to change
the LANGUAGES list in Makefile and tsearch_config_languages in initdb.c.
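
One way to spot added or removed modules is to compare the stemmer source files on disk with the object files named in the Makefile. This is a rough sketch, not part of the documented procedure: check_stemmer_objs is a made-up name, and it assumes the stemmer sources are named stem_*.c and appear in the Makefile as stem_*.o. It does not look at dict_snowball.c or initdb.c, which still need checking by hand.

```shell
# Hypothetical helper: show stemmers whose source file and Makefile
# entry disagree. Returns 0 when the two lists match.
check_stemmer_objs() {
    libstemmer_dir=$1
    makefile=$2
    src_list=$(mktemp)
    obj_list=$(mktemp)
    # Stemmer names taken from the source files present on disk.
    for f in "$libstemmer_dir"/stem_*.c
    do
        basename "$f" .c
    done | sort >"$src_list"
    # Stemmer names taken from the stem_*.o words in the Makefile.
    grep -o 'stem_[A-Za-z0-9_]*\.o' "$makefile" | sed 's/\.o$//' |
        sort -u >"$obj_list"
    # Lines marked "<" exist only as source files; lines marked ">"
    # are named only in the Makefile.
    status=0
    diff "$src_list" "$obj_list" || status=$?
    rm -f "$src_list" "$obj_list"
    return $status
}
```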

5. The various stopword files in stopwords/ must be downloaded
individually from pages on the snowballstem.org website.
Be careful that these files must be stored in UTF-8 encoding.
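
The encoding can be checked mechanically. The sketch below is an assumption-laden example rather than an established part of this procedure: check_stopwords_utf8 is a made-up name, the files are assumed to be named *.stop, and it relies on iconv failing when its input is not valid in the stated source encoding.

```shell
# Hypothetical helper: report stopword files that are not valid UTF-8.
# Returns 0 only if every file passes.
check_stopwords_utf8() {
    stopwords_dir=$1
    bad=0
    for f in "$stopwords_dir"/*.stop
    do
        # Converting UTF-8 to UTF-8 changes nothing, but iconv exits
        # with a nonzero status on an invalid input sequence.
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1
        then
            echo "not valid UTF-8: $f"
            bad=1
        fi
    done
    return $bad
}
```

A file that turns out to be in another encoding, say LATIN1, can be converted with a command such as "iconv -f ISO-8859-1 -t UTF-8", provided the source encoding is actually known.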