Commit Graph

57 Commits

Author SHA1 Message Date
Tom Lane e969f1ae2b In the Snowball dictionary, don't try to stem excessively-long words.
If the input word exceeds 1000 bytes, don't pass it to the stemmer;
just return it as-is after case folding.  Such an input is surely
not a word in any human language, so whatever the stemmer might
do to it would be pretty dubious in the first place.  Adding this
restriction protects us against a known recursion-to-stack-overflow
problem in the Turkish stemmer, and it seems like good insurance
against any other safety or performance issues that may exist in
the Snowball stemmers.  (I note, for example, that they contain no
CHECK_FOR_INTERRUPTS calls, so we really don't want them running
for a long time.)  The threshold of 1000 bytes is arbitrary.

An alternative definition could have been to treat such words as
stopwords, but that seems like a bigger break from the old behavior.

Per report from Egor Chindyaskin and Alexander Lakhin.
Thanks to Olly Betts for the recommendation to fix it this way.

Discussion: https://postgr.es/m/1661334672.728714027@f473.i.mail.ru
2022-08-31 10:42:05 -04:00
Peter Eisentraut 678d0e239b Update snowball
Update to snowball tag v2.1.0.  Major changes are new stemmers for
Armenian, Serbian, and Yiddish.
2021-02-19 08:10:15 +01:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Andres Freund a9a4a7ad56 code: replace most remaining uses of 'master'.
Author: Andres Freund
Reviewed-By: David Steele
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
2020-07-08 13:24:35 -07:00
Peter Eisentraut cbcc8726bb Update snowball
Update to snowball tag v2.0.0.  Major changes are new stemmers for
Basque, Catalan, and Hindi.

Discussion: https://www.postgresql.org/message-id/flat/a8eeabd6-2be1-43fe-401e-a97594c38478%402ndquadrant.com
2020-06-08 08:07:15 +02:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Andres Freund 01368e5d9d Split all OBJS style lines in makefiles into one-line-per-entry style.
When maintaining or merging patches, one of the most common sources
for conflicts are the list of objects in makefiles. Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempting to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.

By splitting, and alphabetically sorting, OBJS style lines into one
object per line, conflicts should be less frequent, and easier to
resolve when they still occur.

Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
2019-11-05 14:41:07 -08:00
Peter Eisentraut 7b925e1270 Sync our Snowball stemmer dictionaries with current upstream
The main change is a new stemmer for Greek.  There are minor changes
in the Danish and French stemmers.

Author: Panagiotis Mavrogiorgos <pmav99@gmail.com>
2019-07-04 13:26:48 +02:00
Peter Eisentraut dedb6e0143 Clean up whitespace a bit 2019-07-04 13:26:48 +02:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Tom Lane fd582317e1 Sync our Snowball stemmer dictionaries with current upstream.
We haven't touched these since text search functionality landed in core
in 2007 :-(.  While the upstream project isn't a beehive of activity,
they do make additions and bug fixes from time to time.  Update our
copies of these files.

Also update our documentation about how to keep things in sync, since
they're not making distribution tarballs these days.  Fortunately,
their source code turns out to be a breeze to build.

Notable changes:

* The non-UTF8 version of the hungarian stemmer now works in LATIN2
not LATIN1.

* New stemmers have appeared for arabic, indonesian, irish, lithuanian,
nepali, and tamil.  These all work in UTF8, and the indonesian and
irish ones also work in LATIN1.

(There are some new stemmers that I did not incorporate, mainly because
their names don't match the underlying languages, suggesting that they're
not to be considered mainstream.)

Worth noting: the upstream Nepali dictionary was contributed by
Arthur Zakirov.

initdb forced because the contents of snowball_create.sql have
changed.

Still TODO: see about updating the stopword lists.

Arthur Zakirov, minor mods and doc work by me

Discussion: https://postgr.es/m/20180626122025.GA12647@zakirov.localdomain
Discussion: https://postgr.es/m/20180219140849.GA9050@zakirov.localdomain
2018-09-24 17:29:38 -04:00
Tom Lane fb8697b31a Avoid unnecessary use of pg_strcasecmp for already-downcased identifiers.
We have a lot of code in which option names, which from the user's
viewpoint are logically keywords, are passed through the grammar as plain
identifiers, and then matched to string literals during command execution.
This approach avoids making words into lexer keywords unnecessarily.  Some
places matched these strings using plain strcmp, some using pg_strcasecmp.
But the latter should be unnecessary since identifiers would have been
downcased on their way through the parser.  Aside from any efficiency
concerns (probably not a big factor), the lack of consistency in this area
creates a hazard of subtle bugs due to different places coming to different
conclusions about whether two option names are the same or different.
Hence, standardize on using strcmp() to match any option names that are
expected to have been fed through the parser.

This does create a user-visible behavioral change, which is that while
formerly all of these would work:
	alter table foo set (fillfactor = 50);
	alter table foo set (FillFactor = 50);
	alter table foo set ("fillfactor" = 50);
	alter table foo set ("FillFactor" = 50);
now the last case will fail because that double-quoted identifier is
different from the others.  However, none of our documentation says that
you can use a quoted identifier in such contexts at all, and we should
discourage doing so since it would break if we ever decide to parse such
constructs as true lexer keywords rather than poor man's substitutes.
So this shouldn't create a significant compatibility issue for users.

Daniel Gustafsson, reviewed by Michael Paquier, small changes by me

Discussion: https://postgr.es/m/29405B24-564E-476B-98C0-677A29805B84@yesql.se
2018-01-26 18:25:14 -05:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Peter Eisentraut 0e1539ba0d Add some const decorations to prototypes
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
2017-11-10 13:38:57 -05:00
Tom Lane e3860ffa4d Initial pgindent run with pg_bsd_indent version 2.0.
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:

* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
  sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
  well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
  with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
  than the expected column 33.

On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list.  This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.

There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses.  I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 14:39:04 -04:00
Peter Eisentraut 94c2ed0ebe Add ICU_CFLAGS to global CPPFLAGS
The original code only added ICU_CFLAGS to the backend build.  But it is
also needed for building external modules that include pg_locale.h.  So
add it to the global CPPFLAGS.  (This is only relevant if ICU is not in
a compiler default path, so it apparently hasn't bitten many.)
2017-06-12 15:57:22 -04:00
Peter Eisentraut 2e0c17dc78 Add ICU_FLAGS to one more place
Reported-by: Thomas Munro <thomas.munro@enterprisedb.com>
2017-03-23 16:53:10 -04:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Tom Lane 66d947b9d3 Adjust behavior of single-user -j mode for better initdb error reporting.
Previously, -j caused the entire input file to be read in and executed as
a single command string.  That's undesirable, not least because any error
causes the entire file to be regurgitated as the "failing query".  Some
experimentation suggests a better rule: end the command string when we see
a semicolon immediately followed by two newlines, ie, an empty line after
a query.  This serves nicely to break up the existing examples such as
information_schema.sql and system_views.sql.  A limitation is that it's
no longer possible to write such a sequence within a string literal or
multiline comment in a file meant to be read with -j; but there are no
instances of such a problem within the data currently used by initdb.
(If someone does make such a mistake in future, it'll be obvious because
they'll get an unterminated-literal or unterminated-comment syntax error.)
Other than that, there shouldn't be any negative consequences; you're not
forced to end statements that way, it's just a better idea in most cases.

In passing, remove src/include/tcop/tcopdebug.h, which is dead code
because it's not included anywhere, and hasn't been for more than
ten years.  One of the debug-support symbols it purported to describe
has been unreferenced for at least the same amount of time, and the
other is removed by this commit on the grounds that it was useless:
forcing -j mode all the time would have broken initdb.  The lack of
complaints about that, or about the missing inclusion, shows that
no one has tried to use TCOP_DONTUSENEWLINE in many years.
2015-12-17 19:34:15 -05:00
Tom Lane 91e79260f6 Remove no-longer-required function declarations.
Remove a bunch of "extern Datum foo(PG_FUNCTION_ARGS);" declarations that
are no longer needed now that PG_FUNCTION_INFO_V1(foo) provides that.

Some of these were evidently missed in commit e7128e8dbb, but others
were cargo-culted in in code added since then.  Possibly that can be blamed
in part on the fact that we'd not fixed relevant documentation examples,
which I've now done.
2015-05-24 12:20:23 -04:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Noah Misch ee9569e4df Finish adding file version information to installed Windows binaries.
In support of this, have the MSVC build follow GNU make in preferring
GNUmakefile over Makefile when a directory contains both.

Michael Paquier, reviewed by MauMau.
2014-08-18 22:59:53 -04:00
Bruce Momjian 6a605cd6bd Adjust blank lines around PG_MODULE_MAGIC defines, for consistency
Report by Robert Haas
2014-07-10 14:02:08 -04:00
Tom Lane fd90b5d574 Fix ancient encoding error in hungarian.stop.
When we grabbed this file off the Snowball project's website, we mistakenly
supposed that it was in LATIN1 encoding, but evidently it was actually in
LATIN2.  This resulted in ő (o-double-acute, U+0151, which is code 0xF5 in
LATIN2) being misconverted into õ (o-tilde, U+00F5), as complained of in
bug #10589 from Zoltán Sörös.  We'd have messed up u-double-acute too,
but there aren't any of those in the file.  Other characters used in the
file have the same codes in LATIN1 and LATIN2, which no doubt helped hide
the problem for so long.

The error is not only ours: the Snowball project also was confused about
which encoding is required for Hungarian.  But dealing with that will
require source-code changes that I'm not at all sure we'll wish to
back-patch.  Fixing the stopword file seems reasonably safe to back-patch
however.
2014-06-10 22:48:16 -04:00
Tom Lane 769065c1b2 Prefer pg_any_to_server/pg_server_to_any over pg_do_encoding_conversion.
A large majority of the callers of pg_do_encoding_conversion were
specifying the database encoding as either source or target of the
conversion, meaning that we can use the less general functions
pg_any_to_server/pg_server_to_any instead.

The main advantage of using the latter functions is that they can make use
of a cached conversion-function lookup in the common case that the other
encoding is the current client_encoding.  It's notationally cleaner too in
most cases, not least because of the historical artifact that the latter
functions use "char *" rather than "unsigned char *" in their APIs.

Note that pg_any_to_server will apply an encoding verification step in
some cases where pg_do_encoding_conversion would have just done nothing.
This seems to me to be a good idea at most of these call sites, though
it partially negates the performance benefit.

Per discussion of bug #9210.
2014-02-23 16:59:05 -05:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Bruce Momjian 381a9ed66d Remove configure flag --disable-shared, as it is no longer used by any
port.  The last use was QNX, per Peter Eisentraut.
2012-08-30 16:26:53 -04:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Bruce Momjian 6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Peter Eisentraut fc946c39ae Remove useless whitespace at end of lines 2010-11-23 22:34:55 +02:00
Magnus Hagander fe9b36fd59 Convert cvsignore to gitignore, and add .gitignore for build targets. 2010-09-22 12:57:04 +02:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Peter Eisentraut 3f11971916 Remove extra newlines at end and beginning of files, add missing newlines
at end of files.
2010-08-19 05:57:36 +00:00
Bruce Momjian 0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Peter Eisentraut 234c7ce9f2 Derived files that are shipped in the distribution used to be built in the
source directory even for out-of-tree builds.  They are now alsl built in
the build tree.  This should be more convenient for certain developers'
workflows, and shouldn't really break anything else.
2009-08-28 20:26:19 +00:00
Peter Eisentraut 9d182ef002 Update of install-sh, mkinstalldirs, and associated configury
Update install-sh to that from Autoconf 2.63, plus our Darwin-specific
changes (which I simplified a bit).  install-sh is now able to install
multiple files in one run, so we could simplify our makefiles sometime.

install-sh also now has a -d option to create directories, so we don't need
mkinstalldirs anymore.

Use AC_PROG_MKDIR_P in configure.in, so we can use mkdir -p when available
instead of install-sh -d.  For consistency with the rest of the world,
the corresponding make variable has been renamed from $(mkinstalldirs) to
$(MKDIR_P).
2009-08-26 22:24:44 +00:00
Bruce Momjian 511db38ace Update copyright for 2009. 2009-01-01 17:24:05 +00:00
Tom Lane 2b74d45c1b pg_do_encoding_conversion cannot return NULL (at least not unless the input
is NULL), so remove some useless tests for the case.
2008-11-10 15:18:40 +00:00
Peter Eisentraut 46e76373ec Implement a few changes to how shared libraries and dynamically loadable
modules are built.  Foremost, it creates a solid distinction between these two
types of targets based on what had already been implemented and duplicated in
ad hoc ways before.  Specifically,

- Dynamically loadable modules no longer get a soname.  The numbers previously
set in the makefiles were dummy numbers anyway, and the presence of a soname
upset a few packaging tools, so it is nicer not to have one.

- The cumbersome detour taken on installation (build a libfoo.so.0.0.0 and
then override the rule to install foo.so instead) is removed.

- Lots of duplicated code simplified.
2008-04-07 14:15:58 +00:00
Bruce Momjian fca9fff41b More README src cleanups. 2008-03-21 13:23:29 +00:00
Bruce Momjian 4e228447aa Make source code READMEs more consistent. Add CVS tags to all README files. 2008-03-20 17:55:15 +00:00
Peter Eisentraut 8c87cc370f Catch all errors in for and while loops in makefiles. Don't ignore any
errors in any commands, including in various clean targets that have so far
been handled inconsistently.  make -i is available to ignore all errors in
a consistent and official way.
2008-03-18 16:24:50 +00:00
Bruce Momjian 9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Bruce Momjian f6e8730d11 Re-run pgindent with updated list of typedefs. (Updated README should
avoid this problem in the future.)
2007-11-15 22:25:18 +00:00
Bruce Momjian fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane d015d08b43 Rename default text search parser's "uri" token type to "url_path",
per recommendation from Alvaro.  This doesn't force initdb since the
numeric token type in the catalogs doesn't change; but note that
the expected regression test output changed.
2007-10-27 16:01:09 +00:00
Tom Lane dbaec70c15 Rename and slightly redefine the default text search parser's "word"
categories, as per discussion.  asciiword (formerly lword) is still
ASCII-letters-only, and numword (formerly word) is still the most general
mixed-alpha-and-digits case.  But word (formerly nlword) is now
any-group-of-letters-with-at-least-one-non-ASCII, rather than all-non-ASCII as
before.  This is no worse than before for parsing mixed Russian/English text,
which seems to have been the design center for the original coding; and it
should simplify matters for parsing most European languages.  In particular
it will not be necessary for any language to accept strings containing digits
as being regular "words".  The hyphenated-word categories are adjusted
similarly.
2007-10-23 20:46:12 +00:00