Commit Graph

100 Commits

Author SHA1 Message Date
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Peter Eisentraut fc946c39ae Remove useless whitespace at end of lines 2010-11-23 22:34:55 +02:00
Robert Haas 5aa446c961 Cleanup various comparisons with the constant "true".
Itagaki Takahiro, with slight modifications.
2010-11-14 21:03:48 -05:00
Tom Lane 3e5f9412d0 Reduce the memory requirement for large ispell dictionaries.
This patch eliminates per-chunk palloc overhead for most small allocations
needed in the representation of an ispell dictionary.  This saves close to
a factor of 2 on the current Czech ispell data.  While it doesn't cover
every last small allocation in the ispell code, we are at the point of
diminishing returns, because about 95% of the allocations are covered
already.

Pavel Stehule, rather heavily revised by Tom
2010-10-06 19:31:05 -04:00
Tom Lane 9b910def24 Clean up temporary-memory management during ispell dictionary loading.
Add explicit initialization and cleanup functions to spell.c, and keep
all working state in the already-existing ISpellDict struct.  This lets us
get rid of a static variable along with some extremely shaky assumptions
about usage of child memory contexts.

This commit is just code beautification and has no impact on functionality
or performance, but it opens the way to a less-grotty implementation of
Pavel's memory-saving hack, which will follow shortly.
2010-10-06 15:15:15 -04:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Peter Eisentraut 3f11971916 Remove extra newlines at end and beginning of files, add missing newlines
at end of files.
2010-08-19 05:57:36 +00:00
Robert Haas fd1843ff89 Standardize get_whatever_oid functions for other object types.
- Rename TSParserGetPrsid to get_ts_parser_oid.
- Rename TSDictionaryGetDictid to get_ts_dict_oid.
- Rename TSTemplateGetTmplid to get_ts_template_oid.
- Rename TSConfigGetCfgid to get_ts_config_oid.
- Rename FindConversionByName to get_conversion_oid.
- Rename GetConstraintName to get_constraint_oid.
- Add new functions get_opclass_oid, get_opfamily_oid, get_rewrite_oid,
  get_rewrite_oid_without_relid, get_trigger_oid, and get_cast_oid.

The name of each function matches the corresponding catalog.

Thanks to KaiGai Kohei for the review.
2010-08-05 15:25:36 +00:00
Tom Lane 97532f7c29 Add some knowledge about prefix matches to tsmatchsel(). It's not terribly
bright, but it beats assuming that a prefix match behaves identically to an
exact match, which is what the code was doing before :-(.  Noted while
experimenting with Artur Dobrowski's example.
2010-08-01 21:31:08 +00:00
Tom Lane b8c798ebc5 Tweak tsmatchsel() so that it examines the structure of the tsquery whenever
possible (ie, whenever the tsquery is a constant), even when no statistics
are available for the tsvector.  For example, foo @@ 'a & b'::tsquery
can be expected to be more selective than foo @@ 'a'::tsquery, whether
or not we know anything about foo.  We use DEFAULT_TS_MATCH_SEL as the assumed
selectivity of individual query terms when no stats are available, then
combine the terms according to the query's AND/OR structure as usual.

Per experimentation with Artur Dabrowski's example.  (The fact that there
are no stats available in that example is a problem in itself, but
nonetheless tsmatchsel should be smarter about the case.)

Back-patch to 8.4 to keep all versions of tsmatchsel() in sync.
2010-07-31 03:27:40 +00:00
Bruce Momjian 239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Tom Lane bc0f080928 Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats().
We must filter out hashtable entries with frequencies less than those
specified by the algorithm, else we risk emitting junk entries whose
actual frequency is much less than other lexemes that did not get
tabulated.  This is bad enough by itself, but even worse is that
tsquerysel() believes that the minimum frequency seen in pg_statistic is a
hard upper bound for lexemes not included, and was thus underestimating
the frequency of non-MCEs.

Also, set the threshold frequency to something with a little bit of theory
behind it, to wit assume that the input distribution is approximately
Zipfian.  This might need adjustment in future, but some preliminary
experiments suggest that it's not too unreasonable.

Back-patch to 8.4, where this code was introduced.

Jan Urbanski, with some editorialization by Tom
2010-05-30 21:59:02 +00:00
Tom Lane ed437e2b27 Adjust comments about avoiding use of printf's %.*s.
My initial impression that glibc was measuring the precision in characters
(which is what the Linux man page says it does) was incorrect.  It does take
the precision to be in bytes, but it also tries to truncate the string at a
character boundary.  The bottom line remains the same: it will mess up
if the string is not in the encoding it expects, so we need to avoid %.*s
anytime there's a significant risk of that.  Previous code changes are still
good, but adjust the comments to reflect this knowledge.  Per research by
Hernan Gonzalez.
2010-05-09 02:16:00 +00:00
Tom Lane 54cd4f0457 Work around a subtle portability problem in use of printf %s format.
Depending on which spec you read, field widths and precisions in %s may be
counted either in bytes or characters.  Our code was assuming bytes, which
is wrong at least for glibc's implementation, and in any case libc might
have a different idea of the prevailing encoding than we do.  Hence, for
portable results we must avoid using anything more complex than just "%s"
unless the string to be printed is known to be all-ASCII.

This patch fixes the cases I could find, including the psql formatting
failure reported by Hernan Gonzalez.  In HEAD only, I also added comments
to some places where it appears safe to continue using "%.*s".
2010-05-08 16:39:53 +00:00
Tom Lane 2c265adea3 Modify the built-in text search parser to handle URLs more nearly according
to RFC 3986.  In particular, these characters now terminate the path part
of a URL: '"', '<', '>', '\', '^', '`', '{', '|', '}'.  The previous behavior
was inconsistent and depended on whether a "?" was present in the path.
Per gripe from Donald Fraser and spec research by Kevin Grittner.

This is a pre-existing bug, but not back-patching since the risks of
breaking existing applications seem to outweigh the benefits.
2010-04-28 02:04:16 +00:00
Tom Lane 8f0ab2298f Add missing newlines in WPARSER_TRACE output. 2010-04-26 17:10:18 +00:00
Bruce Momjian 89b0095ebd Allow underscores in tsearch email addressses, per RFC 5322 and report
by Dan O'Hara.

Patch by Teodor Sigaev
2010-03-13 00:41:58 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Tom Lane 40608e7f94 When estimating the selectivity of an inequality "column > constant" or
"column < constant", and the comparison value is in the first or last
histogram bin or outside the histogram entirely, try to fetch the actual
column min or max value using an index scan (if there is an index on the
column).  If successful, replace the lower or upper histogram bound with
that value before carrying on with the estimate.  This limits the
estimation error caused by moving min/max values when the comparison
value is close to the min or max.  Per a complaint from Josh Berkus.

It is tempting to consider using this mechanism for mergejoinscansel as well,
but that would inject index fetches into main-line join estimation not just
endpoint cases.  I'm refraining from that until we can get a better handle
on the costs of doing this type of lookup.
2010-01-04 02:44:40 +00:00
Bruce Momjian 0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Tom Lane 21d11e7ee2 Avoid unnecessary copying of source string when generating a cloned TParser.
For long source strings the copying results in O(N^2) behavior, and the
multiplier can be significant if wide-char conversion is involved.

Andres Freund, reviewed by Kevin Grittner.
2009-12-15 20:37:17 +00:00
Tom Lane 908854209b Avoid core dump on empty thesaurus dictionary.
Per report from Robert Gravsjö.
2009-11-30 16:38:31 +00:00
Peter Eisentraut 66363e8d6d Make text search parser accept underscores in XML attributes (bug #5075) 2009-11-15 13:57:01 +00:00
Peter Eisentraut f1c5247563 Simplify a few makefile rules since install-sh can now install multiple
files in one run.
2009-10-26 21:33:01 +00:00
Tom Lane dd6de24e69 Remove duplicate variable initializations identified by clang static checker.
One of these represents a nontrivial bug (a promptly-leaked palloc), so
backpatch.

Greg Stark
2009-08-30 16:53:31 +00:00
Peter Eisentraut 9d182ef002 Update of install-sh, mkinstalldirs, and associated configury
Update install-sh to that from Autoconf 2.63, plus our Darwin-specific
changes (which I simplified a bit).  install-sh is now able to install
multiple files in one run, so we could simplify our makefiles sometime.

install-sh also now has a -d option to create directories, so we don't need
mkinstalldirs anymore.

Use AC_PROG_MKDIR_P in configure.in, so we can use mkdir -p when available
instead of install-sh -d.  For consistency with the rest of the world,
the corresponding make variable has been renamed from $(mkinstalldirs) to
$(MKDIR_P).
2009-08-26 22:24:44 +00:00
Teodor Sigaev a88a48011c Introduce filtering dictionary support to tsearch. Propagate --nolocale option
to CREATE DATABASE command in pg_regress to allow correct checking of
locale-sensitive contrib modules.
2009-08-18 10:30:41 +00:00
Teodor Sigaev abd8c94ff9 Add prefix support for synonym dictionary 2009-08-14 14:53:20 +00:00
Peter Eisentraut de160e2c00 Make backend header files C++ safe
This alters various incidental uses of C++ key words to use other similar
identifiers, so that a C++ compiler won't choke outright.  You still
(probably) need extern "C" { }; around the inclusion of backend headers.

based on a patch by Kurt Harriman <harriman@acm.org>

Also add a script cpluspluscheck to check for C++ compatibility in the
future.  As of right now, this passes without error for me.
2009-07-16 06:33:46 +00:00
Bruce Momjian d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Tom Lane a734979e0a Fix tsquerysel() to not fail on an empty TSQuery. Per report from
Tatsuo Ishii.
2009-06-03 18:42:13 +00:00
Teodor Sigaev e43bb5beb7 Some languages have symbols with zero display's width or/and vowels/signs which
are not an alphabetic character although they are not word-breakers too.
So, treat them as part of word.

Per off-list discussion with Dibyendra Hyoju <dibyendra@gmail.com> and
and Bal Krishna Bal <balkrishna7bal@gmail.com> about Nepali language and
Devanagari alphabet.
2009-03-11 16:03:40 +00:00
Teodor Sigaev 42831729f7 Prevent recursion during parse of email-like string with multiple '@'.
Patch by Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
2009-03-10 17:32:14 +00:00
Teodor Sigaev 32032d42b5 Fix usage of char2wchar/wchar2char. Changes:
- pg_wchar and wchar_t could have different size, so char2wchar
  doesn't call pg_mb2wchar_with_len to prevent out-of-bound
  memory bug
- make char2wchar/wchar2char symmetric, now they should not be
  called with C-locale because mbstowcs/wcstombs oftenly doesn't
  work correct with C-locale.
- Text parser uses pg_mb2wchar_with_len directly in case of
  C-locale and multibyte encoding

Per bug report by Hiroshi Inoue <inoue@tpf.co.jp> and
following discussion.

Backpatch up to 8.2 when multybyte support was implemented in tsearch.
2009-03-02 15:10:09 +00:00
Teodor Sigaev b5b3134813 Fix incorrect dereferencing of char* to array's index.
Per Tommy Gildseth <tommy.gildseth@usit.uio.no> report
2009-01-29 16:22:10 +00:00
Teodor Sigaev 41d17e042b Fix URL generation in headline. Only tag lexeme will be replaced by space.
Per http://archives.postgresql.org/pgsql-bugs/2008-12/msg00013.php
2009-01-15 16:33:59 +00:00
Teodor Sigaev 8fd07a35ba Fix generation too long headline with ShortWords.
Per http://archives.postgresql.org/pgsql-hackers/2008-09/msg01088.php
2009-01-15 16:33:28 +00:00
Bruce Momjian 511db38ace Update copyright for 2009. 2009-01-01 17:24:05 +00:00
Tom Lane 301194f8ea Reduce the scaling factor for attstattarget to number-of-lexemes from 100
to 10, to compensate for the recent change in default statistics target.
The original number was pulled out of the air anyway :-(, but it was picked
in the context of the old default, so holding the default size of the
MCELEM array constant seems the best thing.  Per discussion.
2008-12-15 15:06:31 +00:00
Tom Lane 65e3ea7641 Increase the default value of default_statistics_target from 10 to 100,
and its maximum value from 1000 to 10000.  ALTER TABLE SET STATISTICS
similarly now allows a value up to 10000.  Per discussion.
2008-12-13 19:13:44 +00:00
Heikki Linnakangas a93b3b98cd Fix bug in the tsvector stats collection function, which caused a crash if
the sample contains just a one tsvector, containing only one lexeme.
2008-11-27 21:17:39 +00:00
Tom Lane 2b74d45c1b pg_do_encoding_conversion cannot return NULL (at least not unless the input
is NULL), so remove some useless tests for the case.
2008-11-10 15:18:40 +00:00
Teodor Sigaev 2a0083ede8 Improve headeline generation. Now headline can contain
several fragments a-la Google.

Sushant Sinha <sushant354@gmail.com>
2008-10-17 18:05:19 +00:00
Teodor Sigaev 906b7e5f6c Fix small bug in headline generation.
Patch from Sushant Sinha <sushant354@gmail.com>
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00785.php
2008-10-17 17:27:46 +00:00
Tom Lane 4e57668da4 Create a selectivity estimation function for the text search @@ operator.
Jan Urbanski
2008-09-19 19:03:41 +00:00
Tom Lane 6f6d863258 Create a type-specific typanalyze routine for tsvector, which collects stats
on the most common individual lexemes in place of the mostly-useless default
behavior of counting duplicate tsvectors.  Future work: create selectivity
estimation functions that actually do something with these stats.

(Some other things we ought to look at doing: using the Lossy Counting
algorithm in compute_minimal_stats, and using the element-counting idea for
stats on regular arrays.)

Jan Urbanski
2008-07-14 00:51:46 +00:00
Tom Lane 30dc388a0d Fix a few places that were non-multibyte-safe in tsearch configuration file
parsing.  Per bug #4253 from Giorgio Valoti.
2008-06-19 16:52:24 +00:00
Tom Lane fbeb9da22b Improve error reporting for problems in text search configuration files
by installing an error context subroutine that will provide the file name
and line number for all errors detected while reading a config file.
Some of the reader routines were already doing that in an ad-hoc way for
errors detected directly in the reader, but it didn't help for problems
detected in subroutines, such as encoding violations.

Back-patch to 8.3 because 8.3 is where people will be trying to debug
configuration files.
2008-06-18 20:55:42 +00:00
Bruce Momjian 9de09c087d Move wchar2char() and char2wchar() from tsearch into /mb to be easier to
use for other modules;  also move pnstrdup().

Clean up code slightly.
2008-06-18 18:42:54 +00:00
Bruce Momjian dc69c0362f Move USE_WIDE_UPPER_LOWER define to c.h, and remove TS_USE_WIDE and use
USE_WIDE_UPPER_LOWER instead.
2008-06-17 16:09:06 +00:00