postgresql

Commit Graph

Author	SHA1	Message	Date
Tom Lane	1e16a8107d	Teach regular expression operators to honor collations. This involves getting the character classification and case-folding functions in the regex library to use the collations infrastructure. Most of this work had been done already in connection with the upper/lower and LIKE logic, so it was a simple matter of transposition. While at it, split out these functions into a separate source file regc_pg_locale.c, so that they can be correctly labeled with the Postgres project's license rather than the Scriptics license. These functions are 100% Postgres-written code whereas what remains in regc_locale.c is still mostly not ours, so lumping them both under the same copyright notice was getting more and more misleading.	2011-04-10 18:03:09 -04:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Tom Lane	52b60530f2	Fix tsmatchsel() to account properly for null rows. ts_typanalyze.c computes MCE statistics as fractions of the non-null rows, which seems fairly reasonable, and anyway changing it in released versions wouldn't be a good idea. But then ts_selfuncs.c has to account for that. Failure to do so results in overestimates in columns with a significant fraction of null documents. Back-patch to 8.4 where this stuff was introduced. Jesper Krogh	2011-02-17 19:00:49 -05:00
Bruce Momjian	135724ec35	Fix "variable not used" warnings when USE_WIDE_UPPER_LOWER is not defined.	2011-02-10 16:58:02 -05:00
Peter Eisentraut	414c5a2ea6	Per-column collation support This adds collation support for columns and domains, a COLLATE clause to override it per expression, and B-tree index support. Peter Eisentraut reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch	2011-02-08 23:04:18 +02:00
Bruce Momjian	97116ca417	Rename macro DECIMAL to DECIMAL_T to help pgindent; this is already done for a few other macros in that file, for other reasons. I also remove pgindent/README mention of the file.	2011-02-06 10:48:17 -05:00
Bruce Momjian	5d950e3b0c	Stamp copyrights for year 2011.	2011-01-01 13:18:15 -05:00
Peter Eisentraut	fc946c39ae	Remove useless whitespace at end of lines	2010-11-23 22:34:55 +02:00
Robert Haas	5aa446c961	Cleanup various comparisons with the constant "true". Itagaki Takahiro, with slight modifications.	2010-11-14 21:03:48 -05:00
Tom Lane	3e5f9412d0	Reduce the memory requirement for large ispell dictionaries. This patch eliminates per-chunk palloc overhead for most small allocations needed in the representation of an ispell dictionary. This saves close to a factor of 2 on the current Czech ispell data. While it doesn't cover every last small allocation in the ispell code, we are at the point of diminishing returns, because about 95% of the allocations are covered already. Pavel Stehule, rather heavily revised by Tom	2010-10-06 19:31:05 -04:00
Tom Lane	9b910def24	Clean up temporary-memory management during ispell dictionary loading. Add explicit initialization and cleanup functions to spell.c, and keep all working state in the already-existing ISpellDict struct. This lets us get rid of a static variable along with some extremely shaky assumptions about usage of child memory contexts. This commit is just code beautification and has no impact on functionality or performance, but it opens the way to a less-grotty implementation of Pavel's memory-saving hack, which will follow shortly.	2010-10-06 15:15:15 -04:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Peter Eisentraut	3f11971916	Remove extra newlines at end and beginning of files, add missing newlines at end of files.	2010-08-19 05:57:36 +00:00
Robert Haas	fd1843ff89	Standardize get_whatever_oid functions for other object types. - Rename TSParserGetPrsid to get_ts_parser_oid. - Rename TSDictionaryGetDictid to get_ts_dict_oid. - Rename TSTemplateGetTmplid to get_ts_template_oid. - Rename TSConfigGetCfgid to get_ts_config_oid. - Rename FindConversionByName to get_conversion_oid. - Rename GetConstraintName to get_constraint_oid. - Add new functions get_opclass_oid, get_opfamily_oid, get_rewrite_oid, get_rewrite_oid_without_relid, get_trigger_oid, and get_cast_oid. The name of each function matches the corresponding catalog. Thanks to KaiGai Kohei for the review.	2010-08-05 15:25:36 +00:00
Tom Lane	97532f7c29	Add some knowledge about prefix matches to tsmatchsel(). It's not terribly bright, but it beats assuming that a prefix match behaves identically to an exact match, which is what the code was doing before :-(. Noted while experimenting with Artur Dobrowski's example.	2010-08-01 21:31:08 +00:00
Tom Lane	b8c798ebc5	Tweak tsmatchsel() so that it examines the structure of the tsquery whenever possible (ie, whenever the tsquery is a constant), even when no statistics are available for the tsvector. For example, foo @@ 'a & b'::tsquery can be expected to be more selective than foo @@ 'a'::tsquery, whether or not we know anything about foo. We use DEFAULT_TS_MATCH_SEL as the assumed selectivity of individual query terms when no stats are available, then combine the terms according to the query's AND/OR structure as usual. Per experimentation with Artur Dabrowski's example. (The fact that there are no stats available in that example is a problem in itself, but nonetheless tsmatchsel should be smarter about the case.) Back-patch to 8.4 to keep all versions of tsmatchsel() in sync.	2010-07-31 03:27:40 +00:00
Bruce Momjian	239d769e7e	pgindent run for 9.0, second run	2010-07-06 19:19:02 +00:00
Tom Lane	bc0f080928	Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats(). We must filter out hashtable entries with frequencies less than those specified by the algorithm, else we risk emitting junk entries whose actual frequency is much less than other lexemes that did not get tabulated. This is bad enough by itself, but even worse is that tsquerysel() believes that the minimum frequency seen in pg_statistic is a hard upper bound for lexemes not included, and was thus underestimating the frequency of non-MCEs. Also, set the threshold frequency to something with a little bit of theory behind it, to wit assume that the input distribution is approximately Zipfian. This might need adjustment in future, but some preliminary experiments suggest that it's not too unreasonable. Back-patch to 8.4, where this code was introduced. Jan Urbanski, with some editorialization by Tom	2010-05-30 21:59:02 +00:00
Tom Lane	ed437e2b27	Adjust comments about avoiding use of printf's %.s. My initial impression that glibc was measuring the precision in characters (which is what the Linux man page says it does) was incorrect. It does take the precision to be in bytes, but it also tries to truncate the string at a character boundary. The bottom line remains the same: it will mess up if the string is not in the encoding it expects, so we need to avoid %.s anytime there's a significant risk of that. Previous code changes are still good, but adjust the comments to reflect this knowledge. Per research by Hernan Gonzalez.	2010-05-09 02:16:00 +00:00
Tom Lane	54cd4f0457	Work around a subtle portability problem in use of printf %s format. Depending on which spec you read, field widths and precisions in %s may be counted either in bytes or characters. Our code was assuming bytes, which is wrong at least for glibc's implementation, and in any case libc might have a different idea of the prevailing encoding than we do. Hence, for portable results we must avoid using anything more complex than just "%s" unless the string to be printed is known to be all-ASCII. This patch fixes the cases I could find, including the psql formatting failure reported by Hernan Gonzalez. In HEAD only, I also added comments to some places where it appears safe to continue using "%.*s".	2010-05-08 16:39:53 +00:00
Tom Lane	2c265adea3	Modify the built-in text search parser to handle URLs more nearly according to RFC 3986. In particular, these characters now terminate the path part of a URL: '"', '<', '>', '\', '^', '`', '{', '\|', '}'. The previous behavior was inconsistent and depended on whether a "?" was present in the path. Per gripe from Donald Fraser and spec research by Kevin Grittner. This is a pre-existing bug, but not back-patching since the risks of breaking existing applications seem to outweigh the benefits.	2010-04-28 02:04:16 +00:00
Tom Lane	8f0ab2298f	Add missing newlines in WPARSER_TRACE output.	2010-04-26 17:10:18 +00:00
Bruce Momjian	89b0095ebd	Allow underscores in tsearch email addresses, per RFC 5322 and report by Dan O'Hara. Patch by Teodor Sigaev	2010-03-13 00:41:58 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Tom Lane	40608e7f94	When estimating the selectivity of an inequality "column > constant" or "column < constant", and the comparison value is in the first or last histogram bin or outside the histogram entirely, try to fetch the actual column min or max value using an index scan (if there is an index on the column). If successful, replace the lower or upper histogram bound with that value before carrying on with the estimate. This limits the estimation error caused by moving min/max values when the comparison value is close to the min or max. Per a complaint from Josh Berkus. It is tempting to consider using this mechanism for mergejoinscansel as well, but that would inject index fetches into main-line join estimation not just endpoint cases. I'm refraining from that until we can get a better handle on the costs of doing this type of lookup.	2010-01-04 02:44:40 +00:00
Bruce Momjian	0239800893	Update copyright for the year 2010.	2010-01-02 16:58:17 +00:00
Tom Lane	21d11e7ee2	Avoid unnecessary copying of source string when generating a cloned TParser. For long source strings the copying results in O(N^2) behavior, and the multiplier can be significant if wide-char conversion is involved. Andres Freund, reviewed by Kevin Grittner.	2009-12-15 20:37:17 +00:00
Tom Lane	908854209b	Avoid core dump on empty thesaurus dictionary. Per report from Robert Gravsjö.	2009-11-30 16:38:31 +00:00
Peter Eisentraut	66363e8d6d	Make text search parser accept underscores in XML attributes (bug #5075)	2009-11-15 13:57:01 +00:00
Peter Eisentraut	f1c5247563	Simplify a few makefile rules since install-sh can now install multiple files in one run.	2009-10-26 21:33:01 +00:00
Tom Lane	dd6de24e69	Remove duplicate variable initializations identified by clang static checker. One of these represents a nontrivial bug (a promptly-leaked palloc), so backpatch. Greg Stark	2009-08-30 16:53:31 +00:00
Peter Eisentraut	9d182ef002	Update of install-sh, mkinstalldirs, and associated configury Update install-sh to that from Autoconf 2.63, plus our Darwin-specific changes (which I simplified a bit). install-sh is now able to install multiple files in one run, so we could simplify our makefiles sometime. install-sh also now has a -d option to create directories, so we don't need mkinstalldirs anymore. Use AC_PROG_MKDIR_P in configure.in, so we can use mkdir -p when available instead of install-sh -d. For consistency with the rest of the world, the corresponding make variable has been renamed from $(mkinstalldirs) to $(MKDIR_P).	2009-08-26 22:24:44 +00:00
Teodor Sigaev	a88a48011c	Introduce filtering dictionary support to tsearch. Propagate --nolocale option to CREATE DATABASE command in pg_regress to allow correct checking of locale-sensitive contrib modules.	2009-08-18 10:30:41 +00:00
Teodor Sigaev	abd8c94ff9	Add prefix support for synonym dictionary	2009-08-14 14:53:20 +00:00
Peter Eisentraut	de160e2c00	Make backend header files C++ safe This alters various incidental uses of C++ key words to use other similar identifiers, so that a C++ compiler won't choke outright. You still (probably) need extern "C" { }; around the inclusion of backend headers. based on a patch by Kurt Harriman <harriman@acm.org> Also add a script cpluspluscheck to check for C++ compatibility in the future. As of right now, this passes without error for me.	2009-07-16 06:33:46 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Tom Lane	a734979e0a	Fix tsquerysel() to not fail on an empty TSQuery. Per report from Tatsuo Ishii.	2009-06-03 18:42:13 +00:00
Teodor Sigaev	e43bb5beb7	Some languages have symbols with zero display's width or/and vowels/signs which are not an alphabetic character although they are not word-breakers too. So, treat them as part of word. Per off-list discussion with Dibyendra Hyoju <dibyendra@gmail.com> and Bal Krishna Bal <balkrishna7bal@gmail.com> about Nepali language and Devanagari alphabet.	2009-03-11 16:03:40 +00:00
Teodor Sigaev	42831729f7	Prevent recursion during parse of email-like string with multiple '@'. Patch by Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>	2009-03-10 17:32:14 +00:00
Teodor Sigaev	32032d42b5	Fix usage of char2wchar/wchar2char. Changes: - pg_wchar and wchar_t could have different size, so char2wchar doesn't call pg_mb2wchar_with_len to prevent out-of-bound memory bug - make char2wchar/wchar2char symmetric, now they should not be called with C-locale because mbstowcs/wcstombs oftenly doesn't work correct with C-locale. - Text parser uses pg_mb2wchar_with_len directly in case of C-locale and multibyte encoding Per bug report by Hiroshi Inoue <inoue@tpf.co.jp> and following discussion. Backpatch up to 8.2 when multybyte support was implemented in tsearch.	2009-03-02 15:10:09 +00:00
Teodor Sigaev	b5b3134813	Fix incorrect dereferencing of char* to array's index. Per Tommy Gildseth <tommy.gildseth@usit.uio.no> report	2009-01-29 16:22:10 +00:00
Teodor Sigaev	41d17e042b	Fix URL generation in headline. Only tag lexeme will be replaced by space. Per http://archives.postgresql.org/pgsql-bugs/2008-12/msg00013.php	2009-01-15 16:33:59 +00:00
Teodor Sigaev	8fd07a35ba	Fix generation too long headline with ShortWords. Per http://archives.postgresql.org/pgsql-hackers/2008-09/msg01088.php	2009-01-15 16:33:28 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Tom Lane	301194f8ea	Reduce the scaling factor for attstattarget to number-of-lexemes from 100 to 10, to compensate for the recent change in default statistics target. The original number was pulled out of the air anyway :-(, but it was picked in the context of the old default, so holding the default size of the MCELEM array constant seems the best thing. Per discussion.	2008-12-15 15:06:31 +00:00
Tom Lane	65e3ea7641	Increase the default value of default_statistics_target from 10 to 100, and its maximum value from 1000 to 10000. ALTER TABLE SET STATISTICS similarly now allows a value up to 10000. Per discussion.	2008-12-13 19:13:44 +00:00
Heikki Linnakangas	a93b3b98cd	Fix bug in the tsvector stats collection function, which caused a crash if the sample contains just a one tsvector, containing only one lexeme.	2008-11-27 21:17:39 +00:00
Tom Lane	2b74d45c1b	pg_do_encoding_conversion cannot return NULL (at least not unless the input is NULL), so remove some useless tests for the case.	2008-11-10 15:18:40 +00:00
Teodor Sigaev	2a0083ede8	Improve headline generation. Now headline can contain several fragments a-la Google. Sushant Sinha <sushant354@gmail.com>	2008-10-17 18:05:19 +00:00
Teodor Sigaev	906b7e5f6c	Fix small bug in headline generation. Patch from Sushant Sinha <sushant354@gmail.com> http://archives.postgresql.org/pgsql-hackers/2008-07/msg00785.php	2008-10-17 17:27:46 +00:00
Tom Lane	4e57668da4	Create a selectivity estimation function for the text search @@ operator. Jan Urbanski	2008-09-19 19:03:41 +00:00
Tom Lane	6f6d863258	Create a type-specific typanalyze routine for tsvector, which collects stats on the most common individual lexemes in place of the mostly-useless default behavior of counting duplicate tsvectors. Future work: create selectivity estimation functions that actually do something with these stats. (Some other things we ought to look at doing: using the Lossy Counting algorithm in compute_minimal_stats, and using the element-counting idea for stats on regular arrays.) Jan Urbanski	2008-07-14 00:51:46 +00:00
Tom Lane	30dc388a0d	Fix a few places that were non-multibyte-safe in tsearch configuration file parsing. Per bug #4253 from Giorgio Valoti.	2008-06-19 16:52:24 +00:00
Tom Lane	fbeb9da22b	Improve error reporting for problems in text search configuration files by installing an error context subroutine that will provide the file name and line number for all errors detected while reading a config file. Some of the reader routines were already doing that in an ad-hoc way for errors detected directly in the reader, but it didn't help for problems detected in subroutines, such as encoding violations. Back-patch to 8.3 because 8.3 is where people will be trying to debug configuration files.	2008-06-18 20:55:42 +00:00
Bruce Momjian	9de09c087d	Move wchar2char() and char2wchar() from tsearch into /mb to be easier to use for other modules; also move pnstrdup(). Clean up code slightly.	2008-06-18 18:42:54 +00:00
Bruce Momjian	dc69c0362f	Move USE_WIDE_UPPER_LOWER define to c.h, and remove TS_USE_WIDE and use USE_WIDE_UPPER_LOWER instead.	2008-06-17 16:09:06 +00:00
Tom Lane	e6dbcb72fa	Extend GIN to support partial-match searches, and extend tsquery to support prefix matching using this facility. Teodor Sigaev and Oleg Bartunov	2008-05-16 16:31:02 +00:00
Alvaro Herrera	f8c4d7db60	Restructure some header files a bit, in particular heapam.h, by removing some unnecessary #include lines in it. Also, move some tuple routine prototypes and macros to htup.h, which allows removal of heapam.h inclusion from some .c files. For this to work, a new header file access/sysattr.h needed to be created, initially containing attribute numbers of system columns, for pg_dump usage. While at it, make contrib ltree, intarray and hstore header files more consistent with our header style.	2008-05-12 00:00:54 +00:00
Tom Lane	220db7ccd8	Simplify and standardize conversions between TEXT datums and ordinary C strings. This patch introduces four support functions cstring_to_text, cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and two macros CStringGetTextDatum and TextDatumGetCString. A number of existing macros that provided variants on these themes were removed. Most of the places that need to make such conversions now require just one function or macro call, in place of the multiple notational layers that used to be needed. There are no longer any direct calls of textout or textin, and we got most of the places that were using handmade conversions via memcpy (there may be a few still lurking, though). This commit doesn't make any serious effort to eliminate transient memory leaks caused by detoasting toasted text objects before they reach text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few places where it was easy, but much more could be done. Brendan Jurd and Tom Lane	2008-03-25 22:42:46 +00:00
Tom Lane	7953fdcd9e	Add a CaseSensitive parameter to synonym dictionaries. Simon Riggs	2008-03-10 03:01:28 +00:00
Teodor Sigaev	3b8bca335d	Fix memory arrangement of tsquery after removing stop words. It causes a unused memory holes in tsquery. Per report by Richard Huxton <dev@archonet.com>. It was working well because in fact tsquery->size is not used for any kind of operation except comparing tsqueries. So, in HEAD it's enough to fix to_tsquery function, but for previous version it's needed to remove optimization in CompareTSQ to prevent requirement of renew all stored tsquery.	2008-03-07 14:30:20 +00:00
Bruce Momjian	910bc51862	When text search string is too long, in error message report actual and maximum number of bytes allowed.	2008-03-05 15:50:37 +00:00
Peter Eisentraut	0474dcb608	Refactor backend makefiles to remove lots of duplicate code	2008-02-19 10:30:09 +00:00
Peter Eisentraut	a345dcd2f7	Observe errors in makefile	2008-02-18 16:04:32 +00:00
Tom Lane	716e8b8374	Fix RS_isRegis() to agree exactly with RS_compile()'s idea of what's a valid regis. Correct the latter's oversight that a bracket-expression needs to be terminated. Reduce the ereports to elogs, since they are now not expected to ever be hit (thus addressing Alvaro's original complaint). In passing, const-ify the string argument to RS_compile.	2008-01-21 02:46:11 +00:00
Teodor Sigaev	cd42dd5a17	Fix core dump with buffer-overrun by too long infinitive. Add checking of using fixed length arrays to prevent array's overrun. Per report by Hannes Dorbath <light@theendofthetunnel.de> and comments by Tom.	2008-01-16 13:01:03 +00:00
Tom Lane	deb7deda26	Tweak new error message to conform to style guidelines.	2008-01-15 18:22:47 +00:00
Teodor Sigaev	f7807f1de8	Add check of headline method presence. Per report by Yoshiyuki Asaba <y-asaba@sraoss.co.jp>	2008-01-15 17:16:01 +00:00
Bruce Momjian	9098ab9e32	Update copyrights in source tree to 2008.	2008-01-01 19:46:01 +00:00
Peter Eisentraut	f5f1355dc4	Wording improvements	2007-12-27 13:02:48 +00:00
Tom Lane	bb0e3011f8	Make a cleanup pass over error reports in tsearch code. Use ereport for user-facing errors, fix some poor choices of errcode, adhere to message style guide.	2007-11-28 21:56:30 +00:00
Peter Eisentraut	a238bd146d	Proper capitalization of Ispell	2007-11-28 15:42:46 +00:00
Peter Eisentraut	2609345c85	Improve terminology	2007-11-28 13:30:36 +00:00
Bruce Momjian	43e082fc98	Change a stop word on the right-hand-side in the thesaurus file to be an ERROR, not NOTICE.	2007-11-28 04:24:38 +00:00
Andrew Dunstan	5575826b70	Allow for X as well as x to be the prefix for hexadecimal character ref entity numbers, as in HTML.	2007-11-25 19:35:41 +00:00
Andrew Dunstan	3de1f0daac	Fix XML tag namespace change inadvertently missed from previous fix. Add regression test for XML names and numeric entities.	2007-11-25 15:37:11 +00:00
Tom Lane	ae3ff7adf7	Fix (I think) broken usage of MultiByteToWideChar. I had missed the subtlety that this function only returns a null terminator if it's fed input that includes one; which, in the usage here, it's not. This probably fixes bugs reported by Thomas Haegi.	2007-11-24 21:20:07 +00:00
Andrew Dunstan	1157f3cc81	Change descriptions of entity and tag objects to "XML entity" and "XML tag". Allow tag and entity names that follow XML rules. Provide for hexadecimal as well as decimal numeric entities. Adjust code names to coincide with new descriptions.	2007-11-20 02:25:22 +00:00
Bruce Momjian	f6e8730d11	Re-run pgindent with updated list of typedefs. (Updated README should avoid this problem in the future.)	2007-11-15 22:25:18 +00:00
Bruce Momjian	fdf5a5efb7	pgindent run for 8.3.	2007-11-15 21:14:46 +00:00
Tom Lane	ca450a07ee	Add an Accept parameter to "simple" dictionaries. The default of true gives the old behavior; selecting false allows the dictionary to be used as a filter ahead of other dictionaries, because it will pass on rather than accept words that aren't in its stopword list. Jan Urbanski	2007-11-14 18:36:37 +00:00
Bruce Momjian	d009992ba3	Have text search thesaurus files use "?" for stop words. Throw an error for actual stop words, rather than a warning. This fixes problems with cache reloading causing warning messages. Re-enable stop words in regression tests; was disabled by Tom. Document "?" as API change.	2007-11-10 15:39:34 +00:00
Tom Lane	654dcfb9e4	Clean up ts_locale.h/.c. Fix broken and not-consistent-across-platforms behavior of wchar2char/char2wchar; this should resolve bug #3730. Avoid excess computations of pg_mblen in t_isalpha and friends. Const-ify APIs where possible.	2007-11-09 22:37:35 +00:00
Bruce Momjian	3991c3fb2b	In tsearch code, remove !(A && B) via restructuring, for clarity	2007-11-09 01:32:22 +00:00
Tom Lane	73e6f9d3b6	Change text search parsing rules for hyphenated words so that digit strings containing decimal points aren't considered part of a hyphenated word. Sync the hyphenated-word lookahead states with the subsequent part-by-part reparsing states so that we don't get different answers about how much text is part of the hyphenated word. Per my gripe of a few days ago.	2007-10-27 19:03:45 +00:00
Tom Lane	1aaf39bd20	Add some rudimentary tracing code to the default text search parser, to help in debugging its state-machine rules. Const-ify all the constant tables. Minor other code cleanup, including using "token" rather than "lexeme" to describe the output strings.	2007-10-27 17:53:15 +00:00
Tom Lane	d015d08b43	Rename default text search parser's "uri" token type to "url_path", per recommendation from Alvaro. This doesn't force initdb since the numeric token type in the catalogs doesn't change; but note that the expected regression test output changed.	2007-10-27 16:01:09 +00:00
Tom Lane	dbaec70c15	Rename and slightly redefine the default text search parser's "word" categories, as per discussion. asciiword (formerly lword) is still ASCII-letters-only, and numword (formerly word) is still the most general mixed-alpha-and-digits case. But word (formerly nlword) is now any-group-of-letters-with-at-least-one-non-ASCII, rather than all-non-ASCII as before. This is no worse than before for parsing mixed Russian/English text, which seems to have been the design center for the original coding; and it should simplify matters for parsing most European languages. In particular it will not be necessary for any language to accept strings containing digits as being regular "words". The hyphenated-word categories are adjusted similarly.	2007-10-23 20:46:12 +00:00
Tom Lane	bb36c51fcd	Fix several bugs in tsvectorin, including crash due to uninitialized field and miscomputation of required palloc size. The crash could only occur if the input contained lexemes both with and without positions, which is probably not common in practice. The miscomputation would definitely result in wasted space. Also fix some inconsistent coding around alignment of strings and positions in a tsvector value; these errors could also lead to crashes given mixed with/without position data and a machine that's picky about alignment. And be more careful about checking for overflow of string offsets. Patch is only against HEAD --- I have not looked to see if same bugs are in back-branch contrib/tsearch2 code.	2007-10-23 00:51:23 +00:00
Tom Lane	638bd34f89	Found another small glitch in tsearch API: the two versions of ts_lexize() are really redundant, since we invented a regdictionary alias type. We can have just one function, declared as taking regdictionary, and it will handle both behaviors. Noted while working on documentation.	2007-10-19 22:01:45 +00:00
Teodor Sigaev	689df1bc77	Fix crash of to_tsvector() function on huge input: compareWORD() function didn't return correct result for word position greate than limit. Per report from Stuart Bishop <stuart@stuartbishop.net>	2007-09-26 10:09:57 +00:00
Tom Lane	33b9c8bd68	Temporarily modify tsearch regression tests to suppress notice that comes out at erratic times, because it is creating a totally unacceptable level of noise in our buildfarm results. This patch can be reverted when and if the code is fixed to not issue notices during cache reload events.	2007-09-23 15:58:58 +00:00
Teodor Sigaev	8544110042	Avoid possibly-unportable initializer, per buildfarm warning per notice by Gregory Stark <stark@enterprisedb.com>	2007-09-18 15:03:23 +00:00
Teodor Sigaev	13553cbbff	Fix header's size of structs defines in ispell. Backpatch is needed for contrib version.	2007-09-11 12:57:05 +00:00
Teodor Sigaev	64def09592	Add regression tests for ispell, synonym and thesaurus dictionaries. Rename synonym.syn.sample and thesaurs.ths.sample to synonym_sample.syn and thesaurs_sample.ths accordingly to be able to use they in regression test. Ispell dictionary uses synthetic simple dictionary files.	2007-09-11 11:54:42 +00:00
Teodor Sigaev	53ef36cb4a	Fix recently introduced bugs about parsing ispell/hunspell files. In most cases it cause because of unneeded lowercasing of flags. Per experiment with regression checks with ispell dictionary.	2007-09-10 20:27:12 +00:00
Teodor Sigaev	d982daae0b	Change void* opaque argument to Datum type, add argument's name to PushFunction type definition. Per suggestion by Tom Lane <tgl@sss.pgh.pa.us>	2007-09-10 12:36:41 +00:00
Teodor Sigaev	83d0b9f3ca	Fixes from Heikki Linnakangas <heikki@enterprisedb.com>: Apparently it's a bug I introduced when I refactored spell.c to use the readline function for reading and recoding the input file. I didn't notice that some calls to STRNCMP used the non-lowercased version of the input line.	2007-09-10 10:39:56 +00:00
Teodor Sigaev	e5be89981f	Refactoring by Heikki Linnakangas <heikki@enterprisedb.com> with small editorization by me - Brake the QueryItem struct into QueryOperator and QueryOperand. Type was really the only common field between them. QueryItem still exists, and is used in the TSQuery struct as before, but it's now a union of the two. Many other changes fell from that, like separation of pushval_asis function into pushValue, pushOperator and pushStop. - Moved some structs that were for internal use only from header files to the right .c-files. - Moved tsvector parser to a new tsvector_parser.c file. Parser code was about half of the size of tsvector.c, it's also used from tsquery.c, and it has some data structures of its own, so it seems better to separate it. Cleaned up the API so that TSVectorParserState is not accessed from outside tsvector_parser.c. - Separated enumerations (#defines, really) used for QueryItem.type field and as return codes from gettoken_query. It was just accidental code sharing. - Removed ParseQueryNode struct used internally by makepol and friends. push*-functions now construct QueryItems directly. - Changed int4 variables to just ints for variables like "i" or "array size", where the storage-size was not significant.	2007-09-07 15:09:56 +00:00
Tom Lane	6d871a2538	Restrict tsearch config file base names to contain a-z, 0-9, and underscore, instead of the initial policy of whatever isalpha() likes. Per discussion.	2007-09-04 02:16:56 +00:00
Tom Lane	a13cefafb1	Fix synonym-dict breakage introduced in last patch :-(. Minor other cleanups.	2007-08-25 02:29:45 +00:00
Tom Lane	7351b5fa17	Cleanup for some problems in tsearch patch: - ispell initialization crashed on empty dictionary file - ispell initialization crashed on affix file with prefixes but no suffixes - stop words file was run through pg_verify_mbstr, with database encoding, but it's supposed to be UTF-8; similar bug for synonym files - bunch of comments added, typos fixed, and other cleanup Introduced consistent encoding checking/conversion of data read from tsearch configuration files, by doing this in a single t_readline() subroutine (replacing direct usages of fgets). Cleaned up API for readstopwords too. Heikki Linnakangas	2007-08-25 00:03:59 +00:00
Tom Lane	f4ccdb3a17	Fix VPATH-build problem in new tsearch makefile, per Chad Wagner.	2007-08-22 06:11:56 +00:00
Tom Lane	b77c6c7311	Whoops, missed updating dsynonym_init for new dictionary parameter method.	2007-08-22 04:13:15 +00:00
Tom Lane	d321421d0a	Simplify the syntax of CREATE/ALTER TEXT SEARCH DICTIONARY by treating the init options of the template as top-level options in the syntax. This also makes ALTER a bit easier to use, since options can be replaced individually. I also made these statements verify that the tmplinit method will accept the new settings before they get stored; in the original coding you didn't find out about mistakes until the dictionary got invoked. Under the hood, init methods now get options as a List of DefElem instead of a raw text string --- that lets tsearch use existing options-pushing code instead of duplicating functionality.	2007-08-22 01:39:46 +00:00
Tom Lane	140d4ebcb4	Tsearch2 functionality migrates to core. The bulk of this work is by Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing, so anything that's broken is probably my fault. Documentation is nonexistent as yet, but let's land the patch so we can get some portability testing done.	2007-08-21 01:11:32 +00:00

206 Commits