Depending on which spec you read, field widths and precisions in %s may be
counted either in bytes or characters. Our code was assuming bytes, which
is wrong at least for glibc's implementation, and in any case libc might
have a different idea of the prevailing encoding than we do. Hence, for
portable results we must avoid using anything more complex than just "%s"
unless the string to be printed is known to be all-ASCII.
This patch fixes the cases I could find, including the psql formatting
failure reported by Hernan Gonzalez. In HEAD only, I also added comments
to some places where it appears safe to continue using "%.*s".
to RFC 3986. In particular, these characters now terminate the path part
of a URL: '"', '<', '>', '\', '^', '`', '{', '|', '}'. The previous behavior
was inconsistent and depended on whether a "?" was present in the path.
Per gripe from Donald Fraser and spec research by Kevin Grittner.
This is a pre-existing bug, but not back-patching since the risks of
breaking existing applications seem to outweigh the benefits.
"column < constant", and the comparison value is in the first or last
histogram bin or outside the histogram entirely, try to fetch the actual
column min or max value using an index scan (if there is an index on the
column). If successful, replace the lower or upper histogram bound with
that value before carrying on with the estimate. This limits the
estimation error caused by moving min/max values when the comparison
value is close to the min or max. Per a complaint from Josh Berkus.
It is tempting to consider using this mechanism for mergejoinscansel as well,
but that would inject index fetches into main-line join estimation not just
endpoint cases. I'm refraining from that until we can get a better handle
on the costs of doing this type of lookup.
For long source strings the copying results in O(N^2) behavior, and the
multiplier can be significant if wide-char conversion is involved.
Andres Freund, reviewed by Kevin Grittner.
Update install-sh to that from Autoconf 2.63, plus our Darwin-specific
changes (which I simplified a bit). install-sh is now able to install
multiple files in one run, so we could simplify our makefiles sometime.
install-sh also now has a -d option to create directories, so we don't need
mkinstalldirs anymore.
Use AC_PROG_MKDIR_P in configure.in, so we can use mkdir -p when available
instead of install-sh -d. For consistency with the rest of the world,
the corresponding make variable has been renamed from $(mkinstalldirs) to
$(MKDIR_P).
This alters various incidental uses of C++ key words to use other similar
identifiers, so that a C++ compiler won't choke outright. You still
(probably) need extern "C" { }; around the inclusion of backend headers.
based on a patch by Kurt Harriman <harriman@acm.org>
Also add a script cpluspluscheck to check for C++ compatibility in the
future. As of right now, this passes without error for me.
are not an alphabetic character although they are not word-breakers too.
So, treat them as part of word.
Per off-list discussion with Dibyendra Hyoju <dibyendra@gmail.com> and
and Bal Krishna Bal <balkrishna7bal@gmail.com> about Nepali language and
Devanagari alphabet.
- pg_wchar and wchar_t could have different size, so char2wchar
doesn't call pg_mb2wchar_with_len to prevent out-of-bound
memory bug
- make char2wchar/wchar2char symmetric, now they should not be
called with C-locale because mbstowcs/wcstombs oftenly doesn't
work correct with C-locale.
- Text parser uses pg_mb2wchar_with_len directly in case of
C-locale and multibyte encoding
Per bug report by Hiroshi Inoue <inoue@tpf.co.jp> and
following discussion.
Backpatch up to 8.2 when multybyte support was implemented in tsearch.
to 10, to compensate for the recent change in default statistics target.
The original number was pulled out of the air anyway :-(, but it was picked
in the context of the old default, so holding the default size of the
MCELEM array constant seems the best thing. Per discussion.
on the most common individual lexemes in place of the mostly-useless default
behavior of counting duplicate tsvectors. Future work: create selectivity
estimation functions that actually do something with these stats.
(Some other things we ought to look at doing: using the Lossy Counting
algorithm in compute_minimal_stats, and using the element-counting idea for
stats on regular arrays.)
Jan Urbanski
by installing an error context subroutine that will provide the file name
and line number for all errors detected while reading a config file.
Some of the reader routines were already doing that in an ad-hoc way for
errors detected directly in the reader, but it didn't help for problems
detected in subroutines, such as encoding violations.
Back-patch to 8.3 because 8.3 is where people will be trying to debug
configuration files.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
strings. This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString. A number of
existing macros that provided variants on these themes were removed.
Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed. There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).
This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.
Brendan Jurd and Tom Lane
a unused memory holes in tsquery.
Per report by Richard Huxton <dev@archonet.com>.
It was working well because in fact tsquery->size is not used for any
kind of operation except comparing tsqueries. So, in HEAD it's enough to
fix to_tsquery function, but for previous version it's needed to
remove optimization in CompareTSQ to prevent requirement of renew all
stored tsquery.
regis. Correct the latter's oversight that a bracket-expression needs to be
terminated. Reduce the ereports to elogs, since they are now not expected to
ever be hit (thus addressing Alvaro's original complaint).
In passing, const-ify the string argument to RS_compile.
subtlety that this function only returns a null terminator if it's
fed input that includes one; which, in the usage here, it's not.
This probably fixes bugs reported by Thomas Haegi.
Allow tag and entity names that follow XML rules. Provide for hexadecimal
as well as decimal numeric entities. Adjust code names to coincide with
new descriptions.
gives the old behavior; selecting false allows the dictionary to be used
as a filter ahead of other dictionaries, because it will pass on rather
than accept words that aren't in its stopword list.
Jan Urbanski
Throw an error for actual stop words, rather than a warning. This fixes
problems with cache reloading causing warning messages.
Re-enable stop words in regression tests; was disabled by Tom.
Document "?" as API change.
behavior of wchar2char/char2wchar; this should resolve bug #3730. Avoid
excess computations of pg_mblen in t_isalpha and friends. Const-ify
APIs where possible.
containing decimal points aren't considered part of a hyphenated word.
Sync the hyphenated-word lookahead states with the subsequent part-by-part
reparsing states so that we don't get different answers about how much text
is part of the hyphenated word. Per my gripe of a few days ago.
in debugging its state-machine rules. Const-ify all the constant tables.
Minor other code cleanup, including using "token" rather than "lexeme" to
describe the output strings.
per recommendation from Alvaro. This doesn't force initdb since the
numeric token type in the catalogs doesn't change; but note that
the expected regression test output changed.
categories, as per discussion. asciiword (formerly lword) is still
ASCII-letters-only, and numword (formerly word) is still the most general
mixed-alpha-and-digits case. But word (formerly nlword) is now
any-group-of-letters-with-at-least-one-non-ASCII, rather than all-non-ASCII as
before. This is no worse than before for parsing mixed Russian/English text,
which seems to have been the design center for the original coding; and it
should simplify matters for parsing most European languages. In particular
it will not be necessary for any language to accept strings containing digits
as being regular "words". The hyphenated-word categories are adjusted
similarly.
miscomputation of required palloc size. The crash could only occur if the
input contained lexemes both with and without positions, which is probably not
common in practice. The miscomputation would definitely result in wasted
space. Also fix some inconsistent coding around alignment of strings and
positions in a tsvector value; these errors could also lead to crashes given
mixed with/without position data and a machine that's picky about alignment.
And be more careful about checking for overflow of string offsets.
Patch is only against HEAD --- I have not looked to see if same bugs are
in back-branch contrib/tsearch2 code.
are really redundant, since we invented a regdictionary alias type.
We can have just one function, declared as taking regdictionary, and
it will handle both behaviors. Noted while working on documentation.
out at erratic times, because it is creating a totally unacceptable level
of noise in our buildfarm results. This patch can be reverted when and if
the code is fixed to not issue notices during cache reload events.
Rename synonym.syn.sample and thesaurs.ths.sample to
synonym_sample.syn and thesaurs_sample.ths accordingly to be able to use they
in regression test.
Ispell dictionary uses synthetic simple dictionary files.
Apparently it's a bug I introduced when I refactored spell.c to use the
readline function for reading and recoding the input file. I didn't
notice that some calls to STRNCMP used the non-lowercased version of the
input line.
small editorization by me
- Brake the QueryItem struct into QueryOperator and QueryOperand.
Type was really the only common field between them. QueryItem still
exists, and is used in the TSQuery struct as before, but it's now a
union of the two. Many other changes fell from that, like separation
of pushval_asis function into pushValue, pushOperator and pushStop.
- Moved some structs that were for internal use only from header files
to the right .c-files.
- Moved tsvector parser to a new tsvector_parser.c file. Parser code was
about half of the size of tsvector.c, it's also used from tsquery.c, and
it has some data structures of its own, so it seems better to separate
it. Cleaned up the API so that TSVectorParserState is not accessed from
outside tsvector_parser.c.
- Separated enumerations (#defines, really) used for QueryItem.type
field and as return codes from gettoken_query. It was just accidental
code sharing.
- Removed ParseQueryNode struct used internally by makepol and friends.
push*-functions now construct QueryItems directly.
- Changed int4 variables to just ints for variables like "i" or "array
size", where the storage-size was not significant.
- ispell initialization crashed on empty dictionary file
- ispell initialization crashed on affix file with prefixes but no suffixes
- stop words file was run through pg_verify_mbstr, with database
encoding, but it's supposed to be UTF-8; similar bug for synonym files
- bunch of comments added, typos fixed, and other cleanup
Introduced consistent encoding checking/conversion of data read from tsearch
configuration files, by doing this in a single t_readline() subroutine
(replacing direct usages of fgets). Cleaned up API for readstopwords too.
Heikki Linnakangas
init options of the template as top-level options in the syntax. This also
makes ALTER a bit easier to use, since options can be replaced individually.
I also made these statements verify that the tmplinit method will accept
the new settings before they get stored; in the original coding you didn't
find out about mistakes until the dictionary got invoked.
Under the hood, init methods now get options as a List of DefElem instead
of a raw text string --- that lets tsearch use existing options-pushing code
instead of duplicating functionality.
Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing,
so anything that's broken is probably my fault.
Documentation is nonexistent as yet, but let's land the patch so we can
get some portability testing done.