Our usual practice for "poor man's enum" catalog columns is to define
macros for the possible values and use those, not literal constants,
in C code. But for some reason lost in the mists of time, this was
never done for typalign/attalign or typstorage/attstorage. It's never
too late to make it better though, so let's do that.
The reason I got interested in this right now is the need to duplicate
some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch.
But in general, this sort of change aids greppability and readability,
so it's a good idea even without any specific motivation.
I may have missed a few places that could be converted, and it's even
more likely that pending patches will re-introduce some hard-coded
references. But that's not fatal --- there's no expectation that
we'd actually change any of these values. We can clean up stragglers
over time.
Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us
We used to strategically place newlines after some function call left
parentheses to make pgindent move the argument list a few chars to the
left, so that the whole line would fit under 80 chars. However,
pgindent no longer does that, so the newlines just made the code
vertically longer for no reason. Remove those newlines, and reflow some
of those lines for some extra naturality.
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
The basic rule we follow here is to always first include 'postgres.h' or
'postgres_fe.h' whichever is applicable, then system header includes and
then Postgres header includes. In this, we also follow that all the
Postgres header includes are in order based on their ASCII value. We
generally follow these rules, but the code has deviated in many places.
This commit makes it consistent just for contrib modules. The later
commits will enforce similar rules in other parts of code.
Author: Vignesh C
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
Descriptions of pg_trgm GUC options have % replaced with %% like it was
a printf-like format. But that's not needed since they are just plain strings.
This commit fixed that. Backpatch to last supported version since this error
present from the beginning.
Reported-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoAgPKODUsu9gqUFiNqEOAqedStxJ-a0sapsJXWWAVp%3Dxg%40mail.gmail.com
Backpatch-through: 9.4
Update typedefs.list from current buildfarm results. Adjust pgindent's
typedef blacklist to block some more unfortunate typedef names that have
snuck in since last time. Manually tweak a few places where I didn't
like the initial results of pgindent'ing.
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:
* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
than the expected column 33.
On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list. This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.
There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses. I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
This makes almost all core code follow the policy introduced in the
previous commit. Specific decisions:
- Text search support functions with char* and length arguments, such as
prsstart and lexize, may receive unaligned strings. I doubt
maintainers of non-core text search code will notice.
- Use plain VARDATA() on values detoasted or synthesized earlier in the
same function. Use VARDATA_ANY() on varlenas sourced outside the
function, even if they happen to always have four-byte headers. As an
exception, retain the universal practice of using VARDATA() on return
values of SendFunctionCall().
- Retain PG_GETARG_BYTEA_P() in pageinspect. (Page images are too large
for a one-byte header, so this misses no optimization.) Sites that do
not call get_page_from_raw() typically need the four-byte alignment.
- For now, do not change btree_gist. Its use of four-byte headers in
memory is partly entangled with storage of 4-byte headers inside
GBT_VARKEY, on disk.
- For now, do not change gtrgm_consistent() or gtrgm_distance(). They
incorporate the varlena header into a cache, and there are multiple
credible implementation strategies to consider.
Deprecated set_limit() is modified to use SetConfigOption() to set
similarity_threshold which is actually an instance of
pg_trgm.similarity_threshold GUC variable. Previous coding directly sets
similarity_threshold what could cause an inconsistency between states of
actual variable and GUC representation.
Per gripe from Tom Lane
Patch introduces a concept of similarity over string and just a word from
another string.
Version of extension is not changed because 1.2 was already introduced in 9.6
release cycle, so, there wasn't a public version.
Author: Alexander Korotkov, Artur Zakirov
Use GUC variable pg_trgm.similarity_threshold insead of
set_limit()/show_limit() which was introduced when defining GUC varuables
by modules was absent.
Author: Artur Zakirov
Now that we use CRC-32C in WAL and the control file, the "traditional" and
"legacy" CRC-32 variants are not used in any frontend programs anymore.
Move the code for those back from src/common to src/backend/utils/hash.
Also move the slicing-by-8 implementation (back) to src/port. This is in
preparation for next patch that will add another implementation that uses
Intel SSE 4.2 instructions to calculate CRC-32C, where available.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
Because of gcc -Wmissing-prototypes, all functions in dynamically
loadable modules must have a separate prototype declaration. This is
meant to detect global functions that are not declared in header files,
but in cases where the function is called via dfmgr, this is redundant.
Besides filling up space with boilerplate, this is a frequent source of
compiler warnings in extension modules.
We can fix that by creating the function prototype as part of the
PG_FUNCTION_INFO_V1 macro, which such modules have to use anyway. That
makes the code of modules cleaner, because there is one less place where
the entry points have to be listed, and creates an additional check that
functions have the right prototype.
Remove now redundant prototypes from contrib and other modules.
Allow for the possibility that folding a string to lower case makes it
longer (due to replacing a character with a longer multibyte character).
This doesn't change the number of trigrams that will be extracted, but
it does affect the required size of an intermediate buffer in
generate_trgm(). Per bug #8821 from Ufuk Kayserilioglu.
Also install some checks that the input string length is not so large
as to cause overflow in the calculations of palloc request sizes.
Back-patch to all supported versions.
This wasn't addressed in the original patch, but it doesn't take very
much additional code to cover the case, so let's get it done.
Since pg_trgm 1.1 hasn't been released yet, I just changed the definition
of what's in it, rather than inventing a 1.2.
This works by extracting trigrams from the given regular expression,
in generally the same spirit as the previously-existing support for
LIKE searches, though of course the details are far more complicated.
Currently, only GIN indexes are supported. We might be able to make
it work with GiST indexes later.
The implementation includes adding API functions to backend/regex/
to provide a view of the search NFA created from a regular expression.
These functions are meant to be generic enough to be supportable in
a standalone version of the regex library, should that ever happen.
Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
contrib/pg_trgm's make_trigrams() was coded to ignore multibyte character
boundaries and just make trigrams from bytes if USE_WIDE_UPPER_LOWER wasn't
defined. This is a bit odd, since there's no obvious reason why trigram
compaction rules should depend on the presence of towlower() and friends.
What's more, there was an Assert() that would fail if that code path was
fed any multibyte characters.
We need to do something about this since the pending regex-indexing patch
has an assumption that you get just one "trgm" from any three characters.
The best solution seems to be to remove the USE_WIDE_UPPER_LOWER
dependency, which shouldn't really have been there in the first place.
The second loop in make_trigrams() is now just a fast path and not a
potentially incompatible algorithm.
If there is anybody still using Postgres on machines without wcstombs() or
towlower(), and they have non-ASCII data indexed by pg_trgm, they'll need
to REINDEX those indexes after pg_upgrade to 9.3, else searches may fail
incorrectly. It seems likely that there are no such installations, though.
In passing, rename cnt_trigram to compact_trigram, which seems to better
describe its functionality, and improve make_trigrams' test for whether it
has to use the slow path or not (per a suggestion from Alexander Korotkov).
Cases such as similarity('', '') produced a NaN result due to computing
0/0. Per discussion, make it return zero instead.
This appears to be the basic cause of bug #7867 from Michele Baravalle,
although it remains unclear why her installation doesn't think Cyrillic
letters are letters.
Back-patch to all active branches.
Extraction of trigrams did not process LIKE escape sequences properly,
leading to possible misidentification of trigrams near escapes, resulting
in incorrect index search results.
Fujii Masao
Unlike Btree-based LIKE optimization, this works for non-left-anchored
search patterns. The effectiveness of the search depends on how many
trigrams can be extracted from the pattern. (The worst case, with no
trigrams, degrades to a full-table scan, so this isn't a panacea. But
it can be very useful.)
Alexander Korotkov, reviewed by Jan Urbanski
ways. I'm not totally sure that I caught everything, but at least now they pass
their regression tests with VARSIZE/SET_VARSIZE defined to reverse byte order.
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names. Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught. In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
suffix, to distinguish them from doubles. Make some function declarations
and definitions use the "const" qualifier for arguments consistently.
Ignore warning 4102 ("unreferenced label"), because such warnings
are always emitted by bison-generated code. Patch from Magnus Hagander.
more compliant with the error message style guide. In particular,
errdetail should begin with a capital letter and end with a period,
whereas errmsg should not. I also fixed a few related issues in
passing, such as fixing the repeated misspelling of "lexeme" in
contrib/tsearch2 (per Tom's suggestion).
--------------------------------------
The pg_trgm contrib module provides functions and index classes
for determining the similarity of text based on trigram
matching.