The old implementation first converted the input strings to arrays of wchars, and performed the conversion on those. However, the conversion is expensive, and for a large input string, consumes a lot of memory. Allocating the large arrays also meant that these functions could not be used on strings larger 1 GB / pg_encoding_max_length() (256 MB for UTF-8). Avoid the conversion, and instead use the single-byte algorithm even with multibyte encodings. That can get fooled, if there is a matching byte sequence in the middle of a multi-byte character, so to eliminate false positives like that, we verify any matches by walking the string character by character with pg_mblen(). Also, if the caller needs the position of the match, as a character-offset, we also need to walk the string to count the characters. Performance testing shows that walking the whole string with pg_mblen() is somewhat slower than converting the whole string to wchars. It's still often a win, though, because we don't need to do it if there is no match, and even when there is, we only need to walk up to the point where the match is, not the whole string. Even in the worst case, there would be room for optimization: Much of the CPU time in the current loop with pg_mblen() is function call overhead, and could be improved by inlining pg_mblen() and/or the encoding-specific mblen() functions. But I didn't attempt to do that as part of this patch. Most of the callers of text_position_setup/next functions were actually not interested in the position of the match, counted in characters. To cater for them, refactor the text_position_next() interface into two parts: searching for the next match (text_position_next()), and returning the current match's position as a pointer (text_position_get_match_ptr()) or as a character offset (text_position_get_match_pos()). Getting the pointer to the match is a more convenient API for many callers, and with UTF-8, it allows skipping the character-walking step altogether, because UTF-8 can't have false matches even when treated like raw byte strings. Reviewed-by: John Naylor Discussion: https://www.postgresql.org/message-id/3173d989-bc1c-fc8a-3b69-f24246f73876%40iki.fi |
||
---|---|---|
config | ||
contrib | ||
doc | ||
src | ||
.dir-locals.el | ||
.gitattributes | ||
.gitignore | ||
COPYRIGHT | ||
GNUmakefile.in | ||
HISTORY | ||
Makefile | ||
README | ||
README.git | ||
aclocal.m4 | ||
configure | ||
configure.in |
README
PostgreSQL Database Management System ===================================== This directory contains the source code distribution of the PostgreSQL database management system. PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. This distribution also contains C language bindings. PostgreSQL has many language interfaces, many of which are listed here: https://www.postgresql.org/download See the file INSTALL for instructions on how to build and install PostgreSQL. That file also lists supported operating systems and hardware platforms and contains information regarding any other software packages that are required to build or run the PostgreSQL system. Copyright and license information can be found in the file COPYRIGHT. A comprehensive documentation set is included in this distribution; it can be read as described in the installation instructions. The latest version of this software may be obtained at https://www.postgresql.org/download/. For more information look at our web site located at https://www.postgresql.org/.