postgresql/src/backend/parser/scansup.c

/*-------------------------------------------------------------------------
 *
 * scansup.c
 *	  scanner support routines used by the core lexer
 *
 * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/parser/scansup.c
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include <ctype.h>

#include "mb/pg_wchar.h"
#include "parser/scansup.h"


/*
 * downcase_truncate_identifier() --- do appropriate downcasing and
 * truncation of an unquoted identifier.  Optionally warn of truncation.
 *
 * Returns a palloc'd string containing the adjusted identifier.
 *
 * Note: in some usages the passed string is not null-terminated.
 *
 * Note: the API of this function is designed to allow for downcasing
 * transformations that increase the string length, but we don't yet
 * support that.  If you want to implement it, you'll need to fix
 * SplitIdentifierString() in utils/adt/varlena.c.
 */
char *
downcase_truncate_identifier(const char *ident, int len, bool warn)
{
	return downcase_identifier(ident, len, warn, true);
}

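/*
 * downcase_identifier() --- workhorse for downcase_truncate_identifier().
 *
 * Downcases the identifier as described below, and truncates the result
 * to NAMEDATALEN-1 bytes only if "truncate" is true.
 */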
char *
downcase_identifier(const char *ident, int len, bool warn, bool truncate)
{
	char	   *result;
	int			i;
	bool		enc_is_single_byte;

	result = palloc(len + 1);
	enc_is_single_byte = pg_database_encoding_max_length() == 1;

	/*
	 * SQL99 specifies Unicode-aware case normalization, which we don't yet
	 * have the infrastructure for.  Instead we use tolower() to provide a
	 * locale-aware translation.  However, there are some locales where this
	 * is not right either (eg, Turkish may do strange things with 'i' and
	 * 'I').  Our current compromise is to use tolower() for characters with
	 * the high bit set, as long as they aren't part of a multi-byte
	 * character, and use an ASCII-only downcasing for 7-bit characters.
	 */
	for (i = 0; i < len; i++)
	{
		unsigned char ch = (unsigned char) ident[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
			ch = tolower(ch);
		result[i] = (char) ch;
	}
	result[i] = '\0';

	if (i >= NAMEDATALEN && truncate)
		truncate_identifier(result, i, warn);

	return result;
}


/*
 * truncate_identifier() --- truncate an identifier to NAMEDATALEN-1 bytes.
 *
 * The given string is modified in-place, if necessary.  A warning is
 * issued if requested.
 *
 * We require the caller to pass in the string length since this saves a
 * strlen() call in some common usages.
 */
void
truncate_identifier(char *ident, int len, bool warn)
{
	if (len >= NAMEDATALEN)
	{
		len = pg_mbcliplen(ident, len, NAMEDATALEN - 1);
		if (warn)
			ereport(NOTICE,
					(errcode(ERRCODE_NAME_TOO_LONG),
					 errmsg("identifier \"%s\" will be truncated to \"%.*s\"",
							ident, len, ident)));
		ident[len] = '\0';
	}
}

/*
 * scanner_isspace() --- return true if flex scanner considers char whitespace
 *
 * This should be used instead of the potentially locale-dependent isspace()
 * function when it's important to match the lexer's behavior.
 *
 * In principle we might need similar functions for isalnum etc, but for the
 * moment only isspace seems needed.
 */
bool
scanner_isspace(char ch)
{
	/* This must match scan.l's list of {space} characters */
	if (ch == ' ' ||
		ch == '\t' ||
		ch == '\n' ||
		ch == '\r' ||
		ch == '\v' ||
		ch == '\f')
		return true;
	return false;
}
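
The following is a minimal standalone sketch of the behavior described in the comments above, not the PostgreSQL implementation. It assumes a single-byte encoding, uses malloc() in place of palloc(), and omits the NOTICE report. Every name in it (SKETCH_NAMEDATALEN, sketch_downcase_ascii, sketch_truncate, sketch_isspace) is hypothetical. It covers only the ASCII branch of the downcasing loop, so bytes with the high bit set are copied unchanged, and it clips at a fixed byte offset where the real code calls pg_mbcliplen() to avoid splitting a multi-byte character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for NAMEDATALEN, which is set by PostgreSQL's build. */
#define SKETCH_NAMEDATALEN 64

/*
 * Downcase ASCII 'A'..'Z' only and copy every other byte unchanged.
 * The input need not be null-terminated; len is its length in bytes.
 * Returns a malloc'd, null-terminated copy that the caller must free,
 * or NULL if allocation fails.
 */
static char *
sketch_downcase_ascii(const char *ident, size_t len)
{
	char	   *result;
	size_t		i;

	assert(ident != NULL);
	result = malloc(len + 1);
	if (result == NULL)
		return NULL;

	for (i = 0; i < len; i++)
	{
		unsigned char ch = (unsigned char) ident[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		result[i] = (char) ch;
	}
	result[len] = '\0';
	return result;
}

/*
 * Truncate in place to SKETCH_NAMEDATALEN - 1 bytes.
 * Assumption: single-byte encoding, so clipping at a byte offset is safe.
 * Returns true if the string was shortened.
 */
static bool
sketch_truncate(char *ident, size_t len)
{
	assert(ident != NULL);
	if (len < SKETCH_NAMEDATALEN)
		return false;
	ident[SKETCH_NAMEDATALEN - 1] = '\0';
	return true;
}

/* Locale-independent whitespace test over the same six characters. */
static bool
sketch_isspace(char ch)
{
	return ch == ' ' || ch == '\t' || ch == '\n' ||
		ch == '\r' || ch == '\v' || ch == '\f';
}
```

Calling sketch_downcase_ascii("MyTableXYZ", 7) reads only the first seven bytes and yields "mytable", which mirrors the note that the passed string is not always null-terminated. Passing the length explicitly to sketch_truncate() follows the same reasoning as truncate_identifier(): a caller that already knows the length avoids a strlen() call.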