1998-03-15 08:39:04 +01:00
|
|
|
/*
|
2002-09-03 23:45:44 +02:00
|
|
|
* conversion functions between pg_wchar and multibyte streams.
|
1998-03-15 08:39:04 +01:00
|
|
|
* Tatsuo Ishii
|
2010-09-20 22:08:53 +02:00
|
|
|
* src/backend/utils/mb/wchar.c
|
1999-07-12 00:47:21 +02:00
|
|
|
*
|
1998-03-15 08:39:04 +01:00
|
|
|
*/
|
2001-02-10 03:31:31 +01:00
|
|
|
/* can be used in either frontend or backend */
|
Commit Karel's patch.
-------------------------------------------------------------------
Subject: Re: [PATCHES] encoding names
From: Karel Zak <zakkr@zf.jcu.cz>
To: Peter Eisentraut <peter_e@gmx.net>
Cc: pgsql-patches <pgsql-patches@postgresql.org>
Date: Fri, 31 Aug 2001 17:24:38 +0200
On Thu, Aug 30, 2001 at 01:30:40AM +0200, Peter Eisentraut wrote:
> > - convert encoding 'name' to 'id'
>
> I thought we decided not to add functions returning "new" names until we
> know exactly what the new names should be, and pending schema
Ok, the patch not to add functions.
> better
>
> ...(): encoding name too long
Fixed.
I found new bug in command/variable.c in parse_client_encoding(), nobody
probably never see this error:
if (pg_set_client_encoding(encoding))
{
elog(ERROR, "Conversion between %s and %s is not supported",
value, GetDatabaseEncodingName());
}
because pg_set_client_encoding() returns -1 for error and 0 as true.
It's fixed too.
IMHO it can be apply.
Karel
PS:
* following files are renamed:
src/utils/mb/Unicode/KOI8_to_utf8.map -->
src/utils/mb/Unicode/koi8r_to_utf8.map
src/utils/mb/Unicode/WIN_to_utf8.map -->
src/utils/mb/Unicode/win1251_to_utf8.map
src/utils/mb/Unicode/utf8_to_KOI8.map -->
src/utils/mb/Unicode/utf8_to_koi8r.map
src/utils/mb/Unicode/utf8_to_WIN.map -->
src/utils/mb/Unicode/utf8_to_win1251.map
* new file:
src/utils/mb/encname.c
* removed file:
src/utils/mb/common.c
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
2001-09-06 06:57:30 +02:00
|
|
|
#ifdef FRONTEND
|
2001-09-21 17:27:38 +02:00
|
|
|
#include "postgres_fe.h"
|
|
|
|
#define Assert(condition)
|
2001-09-06 06:57:30 +02:00
|
|
|
#else
|
2001-09-21 17:27:38 +02:00
|
|
|
#include "postgres.h"
|
2001-09-06 06:57:30 +02:00
|
|
|
#endif
|
|
|
|
|
2001-09-21 17:27:38 +02:00
|
|
|
#include "mb/pg_wchar.h"
|
|
|
|
|
2001-09-06 06:57:30 +02:00
|
|
|
|
1998-03-15 08:39:04 +01:00
|
|
|
/*
|
1998-06-16 09:29:54 +02:00
|
|
|
* Conversion to pg_wchar is table driven.
|
2006-05-21 22:05:21 +02:00
|
|
|
* To add support for an encoding, define mb2wchar_with_len(), mblen(), dsplen()
|
1998-06-16 09:29:54 +02:00
|
|
|
* for the particular encoding. Note that if the encoding is only
|
1998-09-01 06:40:42 +02:00
|
|
|
* supported in the client, you don't need to define
|
1998-06-16 09:29:54 +02:00
|
|
|
* mb2wchar_with_len() (SJIS is one such case).
|
2006-02-10 01:39:04 +01:00
|
|
|
*
|
2006-05-21 22:05:21 +02:00
|
|
|
* These functions generally assume that their input is validly formed.
|
|
|
|
* The "verifier" functions, further down in the file, have to be more
|
|
|
|
* paranoid. We expect that mblen() does not need to examine more than
|
|
|
|
* the first byte of the character to discover the correct length.
|
|
|
|
*
|
2006-02-10 01:39:04 +01:00
|
|
|
* Note: for the display output of psql to work properly, the return values
|
2006-05-21 22:05:21 +02:00
|
|
|
* of the dsplen functions must conform to the Unicode standard. In particular
|
2006-02-10 01:39:04 +01:00
|
|
|
* the NUL character is zero width and control characters are generally
|
|
|
|
* width -1. It is recommended that non-ASCII encodings refer their ASCII
|
2006-05-21 22:05:21 +02:00
|
|
|
* subset to the ASCII routines to ensure consistency.
|
1998-03-15 08:39:04 +01:00
|
|
|
*/
|
1998-08-25 06:19:16 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* SQL/ASCII
|
|
|
|
*/
|
2005-12-24 17:49:48 +01:00
|
|
|
static int
|
2007-10-16 00:46:27 +02:00
|
|
|
pg_ascii2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-08-25 06:19:16 +02:00
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
int cnt = 0;
|
2000-08-27 12:40:48 +02:00
|
|
|
|
2001-03-08 01:24:34 +01:00
|
|
|
while (len > 0 && *from)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to++ = *from++;
|
|
|
|
len--;
|
2000-08-27 12:40:48 +02:00
|
|
|
cnt++;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
|
|
|
*to = 0;
|
2005-12-24 17:49:48 +01:00
|
|
|
return cnt;
|
1998-08-25 06:19:16 +02:00
|
|
|
}
|
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
static int
|
|
|
|
pg_ascii_mblen(const unsigned char *s)
|
1998-08-25 06:19:16 +02:00
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return 1;
|
1998-08-25 06:19:16 +02:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_ascii_dsplen(const unsigned char *s)
|
|
|
|
{
|
2006-02-10 01:39:04 +01:00
|
|
|
if (*s == '\0')
|
|
|
|
return 0;
|
|
|
|
if (*s < 0x20 || *s == 0x7f)
|
|
|
|
return -1;
|
2006-10-04 02:30:14 +02:00
|
|
|
|
2005-12-24 17:49:48 +01:00
|
|
|
return 1;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
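
The width rules in pg_ascii_dsplen() (0 for NUL, -1 for control characters, 1 otherwise) can be exercised on their own. The following is a minimal sketch, not part of wchar.c: the names ascii_dsplen_sketch and ascii_display_width_sketch are hypothetical, and counting control characters as zero columns when summing is an assumption made only for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in that mirrors the rules of pg_ascii_dsplen(). */
static int
ascii_dsplen_sketch(unsigned char c)
{
    if (c == '\0')
        return 0;               /* NUL is zero width */
    if (c < 0x20 || c == 0x7f)
        return -1;              /* control characters report -1 */
    return 1;                   /* every other byte takes one column */
}

/*
 * Sum the display width of up to len bytes, stopping at NUL.
 * Assumption: bytes reporting -1 contribute no columns; a real caller
 * may prefer to escape or reject them instead.
 */
static int
ascii_display_width_sketch(const unsigned char *s, size_t len)
{
    int         width = 0;
    size_t      i;

    for (i = 0; i < len && s[i] != '\0'; i++)
    {
        int         w = ascii_dsplen_sketch(s[i]);

        if (w > 0)
            width += w;
    }
    return width;
}
```

For the three bytes "a", TAB, "b" the summed width is 2, because the TAB reports -1 and is skipped.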
|
|
|
|
|
1998-08-25 06:19:16 +02:00
|
|
|
/*
|
|
|
|
* EUC
|
|
|
|
*/
|
2007-10-16 00:46:27 +02:00
|
|
|
static int
|
|
|
|
pg_euc2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
int cnt = 0;
|
2000-08-27 12:40:48 +02:00
|
|
|
|
2001-03-08 01:24:34 +01:00
|
|
|
while (len > 0 && *from)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
if (*from == SS2 && len >= 2) /* JIS X 0201 (so called "1 byte
|
|
|
|
* KANA") */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
2005-12-24 10:35:36 +01:00
|
|
|
*to = (SS2 << 8) | *from++;
|
2001-03-08 01:24:34 +01:00
|
|
|
len -= 2;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
2005-12-24 10:35:36 +01:00
|
|
|
else if (*from == SS3 && len >= 3) /* JIS X 0212 KANJI */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
2005-12-24 10:35:36 +01:00
|
|
|
*to = (SS3 << 16) | (*from++ << 8);
|
|
|
|
*to |= *from++;
|
1998-09-01 06:40:42 +02:00
|
|
|
len -= 3;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
else if (IS_HIGHBIT_SET(*from) && len >= 2) /* JIS X 0208 KANJI */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to = *from++ << 8;
|
|
|
|
*to |= *from++;
|
|
|
|
len -= 2;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
else
|
|
|
|
/* must be ASCII */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to = *from++;
|
|
|
|
len--;
|
|
|
|
}
|
|
|
|
to++;
|
2000-08-27 12:40:48 +02:00
|
|
|
cnt++;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
|
|
|
*to = 0;
|
2005-12-24 17:49:48 +01:00
|
|
|
return cnt;
|
1998-03-15 08:39:04 +01:00
|
|
|
}
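
The packing scheme used by pg_euc2wchar_with_len() can be shown for a single character. This is a sketch under stated assumptions, not the file's code: euc_decode_one_sketch is a hypothetical name, uint32_t stands in for pg_wchar, and the SS2/SS3 values 0x8e and 0x8f are assumed to match the definitions in mb/pg_wchar.h.

```c
#include <stdint.h>

#define SKETCH_SS2 0x8e         /* assumed value of SS2 */
#define SKETCH_SS3 0x8f         /* assumed value of SS3 */

/*
 * Decode one EUC character from at most len bytes into *out and return
 * the number of bytes consumed (0 when len <= 0).  As in the loop above,
 * a multibyte lead byte without enough following bytes falls through to
 * the single-byte branch.
 */
static int
euc_decode_one_sketch(const unsigned char *from, int len, uint32_t *out)
{
    if (len <= 0)
        return 0;
    if (from[0] == SKETCH_SS2 && len >= 2)
    {
        /* SS2 in the second-lowest byte, payload in the lowest byte */
        *out = ((uint32_t) SKETCH_SS2 << 8) | from[1];
        return 2;
    }
    if (from[0] == SKETCH_SS3 && len >= 3)
    {
        /* SS3 in the third byte, then the two payload bytes */
        *out = ((uint32_t) SKETCH_SS3 << 16) |
            ((uint32_t) from[1] << 8) | from[2];
        return 3;
    }
    if ((from[0] & 0x80) && len >= 2)
    {
        /* ordinary two-byte character */
        *out = ((uint32_t) from[0] << 8) | from[1];
        return 2;
    }
    *out = from[0];             /* ASCII, or a truncated lead byte */
    return 1;
}
```

With these assumptions, the bytes 0x8e 0xb1 pack to 0x8eb1, 0x8f 0xb0 0xa1 pack to 0x8fb0a1, and 0xa4 0xa2 pack to 0xa4a2.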
|
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
static inline int
|
1998-09-01 06:40:42 +02:00
|
|
|
pg_euc_mblen(const unsigned char *s)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
1998-09-01 06:40:42 +02:00
|
|
|
int len;
|
1998-06-16 09:29:54 +02:00
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
if (*s == SS2)
|
|
|
|
len = 2;
|
|
|
|
else if (*s == SS3)
|
|
|
|
len = 3;
|
2005-12-25 03:14:19 +01:00
|
|
|
else if (IS_HIGHBIT_SET(*s))
|
1998-09-01 06:40:42 +02:00
|
|
|
len = 2;
|
|
|
|
else
|
|
|
|
len = 1;
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
1998-03-15 08:39:04 +01:00
|
|
|
}
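
Because pg_euc_mblen() needs only the first byte, a caller can step through a buffer one character at a time. The sketch below illustrates that idea; euc_mblen_sketch and euc_char_count_sketch are hypothetical names, the constants 0x8e and 0x8f are assumed values for SS2 and SS3, and stopping at an incomplete trailing character is a policy chosen for this example only.

```c
#include <stddef.h>

/* Hypothetical stand-in for pg_euc_mblen(): length from the first byte. */
static int
euc_mblen_sketch(unsigned char first)
{
    if (first == 0x8e)          /* assumed SS2 */
        return 2;
    if (first == 0x8f)          /* assumed SS3 */
        return 3;
    if (first & 0x80)           /* high bit set: two-byte character */
        return 2;
    return 1;                   /* ASCII */
}

/*
 * Count the complete characters in the first len bytes of s.
 * Assumption: an incomplete trailing character is not counted.
 */
static size_t
euc_char_count_sketch(const unsigned char *s, size_t len)
{
    size_t      count = 0;
    size_t      i = 0;

    while (i < len)
    {
        size_t      l = (size_t) euc_mblen_sketch(s[i]);

        if (l > len - i)
            break;              /* trailing character is cut short */
        i += l;
        count++;
    }
    return count;
}
```

An eight-byte buffer holding one ASCII byte, one two-byte character, one SS3 sequence and one SS2 sequence yields a count of 4.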
|
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
static inline int
|
2004-03-15 11:41:26 +01:00
|
|
|
pg_euc_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
|
|
|
if (*s == SS2)
|
|
|
|
len = 2;
|
|
|
|
else if (*s == SS3)
|
|
|
|
len = 2;
|
2005-12-25 03:14:19 +01:00
|
|
|
else if (IS_HIGHBIT_SET(*s))
|
2004-03-15 11:41:26 +01:00
|
|
|
len = 2;
|
|
|
|
else
|
2006-02-10 01:39:04 +01:00
|
|
|
len = pg_ascii_dsplen(s);
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-03-15 08:39:04 +01:00
|
|
|
/*
|
1998-06-16 09:29:54 +02:00
|
|
|
* EUC_JP
|
1998-03-15 08:39:04 +01:00
|
|
|
*/
|
2007-10-16 00:46:27 +02:00
|
|
|
static int
|
|
|
|
pg_eucjp2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return pg_euc2wchar_with_len(from, to, len);
|
1998-03-15 08:39:04 +01:00
|
|
|
}
|
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
static int
|
|
|
|
pg_eucjp_mblen(const unsigned char *s)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return pg_euc_mblen(s);
|
1998-03-15 08:39:04 +01:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_eucjp_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
|
|
|
if (*s == SS2)
|
|
|
|
len = 1;
|
|
|
|
else if (*s == SS3)
|
|
|
|
len = 2;
|
2005-12-25 03:14:19 +01:00
|
|
|
else if (IS_HIGHBIT_SET(*s))
|
2004-03-15 11:41:26 +01:00
|
|
|
len = 2;
|
|
|
|
else
|
2006-02-10 01:39:04 +01:00
|
|
|
len = pg_ascii_dsplen(s);
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-03-15 08:39:04 +01:00
|
|
|
/*
|
1998-06-16 09:29:54 +02:00
|
|
|
* EUC_KR
|
1998-03-15 08:39:04 +01:00
|
|
|
*/
|
2007-10-16 00:46:27 +02:00
|
|
|
static int
|
|
|
|
pg_euckr2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return pg_euc2wchar_with_len(from, to, len);
|
1998-03-15 08:39:04 +01:00
|
|
|
}
|
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
static int
|
|
|
|
pg_euckr_mblen(const unsigned char *s)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return pg_euc_mblen(s);
|
1998-03-15 08:39:04 +01:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_euckr_dsplen(const unsigned char *s)
|
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return pg_euc_dsplen(s);
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-06-16 09:29:54 +02:00
|
|
|
/*
|
|
|
|
* EUC_CN
|
2005-12-24 10:35:36 +01:00
|
|
|
*
|
1998-06-16 09:29:54 +02:00
|
|
|
*/
|
2007-10-16 00:46:27 +02:00
|
|
|
static int
|
|
|
|
pg_euccn2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
int cnt = 0;
|
2000-08-27 12:40:48 +02:00
|
|
|
|
2001-03-08 01:24:34 +01:00
|
|
|
while (len > 0 && *from)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
2005-12-24 10:35:36 +01:00
|
|
|
if (*from == SS2 && len >= 3) /* code set 2 (unused?) */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
2005-12-24 10:35:36 +01:00
|
|
|
*to = (SS2 << 16) | (*from++ << 8);
|
|
|
|
*to |= *from++;
|
2001-03-08 01:24:34 +01:00
|
|
|
len -= 3;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
2005-12-24 10:35:36 +01:00
|
|
|
else if (*from == SS3 && len >= 3) /* code set 3 (unused?) */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
2005-12-24 10:35:36 +01:00
|
|
|
*to = (SS3 << 16) | (*from++ << 8);
|
|
|
|
*to |= *from++;
|
1998-09-01 06:40:42 +02:00
|
|
|
len -= 3;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
else if (IS_HIGHBIT_SET(*from) && len >= 2) /* code set 1 */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to = *from++ << 8;
|
|
|
|
*to |= *from++;
|
|
|
|
len -= 2;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
*to = *from++;
|
|
|
|
len--;
|
|
|
|
}
|
|
|
|
to++;
|
2000-08-27 12:40:48 +02:00
|
|
|
cnt++;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
|
|
|
*to = 0;
|
2005-12-24 17:49:48 +01:00
|
|
|
return cnt;
|
1998-03-15 08:39:04 +01:00
|
|
|
}
|
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
static int
|
|
|
|
pg_euccn_mblen(const unsigned char *s)
|
1998-06-16 09:29:54 +02:00
|
|
|
{
|
1998-09-01 06:40:42 +02:00
|
|
|
int len;
|
1998-06-16 09:29:54 +02:00
|
|
|
|
2005-12-25 03:14:19 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
1998-09-01 06:40:42 +02:00
|
|
|
len = 2;
|
|
|
|
else
|
|
|
|
len = 1;
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
1998-06-16 09:29:54 +02:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_euccn_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2005-12-25 03:14:19 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2004-03-15 11:41:26 +01:00
|
|
|
len = 2;
|
|
|
|
else
|
2006-02-10 01:39:04 +01:00
|
|
|
len = pg_ascii_dsplen(s);
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-06-16 09:29:54 +02:00
|
|
|
/*
|
|
|
|
* EUC_TW
|
2005-12-24 10:35:36 +01:00
|
|
|
*
|
1998-06-16 09:29:54 +02:00
|
|
|
*/
|
2007-10-16 00:46:27 +02:00
|
|
|
static int
|
|
|
|
pg_euctw2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
int cnt = 0;
|
2000-08-27 12:40:48 +02:00
|
|
|
|
2001-03-08 01:24:34 +01:00
|
|
|
while (len > 0 && *from)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
2005-12-24 10:35:36 +01:00
|
|
|
if (*from == SS2 && len >= 4) /* code set 2 */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
2007-07-12 23:17:09 +02:00
|
|
|
*to = (((uint32) SS2) << 24) | (*from++ << 16);
|
1998-09-01 06:40:42 +02:00
|
|
|
*to |= *from++ << 8;
|
|
|
|
*to |= *from++;
|
2001-03-08 01:24:34 +01:00
|
|
|
len -= 4;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
2005-12-24 10:35:36 +01:00
|
|
|
else if (*from == SS3 && len >= 3) /* code set 3 (unused?) */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
2005-12-24 10:35:36 +01:00
|
|
|
*to = (SS3 << 16) | (*from++ << 8);
|
|
|
|
*to |= *from++;
|
1998-09-01 06:40:42 +02:00
|
|
|
len -= 3;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
else if (IS_HIGHBIT_SET(*from) && len >= 2) /* code set 1 */
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to = *from++ << 8;
|
|
|
|
*to |= *from++;
|
|
|
|
len -= 2;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
*to = *from++;
|
|
|
|
len--;
|
|
|
|
}
|
|
|
|
to++;
|
2000-08-27 12:40:48 +02:00
|
|
|
cnt++;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
|
|
|
*to = 0;
|
2005-12-24 17:49:48 +01:00
|
|
|
return cnt;
|
1998-03-15 08:39:04 +01:00
|
|
|
}
|
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
static int
|
|
|
|
pg_euctw_mblen(const unsigned char *s)
|
1998-06-16 09:29:54 +02:00
|
|
|
{
|
1998-09-01 06:40:42 +02:00
|
|
|
int len;
|
1998-06-16 09:29:54 +02:00
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
if (*s == SS2)
|
|
|
|
len = 4;
|
|
|
|
else if (*s == SS3)
|
|
|
|
len = 3;
|
2005-12-25 03:14:19 +01:00
|
|
|
else if (IS_HIGHBIT_SET(*s))
|
1998-09-01 06:40:42 +02:00
|
|
|
len = 2;
|
|
|
|
else
|
2006-05-21 22:05:21 +02:00
|
|
|
len = 1;
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
1998-06-16 09:29:54 +02:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_euctw_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
|
|
|
if (*s == SS2)
|
|
|
|
len = 2;
|
|
|
|
else if (*s == SS3)
|
|
|
|
len = 2;
|
2005-12-25 03:14:19 +01:00
|
|
|
else if (IS_HIGHBIT_SET(*s))
|
2004-03-15 11:41:26 +01:00
|
|
|
len = 2;
|
|
|
|
else
|
2006-02-10 01:39:04 +01:00
|
|
|
len = pg_ascii_dsplen(s);
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
2002-03-05 06:52:50 +01:00
|
|
|
/*
|
|
|
|
* JOHAB
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
pg_johab_mblen(const unsigned char *s)
|
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return pg_euc_mblen(s);
|
2002-03-05 06:52:50 +01:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_johab_dsplen(const unsigned char *s)
|
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return pg_euc_dsplen(s);
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-03-15 08:39:04 +01:00
|
|
|
/*
|
Get pg_utf_mblen(), pg_utf2wchar_with_len(), and utf2ucs() all on the same
page about the maximum UTF8 sequence length we support (4 bytes since 8.1,
3 before that). pg_utf2wchar_with_len never got updated to support 4-byte
characters at all, and in any case had a buffer-overrun risk in that it
could produce multiple pg_wchars from what mblen claims to be just one UTF8
character. The only reason we don't have a major security hole is that most
callers allocate worst-case output buffers; the sole exception in released
versions appears to be pre-8.2 iwchareq() (ie, ILIKE), which can be crashed
due to zeroing out its return address --- but AFAICS that can't be exploited
for anything more than a crash, due to inability to control what gets written
there. Per report from James Russell and Michael Fuhr.
Pre-8.1 the risk is much less, but I still think pg_utf2wchar_with_len's
behavior given an incomplete final character risks buffer overrun, so
back-patch that logic change anyway.
This patch also makes sure that UTF8 sequences exceeding the supported
length (whichever it is) are consistently treated as error cases, rather
than being treated like a valid shorter sequence in some places.
2007-01-24 18:12:17 +01:00
|
|
|
* convert UTF8 string to pg_wchar (UCS-4)
|
|
|
|
* caller must allocate enough space for "to", including a trailing zero!
|
1998-03-15 08:39:04 +01:00
|
|
|
* len: length of from.
|
|
|
|
* "from" not necessarily null terminated.
|
|
|
|
*/
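
The contract described in this comment and in the commit message above (at most 4 bytes per character, an incomplete final character dropped, output always zero terminated) can be sketched as a standalone function. This is an illustration, not the file's implementation: utf8_to_ucs4_sketch is a hypothetical name, uint32_t stands in for pg_wchar, and copying an invalid lead byte through as a single unit is an assumption, since that branch is not shown here.

```c
#include <stdint.h>

/*
 * Convert up to len bytes of UTF-8 into UCS-4 code points.
 * "to" must have room for len + 1 entries, including the trailing zero.
 * Returns the number of code points written, not counting the zero.
 */
static int
utf8_to_ucs4_sketch(const unsigned char *from, int len, uint32_t *to)
{
    int         cnt = 0;
    uint32_t    c1, c2, c3, c4;

    while (len > 0 && *from)
    {
        if ((*from & 0x80) == 0)
        {
            *to = *from++;
            len--;
        }
        else if ((*from & 0xe0) == 0xc0)
        {
            if (len < 2)
                break;          /* drop trailing incomplete char */
            c1 = *from++ & 0x1f;
            c2 = *from++ & 0x3f;
            *to = (c1 << 6) | c2;
            len -= 2;
        }
        else if ((*from & 0xf0) == 0xe0)
        {
            if (len < 3)
                break;          /* drop trailing incomplete char */
            c1 = *from++ & 0x0f;
            c2 = *from++ & 0x3f;
            c3 = *from++ & 0x3f;
            *to = (c1 << 12) | (c2 << 6) | c3;
            len -= 3;
        }
        else if ((*from & 0xf8) == 0xf0)
        {
            if (len < 4)
                break;          /* drop trailing incomplete char */
            c1 = *from++ & 0x07;
            c2 = *from++ & 0x3f;
            c3 = *from++ & 0x3f;
            c4 = *from++ & 0x3f;
            *to = (c1 << 18) | (c2 << 12) | (c3 << 6) | c4;
            len -= 4;
        }
        else
        {
            /* Assumption: pass an invalid lead byte through unchanged. */
            *to = *from++;
            len--;
        }
        to++;
        cnt++;
    }
    *to = 0;
    return cnt;
}
```

Each input character produces exactly one output entry, so a buffer of len + 1 entries is always sufficient; that sizing rule is what the comment's "enough space" requirement amounts to in this sketch.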
|
2000-08-27 12:40:48 +02:00
|
|
|
static int
|
2001-10-25 07:50:21 +02:00
|
|
|
pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
int cnt = 0;
|
2007-01-24 18:12:17 +01:00
|
|
|
uint32 c1,
|
|
|
|
c2,
|
|
|
|
c3,
|
|
|
|
c4;
|
1998-09-01 06:40:42 +02:00
|
|
|
|
2001-03-08 01:24:34 +01:00
|
|
|
while (len > 0 && *from)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
2007-01-24 18:12:17 +01:00
|
|
|
if ((*from & 0x80) == 0)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to = *from++;
|
|
|
|
len--;
|
|
|
|
}
|
2007-01-24 18:12:17 +01:00
|
|
|
else if ((*from & 0xe0) == 0xc0)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
2007-01-24 18:12:17 +01:00
|
|
|
if (len < 2)
|
|
|
|
break; /* drop trailing incomplete char */
|
1998-09-01 06:40:42 +02:00
|
|
|
c1 = *from++ & 0x1f;
|
|
|
|
c2 = *from++ & 0x3f;
|
2007-01-24 18:12:17 +01:00
|
|
|
*to = (c1 << 6) | c2;
|
2001-03-08 01:24:34 +01:00
|
|
|
len -= 2;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
2007-01-24 18:12:17 +01:00
|
|
|
else if ((*from & 0xf0) == 0xe0)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
2007-01-24 18:12:17 +01:00
|
|
|
if (len < 3)
|
|
|
|
break; /* drop trailing incomplete char */
|
1998-09-01 06:40:42 +02:00
|
|
|
c1 = *from++ & 0x0f;
|
|
|
|
c2 = *from++ & 0x3f;
|
|
|
|
c3 = *from++ & 0x3f;
|
2007-01-24 18:12:17 +01:00
|
|
|
*to = (c1 << 12) | (c2 << 6) | c3;
|
2001-03-08 01:24:34 +01:00
|
|
|
len -= 3;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
2007-01-24 18:12:17 +01:00
|
|
|
else if ((*from & 0xf8) == 0xf0)
|
|
|
|
{
|
|
|
|
if (len < 4)
|
|
|
|
break; /* drop trailing incomplete char */
|
|
|
|
c1 = *from++ & 0x07;
|
|
|
|
c2 = *from++ & 0x3f;
|
|
|
|
c3 = *from++ & 0x3f;
|
|
|
|
c4 = *from++ & 0x3f;
|
|
|
|
*to = (c1 << 18) | (c2 << 12) | (c3 << 6) | c4;
|
|
|
|
len -= 4;
|
|
|
|
}
|
1999-04-25 22:35:51 +02:00
|
|
|
else
|
|
|
|
{
|
2007-01-24 18:12:17 +01:00
|
|
|
/* treat a bogus char as length 1; not ours to raise error */
|
1999-04-25 22:35:51 +02:00
|
|
|
*to = *from++;
|
|
|
|
len--;
|
|
|
|
}
|
1998-09-01 06:40:42 +02:00
|
|
|
to++;
|
2000-08-27 12:40:48 +02:00
|
|
|
cnt++;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
|
|
|
*to = 0;
|
2005-12-24 17:49:48 +01:00
|
|
|
return cnt;
|
1998-03-15 08:39:04 +01:00
|
|
|
}
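The conversion loop that ends here can be hard to follow through the blame columns, so here is a minimal standalone sketch of the same idea: decode UTF-8 into 32-bit code points, pass a bogus lead byte through as a single unit, and drop a truncated final character instead of reading past the input. The name `sketch_utf8_decode` and the use of `uint32_t` in place of `pg_wchar` are assumptions made for illustration; this is not the function from the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of wchar.c): decode up to len bytes of
 * UTF-8 into code points.  The caller must provide room for len + 1
 * entries in "to": at most one output per input byte, plus a terminator.
 * Returns the number of code points written, not counting the terminator.
 */
static int
sketch_utf8_decode(const unsigned char *from, uint32_t *to, int len)
{
    int cnt = 0;

    assert(from != NULL && to != NULL);
    while (len > 0 && *from)
    {
        int      need;
        uint32_t cp;

        if ((*from & 0x80) == 0)         { need = 1; cp = *from; }
        else if ((*from & 0xe0) == 0xc0) { need = 2; cp = *from & 0x1f; }
        else if ((*from & 0xf0) == 0xe0) { need = 3; cp = *from & 0x0f; }
        else if ((*from & 0xf8) == 0xf0) { need = 4; cp = *from & 0x07; }
        else                             { need = 1; cp = *from; } /* bogus byte */

        if (len < need)
            break;              /* drop trailing incomplete character */
        for (int i = 1; i < need; i++)
            cp = (cp << 6) | (from[i] & 0x3f);
        from += need;
        len -= need;
        *to++ = cp;
        cnt++;
    }
    *to = 0;
    return cnt;
}
```

Like the code above it, the sketch does not validate continuation bytes; it only guarantees that one input character yields exactly one output element and that no byte beyond `len` is read.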
|
|
|
|
|
2008-10-29 09:04:54 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Map a Unicode code point to UTF-8. utf8string must have 4 bytes of
|
|
|
|
* space allocated.
|
|
|
|
*/
|
|
|
|
unsigned char *
|
|
|
|
unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
|
|
|
|
{
|
|
|
|
if (c <= 0x7F)
|
|
|
|
{
|
|
|
|
utf8string[0] = c;
|
|
|
|
}
|
|
|
|
else if (c <= 0x7FF)
|
|
|
|
{
|
|
|
|
utf8string[0] = 0xC0 | ((c >> 6) & 0x1F);
|
|
|
|
utf8string[1] = 0x80 | (c & 0x3F);
|
|
|
|
}
|
|
|
|
else if (c <= 0xFFFF)
|
|
|
|
{
|
|
|
|
utf8string[0] = 0xE0 | ((c >> 12) & 0x0F);
|
|
|
|
utf8string[1] = 0x80 | ((c >> 6) & 0x3F);
|
|
|
|
utf8string[2] = 0x80 | (c & 0x3F);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
utf8string[0] = 0xF0 | ((c >> 18) & 0x07);
|
|
|
|
utf8string[1] = 0x80 | ((c >> 12) & 0x3F);
|
|
|
|
utf8string[2] = 0x80 | ((c >> 6) & 0x3F);
|
|
|
|
utf8string[3] = 0x80 | (c & 0x3F);
|
|
|
|
}
|
|
|
|
|
|
|
|
return utf8string;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2000-10-12 08:06:50 +02:00
|
|
|
/*
|
2007-01-24 18:12:17 +01:00
|
|
|
* Return the byte length of a UTF8 character pointed to by s
|
|
|
|
*
|
|
|
|
* Note: in the current implementation we do not support UTF8 sequences
|
|
|
|
* of more than 4 bytes; hence do NOT return a value larger than 4.
|
|
|
|
* We return "1" for any leading byte that is either flat-out illegal or
|
|
|
|
* indicates a length larger than we support.
|
|
|
|
*
|
2010-08-18 21:54:01 +02:00
|
|
|
* pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(), and perhaps
|
2007-01-24 18:12:17 +01:00
|
|
|
* other places would need to be fixed to change this.
|
2000-10-12 08:06:50 +02:00
|
|
|
*/
|
|
|
|
int
|
2004-12-03 02:20:33 +01:00
|
|
|
pg_utf_mblen(const unsigned char *s)
|
1998-06-16 09:29:54 +02:00
|
|
|
{
|
2007-01-24 18:12:17 +01:00
|
|
|
int len;
|
1998-06-16 09:29:54 +02:00
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
if ((*s & 0x80) == 0)
|
|
|
|
len = 1;
|
|
|
|
else if ((*s & 0xe0) == 0xc0)
|
|
|
|
len = 2;
|
2005-10-15 04:49:52 +02:00
|
|
|
else if ((*s & 0xf0) == 0xe0)
|
|
|
|
len = 3;
|
|
|
|
else if ((*s & 0xf8) == 0xf0)
|
|
|
|
len = 4;
|
2007-01-24 18:12:17 +01:00
|
|
|
#ifdef NOT_USED
|
2005-10-15 04:49:52 +02:00
|
|
|
else if ((*s & 0xfc) == 0xf8)
|
|
|
|
len = 5;
|
|
|
|
else if ((*s & 0xfe) == 0xfc)
|
|
|
|
len = 6;
|
2007-01-24 18:12:17 +01:00
|
|
|
#endif
|
|
|
|
else
|
|
|
|
len = 1;
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
1998-06-16 09:29:54 +02:00
|
|
|
}
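A length function that looks only at the lead byte is typically used to step through a buffer one character at a time. The sketch below counts characters that way and clamps the step so that a truncated final character cannot push the pointer past the end of the buffer. `sketch_utf8_mblen` mirrors the lead-byte logic above; `sketch_utf8_charcount` is a hypothetical helper, not a function from this file.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte length, capped at 4; illegal or unsupported lead bytes give 1. */
static int
sketch_utf8_mblen(const unsigned char *s)
{
    if ((*s & 0x80) == 0)
        return 1;
    if ((*s & 0xe0) == 0xc0)
        return 2;
    if ((*s & 0xf0) == 0xe0)
        return 3;
    if ((*s & 0xf8) == 0xf0)
        return 4;
    return 1;
}

/* Hypothetical helper: count characters in the first len bytes of s. */
static size_t
sketch_utf8_charcount(const unsigned char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL);
    while (len > 0)
    {
        size_t l = (size_t) sketch_utf8_mblen(s);

        if (l > len)
            l = len;            /* truncated final character: do not overrun */
        s += l;
        len -= l;
        n++;
    }
    return n;
}
```

The clamp is the important design point: the length comes from a single byte of untrusted input, so the caller, not the length function, is responsible for comparing it with the bytes that remain.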
|
|
|
|
|
2006-02-10 01:39:04 +01:00
|
|
|
/*
|
|
|
|
* This is an implementation of wcwidth() and wcswidth() as defined in
|
|
|
|
* "The Single UNIX Specification, Version 2, The Open Group, 1997"
|
|
|
|
* <http://www.UNIX-systems.org/online.html>
|
|
|
|
*
|
|
|
|
* Markus Kuhn -- 2001-09-08 -- public domain
|
|
|
|
*
|
|
|
|
* customised for PostgreSQL
|
|
|
|
*
|
|
|
|
* original available at: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
|
|
|
|
*/
|
|
|
|
|
|
|
|
struct mbinterval
|
|
|
|
{
|
|
|
|
unsigned short first;
|
|
|
|
unsigned short last;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* auxiliary function for binary search in interval table */
|
|
|
|
static int
|
2006-10-04 02:30:14 +02:00
|
|
|
mbbisearch(pg_wchar ucs, const struct mbinterval * table, int max)
|
2006-02-10 01:39:04 +01:00
|
|
|
{
|
|
|
|
int min = 0;
|
|
|
|
int mid;
|
|
|
|
|
|
|
|
if (ucs < table[0].first || ucs > table[max].last)
|
|
|
|
return 0;
|
|
|
|
while (max >= min)
|
|
|
|
{
|
|
|
|
mid = (min + max) / 2;
|
|
|
|
if (ucs > table[mid].last)
|
|
|
|
min = mid + 1;
|
|
|
|
else if (ucs < table[mid].first)
|
|
|
|
max = mid - 1;
|
|
|
|
else
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
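The lookup above is a binary search over a sorted table of non-overlapping closed intervals. A minimal standalone sketch of the same technique follows. It takes the number of entries rather than the index of the last entry, and stores 32-bit bounds so that a table could also hold ranges beyond U+FFFF; both choices, and the names `sketch_interval` and `sketch_in_table`, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_interval
{
    uint32_t first;
    uint32_t last;
};

/*
 * Hypothetical helper: return 1 if ucs falls inside any interval of the
 * sorted, non-overlapping table of n entries, else 0.
 */
static int
sketch_in_table(uint32_t ucs, const struct sketch_interval *table, size_t n)
{
    size_t lo = 0;
    size_t hi = n;              /* search the half-open range [lo, hi) */

    assert(table != NULL || n == 0);
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ucs > table[mid].last)
            lo = mid + 1;
        else if (ucs < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}
```

Using a half-open range with unsigned indexes avoids the `max = mid - 1` step, which would wrap around if `mid` were 0 and the index type were unsigned.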
|
|
|
|
|
|
|
|
|
|
|
|
/* The following functions define the column width of an ISO 10646
|
|
|
|
* character as follows:
|
|
|
|
*
|
|
|
|
* - The null character (U+0000) has a column width of 0.
|
|
|
|
*
|
|
|
|
* - Other C0/C1 control characters and DEL will lead to a return
|
|
|
|
* value of -1.
|
|
|
|
*
|
|
|
|
* - Non-spacing and enclosing combining characters (general
|
|
|
|
* category code Mn or Me in the Unicode database) have a
|
|
|
|
* column width of 0.
|
|
|
|
*
|
|
|
|
* - Other format characters (general category code Cf in the Unicode
|
|
|
|
* database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
|
|
|
|
*
|
|
|
|
* - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
|
|
|
|
* have a column width of 0.
|
|
|
|
*
|
|
|
|
* - Spacing characters in the East Asian Wide (W) or East Asian
|
|
|
|
* FullWidth (F) category as defined in Unicode Technical
|
|
|
|
* Report #11 have a column width of 2.
|
|
|
|
*
|
|
|
|
* - All remaining characters (including all printable
|
|
|
|
* ISO 8859-1 and WGL4 characters, Unicode control characters,
|
|
|
|
* etc.) have a column width of 1.
|
|
|
|
*
|
|
|
|
* This implementation assumes that wchar_t characters are encoded
|
|
|
|
* in ISO 10646.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int
|
|
|
|
ucs_wcwidth(pg_wchar ucs)
|
|
|
|
{
|
|
|
|
/* sorted list of non-overlapping intervals of non-spacing characters */
|
|
|
|
static const struct mbinterval combining[] = {
|
|
|
|
{0x0300, 0x034E}, {0x0360, 0x0362}, {0x0483, 0x0486},
|
|
|
|
{0x0488, 0x0489}, {0x0591, 0x05A1}, {0x05A3, 0x05B9},
|
|
|
|
{0x05BB, 0x05BD}, {0x05BF, 0x05BF}, {0x05C1, 0x05C2},
|
|
|
|
{0x05C4, 0x05C4}, {0x064B, 0x0655}, {0x0670, 0x0670},
|
|
|
|
{0x06D6, 0x06E4}, {0x06E7, 0x06E8}, {0x06EA, 0x06ED},
|
|
|
|
{0x070F, 0x070F}, {0x0711, 0x0711}, {0x0730, 0x074A},
|
|
|
|
{0x07A6, 0x07B0}, {0x0901, 0x0902}, {0x093C, 0x093C},
|
|
|
|
{0x0941, 0x0948}, {0x094D, 0x094D}, {0x0951, 0x0954},
|
|
|
|
{0x0962, 0x0963}, {0x0981, 0x0981}, {0x09BC, 0x09BC},
|
|
|
|
{0x09C1, 0x09C4}, {0x09CD, 0x09CD}, {0x09E2, 0x09E3},
|
|
|
|
{0x0A02, 0x0A02}, {0x0A3C, 0x0A3C}, {0x0A41, 0x0A42},
|
|
|
|
{0x0A47, 0x0A48}, {0x0A4B, 0x0A4D}, {0x0A70, 0x0A71},
|
|
|
|
{0x0A81, 0x0A82}, {0x0ABC, 0x0ABC}, {0x0AC1, 0x0AC5},
|
|
|
|
{0x0AC7, 0x0AC8}, {0x0ACD, 0x0ACD}, {0x0B01, 0x0B01},
|
|
|
|
{0x0B3C, 0x0B3C}, {0x0B3F, 0x0B3F}, {0x0B41, 0x0B43},
|
|
|
|
{0x0B4D, 0x0B4D}, {0x0B56, 0x0B56}, {0x0B82, 0x0B82},
|
|
|
|
{0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}, {0x0C3E, 0x0C40},
|
|
|
|
{0x0C46, 0x0C48}, {0x0C4A, 0x0C4D}, {0x0C55, 0x0C56},
|
|
|
|
{0x0CBF, 0x0CBF}, {0x0CC6, 0x0CC6}, {0x0CCC, 0x0CCD},
|
|
|
|
{0x0D41, 0x0D43}, {0x0D4D, 0x0D4D}, {0x0DCA, 0x0DCA},
|
|
|
|
{0x0DD2, 0x0DD4}, {0x0DD6, 0x0DD6}, {0x0E31, 0x0E31},
|
|
|
|
{0x0E34, 0x0E3A}, {0x0E47, 0x0E4E}, {0x0EB1, 0x0EB1},
|
|
|
|
{0x0EB4, 0x0EB9}, {0x0EBB, 0x0EBC}, {0x0EC8, 0x0ECD},
|
|
|
|
{0x0F18, 0x0F19}, {0x0F35, 0x0F35}, {0x0F37, 0x0F37},
|
|
|
|
{0x0F39, 0x0F39}, {0x0F71, 0x0F7E}, {0x0F80, 0x0F84},
|
|
|
|
{0x0F86, 0x0F87}, {0x0F90, 0x0F97}, {0x0F99, 0x0FBC},
|
|
|
|
{0x0FC6, 0x0FC6}, {0x102D, 0x1030}, {0x1032, 0x1032},
|
|
|
|
{0x1036, 0x1037}, {0x1039, 0x1039}, {0x1058, 0x1059},
|
|
|
|
{0x1160, 0x11FF}, {0x17B7, 0x17BD}, {0x17C6, 0x17C6},
|
|
|
|
{0x17C9, 0x17D3}, {0x180B, 0x180E}, {0x18A9, 0x18A9},
|
|
|
|
{0x200B, 0x200F}, {0x202A, 0x202E}, {0x206A, 0x206F},
|
|
|
|
{0x20D0, 0x20E3}, {0x302A, 0x302F}, {0x3099, 0x309A},
|
|
|
|
{0xFB1E, 0xFB1E}, {0xFE20, 0xFE23}, {0xFEFF, 0xFEFF},
|
|
|
|
{0xFFF9, 0xFFFB}
|
|
|
|
};
|
|
|
|
|
|
|
|
/* handle NUL, then test for 8-bit control characters */
|
|
|
|
if (ucs == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
/* binary search in table of non-spacing characters */
|
|
|
|
if (mbbisearch(ucs, combining,
|
|
|
|
sizeof(combining) / sizeof(struct mbinterval) - 1))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if we arrive here, ucs is not a combining or C0/C1 control character
|
|
|
|
*/
|
|
|
|
|
|
|
|
return 1 +
|
|
|
|
(ucs >= 0x1100 &&
|
|
|
|
(ucs <= 0x115f || /* Hangul Jamo init. consonants */
|
|
|
|
(ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
|
|
|
|
ucs != 0x303f) || /* CJK ... Yi */
|
|
|
|
(ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
|
|
|
|
(ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility
|
|
|
|
* Ideographs */
|
|
|
|
(ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
|
|
|
|
(ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
|
|
|
|
(ucs >= 0xffe0 && ucs <= 0xffe6) ||
|
|
|
|
(ucs >= 0x20000 && ucs <= 0x2ffff)));
|
|
|
|
}
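The per-character width rules above are usually consumed by summing widths over a string. The sketch below shows that accumulation with a deliberately cut-down width function: it knows one combining range and three double-width ranges taken from the code above, nothing more. `sketch_wcwidth` and `sketch_wcswidth` are hypothetical names, and the reduced range list is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down width rule: a small subset of the ranges used above. */
static int
sketch_wcwidth(uint32_t ucs)
{
    if (ucs == 0)
        return 0;
    if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x10ffff)
        return -1;              /* control character or out of range */
    if (ucs >= 0x0300 && ucs <= 0x034e)
        return 0;               /* one combining range only */
    if ((ucs >= 0x1100 && ucs <= 0x115f) ||     /* Hangul Jamo init. consonants */
        (ucs >= 0xac00 && ucs <= 0xd7a3) ||     /* Hangul Syllables */
        (ucs >= 0xff00 && ucs <= 0xff5f))       /* Fullwidth Forms */
        return 2;
    return 1;
}

/*
 * Sum the column widths of up to n code points, stopping at a 0
 * terminator.  Returns -1 if any code point has no defined width.
 */
static int
sketch_wcswidth(const uint32_t *s, size_t n)
{
    int width = 0;

    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n && s[i] != 0; i++)
    {
        int w = sketch_wcwidth(s[i]);

        if (w < 0)
            return -1;
        width += w;
    }
    return width;
}
```

With these rules, the sequence `e`, U+0301 (combining acute), U+D55C (a Hangul syllable) occupies 1 + 0 + 2 = 3 columns.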
|
|
|
|
|
2010-08-18 21:54:01 +02:00
|
|
|
/*
|
|
|
|
* Convert a UTF-8 character to a Unicode code point.
|
|
|
|
* This is a one-character version of pg_utf2wchar_with_len.
|
|
|
|
*
|
|
|
|
* No error checks here; c must point to a long-enough string.
|
|
|
|
*/
|
|
|
|
pg_wchar
|
|
|
|
utf8_to_unicode(const unsigned char *c)
|
2006-02-10 01:39:04 +01:00
|
|
|
{
|
|
|
|
if ((*c & 0x80) == 0)
|
|
|
|
return (pg_wchar) c[0];
|
|
|
|
else if ((*c & 0xe0) == 0xc0)
|
|
|
|
return (pg_wchar) (((c[0] & 0x1f) << 6) |
|
|
|
|
(c[1] & 0x3f));
|
|
|
|
else if ((*c & 0xf0) == 0xe0)
|
|
|
|
return (pg_wchar) (((c[0] & 0x0f) << 12) |
|
|
|
|
((c[1] & 0x3f) << 6) |
|
|
|
|
(c[2] & 0x3f));
|
2007-01-24 18:12:17 +01:00
|
|
|
else if ((*c & 0xf8) == 0xf0)
|
2006-02-10 01:39:04 +01:00
|
|
|
return (pg_wchar) (((c[0] & 0x07) << 18) |
|
|
|
|
((c[1] & 0x3f) << 12) |
|
|
|
|
((c[2] & 0x3f) << 6) |
|
|
|
|
(c[3] & 0x3f));
|
|
|
|
else
|
|
|
|
/* that is an invalid code on purpose */
|
|
|
|
return 0xffffffff;
|
|
|
|
}
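Because `utf8_to_unicode()` performs no error checks, a caller holding untrusted bytes has to establish first that enough bytes are available. One way to do that is a checked wrapper along the following lines. It returns the same out-of-range marker, `0xffffffff`, when the buffer is too short or a continuation byte is malformed. The name `sketch_utf8_decode_checked` and its `avail` parameter are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID 0xffffffffu

/*
 * Hypothetical checked decoder: decode one character from c, reading at
 * most avail bytes.  Returns SKETCH_INVALID on a bad lead byte, a short
 * buffer, or a byte that is not a continuation byte (10xxxxxx).
 */
static uint32_t
sketch_utf8_decode_checked(const unsigned char *c, size_t avail)
{
    size_t   need;
    uint32_t cp;

    assert(c != NULL);
    if (avail == 0)
        return SKETCH_INVALID;
    if ((*c & 0x80) == 0)
        return c[0];
    else if ((*c & 0xe0) == 0xc0) { need = 2; cp = c[0] & 0x1f; }
    else if ((*c & 0xf0) == 0xe0) { need = 3; cp = c[0] & 0x0f; }
    else if ((*c & 0xf8) == 0xf0) { need = 4; cp = c[0] & 0x07; }
    else
        return SKETCH_INVALID;

    if (avail < need)
        return SKETCH_INVALID;
    for (size_t i = 1; i < need; i++)
    {
        if ((c[i] & 0xc0) != 0x80)
            return SKETCH_INVALID;
        cp = (cp << 6) | (c[i] & 0x3f);
    }
    return cp;
}
```

The sketch does not reject overlong encodings or surrogate code points; a full validity check would need those tests as well.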
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
2004-12-03 02:20:33 +01:00
|
|
|
pg_utf_dsplen(const unsigned char *s)
|
2004-03-15 11:41:26 +01:00
|
|
|
{
|
2010-08-18 21:54:01 +02:00
|
|
|
return ucs_wcwidth(utf8_to_unicode(s));
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-03-15 08:39:04 +01:00
|
|
|
/*
|
|
|
|
* convert mule internal code to pg_wchar
|
|
|
|
* caller should allocate enough space for "to"
|
|
|
|
* len: length of from.
|
|
|
|
* "from" not necessarily null terminated.
|
|
|
|
*/
|
2000-08-27 12:40:48 +02:00
|
|
|
static int
|
2001-10-25 07:50:21 +02:00
|
|
|
pg_mule2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-03-15 08:39:04 +01:00
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
int cnt = 0;
|
2000-08-27 12:40:48 +02:00
|
|
|
|
2001-03-08 01:24:34 +01:00
|
|
|
while (len > 0 && *from)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
2001-03-08 01:24:34 +01:00
|
|
|
if (IS_LC1(*from) && len >= 2)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to = *from++ << 16;
|
|
|
|
*to |= *from++;
|
|
|
|
len -= 2;
|
|
|
|
}
|
2001-03-08 01:24:34 +01:00
|
|
|
else if (IS_LCPRV1(*from) && len >= 3)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
|
|
|
*to = *from++ << 16;
|
|
|
|
*to |= *from++;
|
|
|
|
len -= 3;
|
|
|
|
}
|
2001-03-08 01:24:34 +01:00
|
|
|
else if (IS_LC2(*from) && len >= 3)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
*to = *from++ << 16;
|
|
|
|
*to |= *from++ << 8;
|
|
|
|
*to |= *from++;
|
|
|
|
len -= 3;
|
|
|
|
}
|
2001-03-08 01:24:34 +01:00
|
|
|
else if (IS_LCPRV2(*from) && len >= 4)
|
1998-09-01 06:40:42 +02:00
|
|
|
{
|
|
|
|
from++;
|
|
|
|
*to = *from++ << 16;
|
|
|
|
*to |= *from++ << 8;
|
|
|
|
*to |= *from++;
|
|
|
|
len -= 4;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{ /* assume ASCII */
|
|
|
|
*to = (unsigned char) *from++;
|
|
|
|
len--;
|
|
|
|
}
|
|
|
|
to++;
|
2000-08-27 12:40:48 +02:00
|
|
|
cnt++;
|
1998-09-01 06:40:42 +02:00
|
|
|
}
|
|
|
|
*to = 0;
|
2005-12-24 17:49:48 +01:00
|
|
|
return cnt;
|
1998-03-15 08:39:04 +01:00
|
|
|
}
|
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
int
|
|
|
|
pg_mule_mblen(const unsigned char *s)
|
1998-04-27 19:10:50 +02:00
|
|
|
{
|
1998-09-01 06:40:42 +02:00
|
|
|
int len;
|
1998-04-27 19:10:50 +02:00
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
if (IS_LC1(*s))
|
|
|
|
len = 2;
|
|
|
|
else if (IS_LCPRV1(*s))
|
|
|
|
len = 3;
|
|
|
|
else if (IS_LC2(*s))
|
|
|
|
len = 3;
|
|
|
|
else if (IS_LCPRV2(*s))
|
|
|
|
len = 4;
|
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* assume ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
1998-04-27 19:10:50 +02:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_mule_dsplen(const unsigned char *s)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int len;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
if (IS_LC1(*s))
|
|
|
|
len = 1;
|
|
|
|
else if (IS_LCPRV1(*s))
|
|
|
|
len = 1;
|
|
|
|
else if (IS_LC2(*s))
|
|
|
|
len = 2;
|
|
|
|
else if (IS_LCPRV2(*s))
|
|
|
|
len = 2;
|
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* assume ASCII */
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-06-16 09:29:54 +02:00
|
|
|
/*
|
|
|
|
* ISO8859-1
|
|
|
|
*/
|
2000-08-27 12:40:48 +02:00
|
|
|
static int
|
2001-10-25 07:50:21 +02:00
|
|
|
pg_latin12wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
|
1998-04-27 19:10:50 +02:00
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
int cnt = 0;
|
2000-08-27 12:40:48 +02:00
|
|
|
|
2001-03-08 01:24:34 +01:00
|
|
|
while (len > 0 && *from)
|
2000-08-27 12:40:48 +02:00
|
|
|
{
|
1998-09-01 06:40:42 +02:00
|
|
|
*to++ = *from++;
|
2001-03-08 01:24:34 +01:00
|
|
|
len--;
|
2000-08-27 12:40:48 +02:00
|
|
|
cnt++;
|
|
|
|
}
|
1998-09-01 06:40:42 +02:00
|
|
|
*to = 0;
|
2005-12-24 17:49:48 +01:00
|
|
|
return cnt;
|
1998-04-27 19:10:50 +02:00
|
|
|
}
|
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
static int
|
|
|
|
pg_latin1_mblen(const unsigned char *s)
|
1998-04-27 19:10:50 +02:00
|
|
|
{
|
2005-12-24 17:49:48 +01:00
|
|
|
return 1;
|
1998-04-27 19:10:50 +02:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_latin1_dsplen(const unsigned char *s)
|
|
|
|
{
|
2006-02-10 01:39:04 +01:00
|
|
|
return pg_ascii_dsplen(s);
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1998-06-16 09:29:54 +02:00
|
|
|
/*
|
|
|
|
* SJIS
|
|
|
|
*/
|
1998-09-01 06:40:42 +02:00
|
|
|
static int
|
|
|
|
pg_sjis_mblen(const unsigned char *s)
|
1998-04-27 19:10:50 +02:00
|
|
|
{
|
1998-09-01 06:40:42 +02:00
|
|
|
int len;
|
1998-04-27 19:10:50 +02:00
|
|
|
|
1998-09-01 06:40:42 +02:00
|
|
|
if (*s >= 0xa1 && *s <= 0xdf)
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* 1 byte kana? */
|
2005-12-26 20:30:45 +01:00
|
|
|
else if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* kanji? */
|
1998-09-01 06:40:42 +02:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
1998-04-27 19:10:50 +02:00
|
|
|
}
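One common use of a per-character length function in a mixed-width encoding such as Shift-JIS is clipping a string to a byte limit without cutting a two-byte character in half. The sketch below shows that idea. `sketch_sjis_mblen` repeats the lead-byte rule above with the high-bit test written out; `sketch_sjis_cliplen` is a hypothetical helper, not a function from this file.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte rule: 0xa1..0xdf is 1-byte kana, other high-bit bytes start 2 bytes. */
static int
sketch_sjis_mblen(const unsigned char *s)
{
    if (*s >= 0xa1 && *s <= 0xdf)
        return 1;
    if (*s & 0x80)
        return 2;
    return 1;
}

/*
 * Hypothetical helper: return the largest byte count <= limit (and <= len)
 * that ends on a character boundary of the Shift-JIS string s.
 */
static size_t
sketch_sjis_cliplen(const unsigned char *s, size_t len, size_t limit)
{
    size_t clen = 0;

    assert(s != NULL);
    while (clen < len)
    {
        size_t l = (size_t) sketch_sjis_mblen(s + clen);

        if (clen + l > limit || clen + l > len)
            break;              /* next character would not fit whole */
        clen += l;
    }
    return clen;
}
```

For the bytes `41 82 A0 B1` (an ASCII letter, one two-byte character, one single-byte kana), a limit of 2 clips to 1 byte, because the two-byte character starting at offset 1 would otherwise be split.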
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_sjis_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
|
|
|
if (*s >= 0xa1 && *s <= 0xdf)
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* 1 byte kana? */
|
2005-12-26 20:30:45 +01:00
|
|
|
else if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* kanji? */
|
2004-03-15 11:41:26 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = pg_ascii_dsplen(s); /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
1999-02-02 19:51:40 +01:00
|
|
|
/*
|
|
|
|
* Big5
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
pg_big5_mblen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2005-12-26 20:30:45 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* hanzi? */
|
1999-02-02 19:51:40 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
1999-02-02 19:51:40 +01:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_big5_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2005-12-26 20:30:45 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* hanzi? */
|
2004-03-15 11:41:26 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = pg_ascii_dsplen(s); /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
2002-03-05 06:52:50 +01:00
|
|
|
/*
|
|
|
|
* GBK
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
pg_gbk_mblen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2005-12-26 20:30:45 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* hanzi? */
|
2002-03-05 06:52:50 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2002-03-05 06:52:50 +01:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_gbk_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2005-12-26 20:30:45 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* hanzi? */
|
2004-03-15 11:41:26 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = pg_ascii_dsplen(s); /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
2002-03-05 06:52:50 +01:00
|
|
|
/*
|
|
|
|
* UHC
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
pg_uhc_mblen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2005-12-26 20:30:45 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* 2-byte? */
|
2002-03-05 06:52:50 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2002-03-05 06:52:50 +01:00
|
|
|
}
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_uhc_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2005-12-26 20:30:45 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 2; /* 2-byte? */
|
2004-03-15 11:41:26 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = pg_ascii_dsplen(s); /* should be ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
2002-06-13 10:30:22 +02:00
|
|
|
/*
|
2002-09-04 22:31:48 +02:00
|
|
|
* GB18030
|
|
|
|
* Added by Bill Huang <bhuang@redhat.com>, <bill_huanghb@ybb.ne.jp>
|
|
|
|
*/
|
2002-06-13 10:30:22 +02:00
|
|
|
static int
|
|
|
|
pg_gb18030_mblen(const unsigned char *s)
|
|
|
|
{
|
2002-09-04 22:31:48 +02:00
|
|
|
int len;
|
|
|
|
|
2005-12-26 20:30:45 +01:00
|
|
|
if (!IS_HIGHBIT_SET(*s))
|
2006-10-04 02:30:14 +02:00
|
|
|
len = 1; /* ASCII */
|
2002-09-04 22:31:48 +02:00
|
|
|
else
|
|
|
|
{
|
|
|
|
if ((*(s + 1) >= 0x40 && *(s + 1) <= 0x7e) || (*(s + 1) >= 0x80 && *(s + 1) <= 0xfe))
|
|
|
|
len = 2;
|
|
|
|
else if (*(s + 1) >= 0x30 && *(s + 1) <= 0x39)
|
|
|
|
len = 4;
|
|
|
|
else
|
|
|
|
len = 2;
|
|
|
|
}
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2002-06-13 10:30:22 +02:00
|
|
|
}
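The length function above inspects the second byte unconditionally, so it relies on the caller to guarantee that a second byte exists. A bounds-aware variant might look like the sketch below, which takes the number of available bytes and looks at `s[1]` only when it is present. The name `sketch_gb18030_mblen` and the `avail` parameter are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical bounds-aware variant: return the length implied by the
 * first one or two bytes of a GB18030 character.  A second byte in
 * 0x30..0x39 marks a 4-byte character; every other high-bit lead byte
 * is treated as starting a 2-byte character.  The result may exceed
 * avail, so the caller must still compare the two.
 */
static int
sketch_gb18030_mblen(const unsigned char *s, size_t avail)
{
    assert(s != NULL && avail > 0);
    if ((s[0] & 0x80) == 0)
        return 1;               /* ASCII */
    if (avail >= 2 && s[1] >= 0x30 && s[1] <= 0x39)
        return 4;
    return 2;
}
```

For inputs with at least two bytes available, the sketch returns the same lengths as the three-way test above, since both of its remaining branches yield 2.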
|
|
|
|
|
2004-03-15 11:41:26 +01:00
|
|
|
static int
|
|
|
|
pg_gb18030_dsplen(const unsigned char *s)
|
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2006-02-10 01:39:04 +01:00
|
|
|
if (IS_HIGHBIT_SET(*s))
|
2004-03-15 11:41:26 +01:00
|
|
|
len = 2;
|
2006-02-10 01:39:04 +01:00
|
|
|
else
|
2006-10-04 02:30:14 +02:00
|
|
|
len = pg_ascii_dsplen(s); /* ASCII */
|
2005-12-24 17:49:48 +01:00
|
|
|
return len;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
/*
|
|
|
|
*-------------------------------------------------------------------
|
|
|
|
* multibyte sequence validators
|
|
|
|
*
|
|
|
|
* These functions accept "s", a pointer to the first byte of a string,
|
|
|
|
* and "len", the remaining length of the string. If there is a validly
|
|
|
|
* encoded character beginning at *s, return its length in bytes; else
|
|
|
|
* return -1.
|
|
|
|
*
|
|
|
|
* The functions can assume that len > 0 and that *s != '\0', but they must
|
|
|
|
* test for and reject zeroes in any additional bytes of a multibyte character.
|
|
|
|
*
|
|
|
|
* Note that this definition allows the function for a single-byte
|
|
|
|
* encoding to be just "return 1".
|
|
|
|
*-------------------------------------------------------------------
|
|
|
|
*/
|
2002-09-04 22:31:48 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
static int
|
|
|
|
pg_ascii_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
1998-06-16 09:29:54 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
#define IS_EUC_RANGE_VALID(c) ((c) >= 0xa1 && (c) <= 0xfe)
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_eucjp_verifier(const unsigned char *s, int len)
|
1998-06-16 09:29:54 +02:00
|
|
|
{
|
2006-05-21 22:05:21 +02:00
|
|
|
int l;
|
2006-10-04 02:30:14 +02:00
|
|
|
unsigned char c1,
|
|
|
|
c2;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
c1 = *s++;
|
|
|
|
|
|
|
|
switch (c1)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
case SS2: /* JIS X 0201 */
|
2006-05-21 22:05:21 +02:00
|
|
|
l = 2;
|
|
|
|
if (l > len)
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (c2 < 0xa1 || c2 > 0xdf)
|
|
|
|
return -1;
|
|
|
|
break;
|
|
|
|
|
2006-10-04 02:30:14 +02:00
|
|
|
case SS3: /* JIS X 0212 */
|
2006-05-21 22:05:21 +02:00
|
|
|
l = 3;
|
|
|
|
if (l > len)
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c2))
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c2))
|
|
|
|
return -1;
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */
|
|
|
|
{
|
|
|
|
l = 2;
|
|
|
|
if (l > len)
|
|
|
|
return -1;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c1))
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c2))
|
|
|
|
return -1;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
else
|
|
|
|
/* must be ASCII */
|
2006-05-21 22:05:21 +02:00
|
|
|
{
|
|
|
|
l = 1;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return l;
|
1998-06-16 09:29:54 +02:00
|
|
|
}
|
2001-02-11 02:59:22 +01:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
static int
|
|
|
|
pg_euckr_verifier(const unsigned char *s, int len)
|
2001-02-11 02:59:22 +01:00
|
|
|
{
|
2006-05-21 22:05:21 +02:00
|
|
|
int l;
|
2006-10-04 02:30:14 +02:00
|
|
|
unsigned char c1,
|
|
|
|
c2;
|
2001-09-06 06:57:30 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
c1 = *s++;
|
|
|
|
|
|
|
|
if (IS_HIGHBIT_SET(c1))
|
|
|
|
{
|
|
|
|
l = 2;
|
|
|
|
if (l > len)
|
|
|
|
return -1;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c1))
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c2))
|
|
|
|
return -1;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
else
|
|
|
|
/* must be ASCII */
|
2006-05-21 22:05:21 +02:00
|
|
|
{
|
|
|
|
l = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return l;
|
2001-02-11 02:59:22 +01:00
|
|
|
}
|
2001-09-11 06:50:36 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
/* EUC-CN byte sequences are exactly the same as EUC-KR */
|
|
|
|
#define pg_euccn_verifier pg_euckr_verifier
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_euctw_verifier(const unsigned char *s, int len)
|
2004-03-15 11:41:26 +01:00
|
|
|
{
|
2006-05-21 22:05:21 +02:00
|
|
|
int l;
|
2006-10-04 02:30:14 +02:00
|
|
|
unsigned char c1,
|
|
|
|
c2;
|
2004-03-15 11:41:26 +01:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
c1 = *s++;
|
|
|
|
|
|
|
|
switch (c1)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
case SS2: /* CNS 11643 Plane 1-7 */
|
2006-05-21 22:05:21 +02:00
|
|
|
l = 4;
|
|
|
|
if (l > len)
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (c2 < 0xa1 || c2 > 0xa7)
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c2))
|
|
|
|
return -1;
|
|
|
|
c2 = *s++;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c2))
|
|
|
|
return -1;
|
|
|
|
break;
|
|
|
|
|
2006-10-04 02:30:14 +02:00
|
|
|
case SS3: /* unused */
|
2006-05-21 22:05:21 +02:00
|
|
|
return -1;
|
|
|
|
|
|
|
|
default:
|
|
|
|
if (IS_HIGHBIT_SET(c1)) /* CNS 11643 Plane 1 */
|
|
|
|
{
|
|
|
|
l = 2;
|
|
|
|
if (l > len)
|
|
|
|
return -1;
|
|
|
|
/* no further range check on c1? */
|
|
|
|
c2 = *s++;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c2))
|
|
|
|
return -1;
|
|
|
|
}
|
2006-10-04 02:30:14 +02:00
|
|
|
else
|
|
|
|
/* must be ASCII */
|
2006-05-21 22:05:21 +02:00
|
|
|
{
|
|
|
|
l = 1;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return l;
|
2004-03-15 11:41:26 +01:00
|
|
|
}
|
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
static int
|
|
|
|
pg_johab_verifier(const unsigned char *s, int len)
|
2001-09-21 17:27:38 +02:00
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l,
|
|
|
|
mbl;
|
2006-05-21 22:05:21 +02:00
|
|
|
unsigned char c;
|
2001-09-21 17:27:38 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
l = mbl = pg_johab_mblen(s);
|
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (!IS_HIGHBIT_SET(*s))
|
|
|
|
return mbl;
|
|
|
|
|
|
|
|
while (--l > 0)
|
|
|
|
{
|
|
|
|
c = *++s;
|
|
|
|
if (!IS_EUC_RANGE_VALID(c))
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
return mbl;
|
2001-09-21 17:27:38 +02:00
|
|
|
}
|
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
static int
|
|
|
|
pg_mule_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l,
|
|
|
|
mbl;
|
2006-05-21 22:05:21 +02:00
|
|
|
unsigned char c;
|
|
|
|
|
|
|
|
l = mbl = pg_mule_mblen(s);
|
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
while (--l > 0)
|
|
|
|
{
|
|
|
|
c = *++s;
|
|
|
|
if (!IS_HIGHBIT_SET(c))
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
return mbl;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_latin1_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_sjis_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l,
|
|
|
|
mbl;
|
|
|
|
unsigned char c1,
|
|
|
|
c2;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
l = mbl = pg_sjis_mblen(s);
|
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (l == 1) /* pg_sjis_mblen already verified it */
|
|
|
|
return mbl;
|
|
|
|
|
|
|
|
c1 = *s++;
|
|
|
|
c2 = *s;
|
|
|
|
if (!ISSJISHEAD(c1) || !ISSJISTAIL(c2))
|
|
|
|
return -1;
|
|
|
|
return mbl;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_big5_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l,
|
|
|
|
mbl;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
l = mbl = pg_big5_mblen(s);
|
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
while (--l > 0)
|
|
|
|
{
|
|
|
|
if (*++s == '\0')
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return mbl;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_gbk_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l,
|
|
|
|
mbl;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
l = mbl = pg_gbk_mblen(s);
|
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
while (--l > 0)
|
|
|
|
{
|
|
|
|
if (*++s == '\0')
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return mbl;
|
|
|
|
}
|
2003-07-27 06:53:12 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
static int
|
|
|
|
pg_uhc_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l,
|
|
|
|
mbl;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
l = mbl = pg_uhc_mblen(s);
|
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
while (--l > 0)
|
|
|
|
{
|
|
|
|
if (*++s == '\0')
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return mbl;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_gb18030_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l,
|
|
|
|
mbl;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
l = mbl = pg_gb18030_mblen(s);
|
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
while (--l > 0)
|
|
|
|
{
|
|
|
|
if (*++s == '\0')
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return mbl;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
pg_utf8_verifier(const unsigned char *s, int len)
|
|
|
|
{
|
2006-10-04 02:30:14 +02:00
|
|
|
int l = pg_utf_mblen(s);
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
if (len < l)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (!pg_utf8_islegal(s, l))
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
return l;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check for validity of a single UTF-8 encoded character
|
|
|
|
*
|
|
|
|
* This directly implements the rules in RFC3629. The bizarre-looking
|
|
|
|
* restrictions on the second byte are meant to ensure that there isn't
|
|
|
|
* more than one encoding of a given Unicode character point; that is,
|
|
|
|
* you may not use a longer-than-necessary byte sequence with high order
|
|
|
|
* zero bits to represent a character that would fit in fewer bytes.
|
|
|
|
* To do otherwise is to create security hazards (eg, create an apparent
|
|
|
|
* non-ASCII character that decodes to plain ASCII).
|
|
|
|
*
|
|
|
|
* length is assumed to have been obtained by pg_utf_mblen(), and the
|
|
|
|
* caller must have checked that that many bytes are present in the buffer.
|
|
|
|
*/
|
2005-10-15 04:49:52 +02:00
|
|
|
bool
|
|
|
|
pg_utf8_islegal(const unsigned char *source, int length)
|
|
|
|
{
|
|
|
|
unsigned char a;
|
|
|
|
|
|
|
|
switch (length)
|
|
|
|
{
|
|
|
|
default:
|
2006-05-21 22:05:21 +02:00
|
|
|
/* reject lengths 5 and 6 for now */
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
|
|
|
case 4:
|
2006-05-21 22:05:21 +02:00
|
|
|
a = source[3];
|
|
|
|
if (a < 0x80 || a > 0xBF)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
2006-05-21 22:05:21 +02:00
|
|
|
/* FALL THRU */
|
2005-10-15 04:49:52 +02:00
|
|
|
case 3:
|
2006-05-21 22:05:21 +02:00
|
|
|
a = source[2];
|
|
|
|
if (a < 0x80 || a > 0xBF)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
2006-05-21 22:05:21 +02:00
|
|
|
/* FALL THRU */
|
2005-10-15 04:49:52 +02:00
|
|
|
case 2:
|
2006-05-21 22:05:21 +02:00
|
|
|
a = source[1];
|
2005-10-15 04:49:52 +02:00
|
|
|
switch (*source)
|
|
|
|
{
|
|
|
|
case 0xE0:
|
2006-05-21 22:05:21 +02:00
|
|
|
if (a < 0xA0 || a > 0xBF)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
|
|
|
break;
|
|
|
|
case 0xED:
|
2006-05-21 22:05:21 +02:00
|
|
|
if (a < 0x80 || a > 0x9F)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
|
|
|
break;
|
|
|
|
case 0xF0:
|
2006-05-21 22:05:21 +02:00
|
|
|
if (a < 0x90 || a > 0xBF)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
|
|
|
break;
|
|
|
|
case 0xF4:
|
2006-05-21 22:05:21 +02:00
|
|
|
if (a < 0x80 || a > 0x8F)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
|
|
|
break;
|
|
|
|
default:
|
2006-05-21 22:05:21 +02:00
|
|
|
if (a < 0x80 || a > 0xBF)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
2006-05-21 22:05:21 +02:00
|
|
|
break;
|
2005-10-15 04:49:52 +02:00
|
|
|
}
|
2006-05-21 22:05:21 +02:00
|
|
|
/* FALL THRU */
|
2005-10-15 04:49:52 +02:00
|
|
|
case 1:
|
2006-05-21 22:05:21 +02:00
|
|
|
a = *source;
|
|
|
|
if (a >= 0x80 && a < 0xC2)
|
|
|
|
return false;
|
|
|
|
if (a > 0xF4)
|
2005-10-15 04:49:52 +02:00
|
|
|
return false;
|
2006-05-21 22:05:21 +02:00
|
|
|
break;
|
2005-10-15 04:49:52 +02:00
|
|
|
}
|
|
|
|
return true;
|
2005-06-15 02:15:08 +02:00
|
|
|
}
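
/*
 * Illustrative sketch, not part of the original file: a standalone check
 * applying the same RFC 3629 restrictions described above, so the effect of
 * the second-byte ranges can be tried in isolation. The names
 * sketch_utf8_len() and sketch_utf8_islegal() are hypothetical;
 * sketch_utf8_len() is only an assumed stand-in for pg_utf_mblen().
 */

```c
#include <stdbool.h>

/* Hypothetical helper: length implied by the first byte, or 0 if none. */
static int
sketch_utf8_len(unsigned char first)
{
	if (first < 0x80)
		return 1;
	if ((first & 0xE0) == 0xC0)
		return 2;
	if ((first & 0xF0) == 0xE0)
		return 3;
	if ((first & 0xF8) == 0xF0)
		return 4;
	return 0;					/* continuation byte or 5/6-byte lead */
}

/* Returns true if s[0..len-1] is exactly one legal UTF-8 character. */
static bool
sketch_utf8_islegal(const unsigned char *s, int len)
{
	unsigned char lo = 0x80;	/* allowed range of the second byte */
	unsigned char hi = 0xBF;
	int			i;

	if (len < 1 || len > 4 || sketch_utf8_len(s[0]) != len)
		return false;
	if (len == 1)
		return true;			/* plain ASCII */

	/* 0xC0 and 0xC1 are always overlong; above 0xF4 exceeds U+10FFFF */
	if (s[0] < 0xC2 || s[0] > 0xF4)
		return false;

	switch (s[0])
	{
		case 0xE0:
			lo = 0xA0;			/* rejects overlong 3-byte forms */
			break;
		case 0xED:
			hi = 0x9F;			/* rejects UTF-16 surrogates */
			break;
		case 0xF0:
			lo = 0x90;			/* rejects overlong 4-byte forms */
			break;
		case 0xF4:
			hi = 0x8F;			/* rejects code points above U+10FFFF */
			break;
		default:
			break;
	}
	if (s[1] < lo || s[1] > hi)
		return false;

	/* remaining bytes must be ordinary continuation bytes */
	for (i = 2; i < len; i++)
	{
		if (s[i] < 0x80 || s[i] > 0xBF)
			return false;
	}
	return true;
}
```

/*
 * Under these assumptions, the overlong pair 0xC0 0x80 (an alternative
 * spelling of NUL) is rejected, while 0xE2 0x82 0xAC (U+20AC) is accepted.
 */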
|
|
|
|
|
2011-10-29 20:22:20 +02:00
|
|
|
#ifndef FRONTEND
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Generic character increment function.
|
|
|
|
*
|
|
|
|
* Not knowing anything about the properties of the encoding in use, we just
|
|
|
|
* keep incrementing the last byte until pg_verifymbstr() likes the result,
|
|
|
|
* or we run out of values to try.
|
|
|
|
*
|
|
|
|
* Like all character-increment functions, we must restore the original input
|
|
|
|
* string on failure.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
pg_generic_charinc(unsigned char *charptr, int len)
|
|
|
|
{
|
|
|
|
unsigned char *lastchar = (unsigned char *) (charptr + len - 1);
|
|
|
|
unsigned char savelastchar = *lastchar;
|
|
|
|
const char *const_charptr = (const char *) charptr;
|
|
|
|
|
|
|
|
while (*lastchar < (unsigned char) 255)
|
|
|
|
{
|
|
|
|
(*lastchar)++;
|
|
|
|
if (!pg_verifymbstr(const_charptr, len, true))
|
|
|
|
continue;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
*lastchar = savelastchar;
|
|
|
|
return false;
|
|
|
|
}
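
/*
 * Illustrative sketch, not part of the original file: the same
 * "bump the last byte until the result validates" strategy, with the
 * validity test passed in as a callback so it can be run without a
 * database. The callback type sketch_is_valid and the predicate
 * sketch_printable_ascii() are hypothetical stand-ins for pg_verifymbstr().
 */

```c
#include <stdbool.h>

typedef bool (*sketch_is_valid) (const unsigned char *s, int len);

/* Hypothetical validity rule: every byte is printable ASCII. */
static bool
sketch_printable_ascii(const unsigned char *s, int len)
{
	int			i;

	for (i = 0; i < len; i++)
	{
		if (s[i] < 0x20 || s[i] > 0x7e)
			return false;
	}
	return true;
}

/* Increment the last byte until "valid" accepts the character. */
static bool
sketch_generic_charinc(unsigned char *charptr, int len, sketch_is_valid valid)
{
	unsigned char *last = charptr + len - 1;
	unsigned char saved = *last;

	while (*last < 255)
	{
		(*last)++;
		if (valid(charptr, len))
			return true;
	}

	/* no larger valid value exists: restore the input and report failure */
	*last = saved;
	return false;
}
```

/*
 * With this predicate, 'a' increments to 'b', while 0x7e has no valid
 * successor and is left unchanged.
 */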
|
|
|
|
|
|
|
|
/*
|
|
|
|
* UTF-8 character increment function.
|
|
|
|
*
|
|
|
|
* For a one-byte character less than 0x7F, we just increment the byte.
|
|
|
|
*
|
|
|
|
* For a multibyte character, every byte but the first must fall between 0x80
|
|
|
|
* and 0xBF; and the first byte must be between 0xC2 and 0xF4. We increment
|
|
|
|
* the last byte that's not already at its maximum value, and set any following
|
|
|
|
* bytes back to 0x80. If we can't find a byte that's less than the maximum
|
|
|
|
* allowable value, we simply fail. We also have some special-case logic to
|
|
|
|
* skip regions used for surrogate pair handling, as those should not occur in
|
|
|
|
* valid UTF-8.
|
|
|
|
*
|
|
|
|
* Like all character-increment functions, we must restore the original input
|
|
|
|
* string on failure.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
pg_utf8_increment(unsigned char *charptr, int length)
|
|
|
|
{
|
|
|
|
unsigned char a;
|
|
|
|
unsigned char bak[4];
|
|
|
|
unsigned char limit;
|
|
|
|
|
|
|
|
switch (length)
|
|
|
|
{
|
|
|
|
default:
|
|
|
|
/* reject lengths 5 and 6 for now */
|
|
|
|
return false;
|
|
|
|
case 4:
|
|
|
|
bak[3] = charptr[3];
|
|
|
|
a = charptr[3];
|
|
|
|
if (a < 0xBF)
|
|
|
|
{
|
|
|
|
charptr[3]++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
charptr[3] = 0x80;
|
|
|
|
/* FALL THRU */
|
|
|
|
case 3:
|
|
|
|
bak[2] = charptr[2];
|
|
|
|
a = charptr[2];
|
|
|
|
if (a < 0xBF)
|
|
|
|
{
|
|
|
|
charptr[2]++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
charptr[2] = 0x80;
|
|
|
|
/* FALL THRU */
|
|
|
|
case 2:
|
|
|
|
bak[1] = charptr[1];
|
|
|
|
a = charptr[1];
|
|
|
|
switch (*charptr)
|
|
|
|
{
|
|
|
|
case 0xED:
|
|
|
|
limit = 0x9F;
|
|
|
|
break;
|
|
|
|
case 0xF4:
|
|
|
|
limit = 0x8F;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
limit = 0xBF;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (a < limit)
|
|
|
|
{
|
|
|
|
charptr[1]++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
charptr[1] = 0x80;
|
|
|
|
/* FALL THRU */
|
|
|
|
case 1:
|
|
|
|
bak[0] = *charptr;
|
|
|
|
a = *charptr;
|
|
|
|
if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF4)
|
|
|
|
{
|
|
|
|
/* Restore original string. */
|
|
|
|
memcpy(charptr, bak, length);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
charptr[0]++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* EUC-JP character increment function.
|
|
|
|
*
|
|
|
|
* If the sequence starts with SS2(0x8e), it must be a two-byte sequence
|
|
|
|
* representing JIS X 0201 characters with the second byte ranging between
|
|
|
|
* 0xa1 and 0xdf. We just increment the last byte if it's less than 0xdf,
|
|
|
|
* and otherwise rewrite the whole sequence to 0xa1 0xa1.
|
|
|
|
*
|
|
|
|
* If the sequence starts with SS3(0x8f), it must be a three-byte sequence
|
|
|
|
* in which the last two bytes range between 0xa1 and 0xfe. The last byte
|
|
|
|
* is incremented, carrying overflow to the second-to-last byte.
|
|
|
|
*
|
|
|
|
* If the sequence starts with a value other than the above and its MSB
|
|
|
|
* is set, it must be a two-byte sequence representing JIS X 0208 characters
|
|
|
|
* with both bytes ranging between 0xa1 and 0xfe. The last byte is incremented,
|
|
|
|
* carrying overflow to the second-to-last byte.
|
|
|
|
*
|
|
|
|
* Otherwise, the sequence is a single byte representing an ASCII
|
|
|
|
* character. It is incremented up to 0x7f.
|
|
|
|
*
|
|
|
|
* Only the three EUC-JP byte sequences shown below, which have no character
|
|
|
|
* allocated, make this function fail even though they are valid: 0x7f,
|
|
|
|
* 0xfe 0xfe, 0x8f 0xfe 0xfe.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
pg_eucjp_increment(unsigned char *charptr, int length)
|
|
|
|
{
|
|
|
|
unsigned char bak[3];
|
|
|
|
unsigned char c1, c2;
|
|
|
|
signed int i;
|
|
|
|
|
|
|
|
c1 = *charptr;
|
|
|
|
|
|
|
|
switch (c1)
|
|
|
|
{
|
|
|
|
case SS2: /* JIS X 0201 */
|
|
|
|
if (length != 2)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
c2 = charptr[1];
|
|
|
|
|
|
|
|
if (c2 > 0xde)
|
|
|
|
charptr[0] = charptr[1] = 0xa1;
|
|
|
|
else if (c2 < 0xa1)
|
|
|
|
charptr[1] = 0xa1;
|
|
|
|
else
|
|
|
|
charptr[1]++;
|
|
|
|
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SS3: /* JIS X 0212 */
|
|
|
|
if (length != 3)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
for (i = 2; i > 0; i--)
|
|
|
|
{
|
|
|
|
bak[i] = charptr[i];
|
|
|
|
c2 = charptr[i];
|
|
|
|
if (c2 < 0xa1)
|
|
|
|
{
|
|
|
|
charptr[i] = 0xa1;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
else if (c2 < 0xfe)
|
|
|
|
{
|
|
|
|
charptr[i]++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
charptr[i] = 0xa1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (i == 0) /* Out of 3-byte code region */
|
|
|
|
{
|
|
|
|
charptr[1] = bak[1];
|
|
|
|
charptr[2] = bak[2];
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */
|
|
|
|
{
|
|
|
|
if (length != 2)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
for (i = 1; i >= 0; i--) /* i must be signed */
|
|
|
|
{
|
|
|
|
bak[i] = charptr[i];
|
|
|
|
c2 = charptr[i];
|
|
|
|
if (c2 < 0xa1)
|
|
|
|
{
|
|
|
|
charptr[i] = 0xa1;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
else if (c2 < 0xfe)
|
|
|
|
{
|
|
|
|
charptr[i]++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
charptr[i] = 0xa1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (i < 0) /* Out of 2 byte code region */
|
|
|
|
{
|
|
|
|
charptr[0] = bak[0];
|
|
|
|
charptr[1] = bak[1];
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{ /* ASCII, single byte */
|
|
|
|
if (c1 > 0x7e)
|
|
|
|
return false;
|
|
|
|
(*charptr)++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
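
/*
 * Illustrative sketch, not part of the original file: the carry scheme the
 * comment above describes for the two-byte JIS X 0208 case, isolated into a
 * hypothetical helper named sketch_euc2_increment(). It assumes the caller
 * already knows the character is exactly two bytes long.
 */

```c
#include <stdbool.h>

/* Increment a two-byte character whose bytes both range over 0xa1..0xfe. */
static bool
sketch_euc2_increment(unsigned char *c)
{
	unsigned char bak[2];
	int			i;				/* must be signed: the loop ends at -1 */

	bak[0] = c[0];
	bak[1] = c[1];

	for (i = 1; i >= 0; i--)
	{
		if (c[i] < 0xa1)
		{
			/* below the valid range: the smallest valid byte is larger */
			c[i] = 0xa1;
			return true;
		}
		if (c[i] < 0xfe)
		{
			c[i]++;
			return true;
		}
		/* this byte is at its maximum: wrap it and carry to the left */
		c[i] = 0xa1;
	}

	/* carried out of the first byte: restore the input and fail */
	c[0] = bak[0];
	c[1] = bak[1];
	return false;
}
```

/*
 * For example, 0xa1 0xfe carries into the first byte and becomes 0xa2 0xa1,
 * while 0xfe 0xfe has no successor and is left untouched.
 */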
|
|
|
|
#endif
|
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
/*
|
|
|
|
*-------------------------------------------------------------------
|
|
|
|
* encoding info table
|
2007-10-16 00:46:27 +02:00
|
|
|
* XXX must be sorted in the same order as enum pg_enc (in mb/pg_wchar.h)
|
2006-05-21 22:05:21 +02:00
|
|
|
*-------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
pg_wchar_tbl pg_wchar_table[] = {
|
2009-06-11 16:49:15 +02:00
|
|
|
{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1}, /* PG_SQL_ASCII */
|
2009-02-10 17:44:44 +01:00
|
|
|
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JP */
|
|
|
|
{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2}, /* PG_EUC_CN */
|
|
|
|
{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3}, /* PG_EUC_KR */
|
|
|
|
{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4}, /* PG_EUC_TW */
|
|
|
|
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */
|
|
|
|
{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /* PG_UTF8 */
|
|
|
|
{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4}, /* PG_MULE_INTERNAL */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN1 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN2 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN3 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN4 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN5 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN6 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN7 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN8 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN9 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN10 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1256 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1258 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN866 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN874 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8R */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1251 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1252 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-5 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-6 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-7 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-8 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1250 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1253 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1254 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1255 */
|
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1257 */
|
2009-02-10 20:29:39 +01:00
|
|
|
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8U */
|
2009-02-10 17:44:44 +01:00
|
|
|
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
|
|
|
|
{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
|
|
|
|
{0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */
|
|
|
|
{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */
|
|
|
|
{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /* PG_GB18030 */
|
|
|
|
{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */
|
|
|
|
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /* PG_SHIFT_JIS_2004 */
|
2006-05-21 22:05:21 +02:00
|
|
|
};
|
|
|
|
|
|
|
|
/* returns the byte length of a word for mule internal code */
|
|
|
|
int
|
|
|
|
pg_mic_mblen(const unsigned char *mbstr)
|
|
|
|
{
|
|
|
|
return pg_mule_mblen(mbstr);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns the byte length of a multibyte character.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
pg_encoding_mblen(int encoding, const char *mbstr)
|
|
|
|
{
|
|
|
|
Assert(PG_VALID_ENCODING(encoding));
|
|
|
|
|
|
|
|
return ((encoding >= 0 &&
|
|
|
|
encoding < sizeof(pg_wchar_table) / sizeof(pg_wchar_tbl)) ?
|
|
|
|
((*pg_wchar_table[encoding].mblen) ((const unsigned char *) mbstr)) :
|
|
|
|
((*pg_wchar_table[PG_SQL_ASCII].mblen) ((const unsigned char *) mbstr)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns the display length of a multibyte character.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
pg_encoding_dsplen(int encoding, const char *mbstr)
|
|
|
|
{
|
|
|
|
Assert(PG_VALID_ENCODING(encoding));
|
|
|
|
|
|
|
|
return ((encoding >= 0 &&
|
|
|
|
encoding < sizeof(pg_wchar_table) / sizeof(pg_wchar_tbl)) ?
|
|
|
|
((*pg_wchar_table[encoding].dsplen) ((const unsigned char *) mbstr)) :
|
|
|
|
((*pg_wchar_table[PG_SQL_ASCII].dsplen) ((const unsigned char *) mbstr)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify the first multibyte character of the given string.
|
|
|
|
* Return its byte length if good, -1 if bad. (See comments above for
|
|
|
|
* full details of the mbverify API.)
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
pg_encoding_verifymb(int encoding, const char *mbstr, int len)
|
|
|
|
{
|
|
|
|
Assert(PG_VALID_ENCODING(encoding));
|
|
|
|
|
|
|
|
return ((encoding >= 0 &&
|
|
|
|
encoding < sizeof(pg_wchar_table) / sizeof(pg_wchar_tbl)) ?
|
2006-10-04 02:30:14 +02:00
|
|
|
((*pg_wchar_table[encoding].mbverify) ((const unsigned char *) mbstr, len)) :
|
|
|
|
((*pg_wchar_table[PG_SQL_ASCII].mbverify) ((const unsigned char *) mbstr, len)));
|
2006-05-21 22:05:21 +02:00
|
|
|
}
|
2005-06-15 02:15:08 +02:00
|
|
|
|
2001-09-11 06:50:36 +02:00
|
|
|
/*
|
2006-05-21 22:05:21 +02:00
|
|
|
* fetch maximum length of a given encoding
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
pg_encoding_max_length(int encoding)
|
|
|
|
{
|
|
|
|
Assert(PG_VALID_ENCODING(encoding));
|
|
|
|
|
|
|
|
return pg_wchar_table[encoding].maxmblen;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef FRONTEND
|
|
|
|
|
|
|
|
/*
|
|
|
|
* fetch maximum length of the encoding for the current database
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
pg_database_encoding_max_length(void)
|
|
|
|
{
|
|
|
|
return pg_wchar_table[GetDatabaseEncoding()].maxmblen;
|
|
|
|
}
|
|
|
|
|
2011-10-29 20:22:20 +02:00
|
|
|
/*
|
|
|
|
* give the character incrementer for the encoding for the current database
|
|
|
|
*/
|
|
|
|
mbcharacter_incrementer
|
|
|
|
pg_database_encoding_character_incrementer(void)
|
|
|
|
{
|
|
|
|
switch (GetDatabaseEncoding())
|
|
|
|
{
|
|
|
|
case PG_UTF8:
|
|
|
|
return pg_utf8_increment;
|
|
|
|
|
|
|
|
case PG_EUC_JP:
|
|
|
|
return pg_eucjp_increment;
|
|
|
|
|
|
|
|
default:
|
|
|
|
return pg_generic_charinc;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
/*
|
|
|
|
* Verify mbstr to make sure that it is validly encoded in the current
|
|
|
|
* database encoding. Otherwise same as pg_verify_mbstr().
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
pg_verifymbstr(const char *mbstr, int len, bool noError)
|
|
|
|
{
|
2007-11-15 22:14:46 +01:00
|
|
|
return
|
2007-09-18 19:41:17 +02:00
|
|
|
pg_verify_mbstr_len(GetDatabaseEncoding(), mbstr, len, noError) >= 0;
|
2006-05-21 22:05:21 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2007-09-18 19:41:17 +02:00
|
|
|
* Verify mbstr to make sure that it is validly encoded in the specified
|
|
|
|
* encoding.
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
pg_verify_mbstr(int encoding, const char *mbstr, int len, bool noError)
|
|
|
|
{
|
|
|
|
return pg_verify_mbstr_len(encoding, mbstr, len, noError) >= 0;
|
|
|
|
}
|
|
|
|
|
2007-11-15 22:14:46 +01:00
|
|
|
/*
|
2006-05-21 22:05:21 +02:00
|
|
|
* Verify mbstr to make sure that it is validly encoded in the specified
|
|
|
|
* encoding.
|
|
|
|
*
|
|
|
|
* mbstr is not necessarily zero terminated; length of mbstr is
|
2003-07-27 06:53:12 +02:00
|
|
|
* specified by len.
|
|
|
|
*
|
2007-11-15 22:14:46 +01:00
|
|
|
* If OK, return length of string in the encoding.
|
2007-09-18 19:41:17 +02:00
|
|
|
* If a problem is found, return -1 when noError is
|
2003-07-27 06:53:12 +02:00
|
|
|
* true; when noError is false, ereport() a descriptive message.
|
2007-11-15 22:14:46 +01:00
|
|
|
*/
|
2007-09-18 19:41:17 +02:00
|
|
|
int
|
|
|
|
pg_verify_mbstr_len(int encoding, const char *mbstr, int len, bool noError)
|
2001-09-11 06:50:36 +02:00
|
|
|
{
|
2006-05-21 22:05:21 +02:00
|
|
|
mbverifier mbverify;
|
2007-11-15 22:14:46 +01:00
|
|
|
int mb_len;
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
Assert(PG_VALID_ENCODING(encoding));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In single-byte encodings, we need only reject nulls (\0).
|
|
|
|
*/
|
|
|
|
if (pg_encoding_max_length(encoding) <= 1)
|
|
|
|
{
|
|
|
|
const char *nullpos = memchr(mbstr, 0, len);
|
2001-09-11 06:50:36 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
if (nullpos == NULL)
|
2007-09-18 19:41:17 +02:00
|
|
|
return len;
|
2006-05-21 22:05:21 +02:00
|
|
|
if (noError)
|
2007-09-18 19:41:17 +02:00
|
|
|
return -1;
|
2006-05-21 22:05:21 +02:00
|
|
|
report_invalid_encoding(encoding, nullpos, 1);
|
|
|
|
}
|
2003-07-27 06:53:12 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
/* fetch function pointer just once */
|
|
|
|
mbverify = pg_wchar_table[encoding].mbverify;
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2007-09-18 19:41:17 +02:00
|
|
|
mb_len = 0;
|
2001-09-11 06:50:36 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
while (len > 0)
|
2001-09-11 06:50:36 +02:00
|
|
|
{
|
2006-05-21 22:05:21 +02:00
|
|
|
int l;
|
2005-10-15 04:49:52 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
/* fast path for ASCII-subset characters */
|
|
|
|
if (!IS_HIGHBIT_SET(*mbstr))
|
2005-09-24 19:53:28 +02:00
|
|
|
{
|
2006-05-21 22:05:21 +02:00
|
|
|
if (*mbstr != '\0')
|
2005-09-24 19:53:28 +02:00
|
|
|
{
|
2007-09-18 19:41:17 +02:00
|
|
|
mb_len++;
|
2006-05-21 22:05:21 +02:00
|
|
|
mbstr++;
|
|
|
|
len--;
|
|
|
|
continue;
|
2005-06-15 02:15:08 +02:00
|
|
|
}
|
2006-05-21 22:05:21 +02:00
|
|
|
if (noError)
|
2007-09-18 19:41:17 +02:00
|
|
|
return -1;
|
2006-05-21 22:05:21 +02:00
|
|
|
report_invalid_encoding(encoding, mbstr, len);
|
2005-10-15 04:49:52 +02:00
|
|
|
}
|
2003-07-27 06:53:12 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
l = (*mbverify) ((const unsigned char *) mbstr, len);
|
2003-07-27 06:53:12 +02:00
|
|
|
|
2006-05-21 22:05:21 +02:00
|
|
|
if (l < 0)
|
|
|
|
{
|
|
|
|
if (noError)
|
2007-09-18 19:41:17 +02:00
|
|
|
return -1;
|
2006-05-21 22:05:21 +02:00
|
|
|
report_invalid_encoding(encoding, mbstr, len);
|
2004-12-02 23:37:14 +01:00
|
|
|
}
|
2006-05-21 22:05:21 +02:00
|
|
|
|
2001-09-11 06:50:36 +02:00
|
|
|
mbstr += l;
|
2006-05-21 22:05:21 +02:00
|
|
|
len -= l;
|
2007-09-18 19:41:17 +02:00
|
|
|
mb_len++;
|
2001-09-11 06:50:36 +02:00
|
|
|
}
|
2007-09-18 19:41:17 +02:00
|
|
|
return mb_len;
|
2001-09-11 06:50:36 +02:00
|
|
|
}
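
/*
 * Illustrative sketch, not part of the original file: how a caller can walk
 * a buffer with a per-character verifier that follows the mbverify contract
 * (return the character's byte length, or -1 if it is invalid). Error
 * reporting is reduced to returning -1. The names sketch_verifier,
 * sketch_euc2_verifier() and sketch_verify_len() are hypothetical, and the
 * verifier implements only an assumed EUC-KR-like two-byte layout.
 */

```c
#include <assert.h>

typedef int (*sketch_verifier) (const unsigned char *s, int len);

/* Hypothetical verifier: ASCII, or two bytes each in 0xa1..0xfe. */
static int
sketch_euc2_verifier(const unsigned char *s, int len)
{
	/* the contract lets verifiers assume these two preconditions */
	assert(len > 0 && s[0] != '\0');

	if (s[0] < 0x80)
		return 1;
	if (len < 2)
		return -1;				/* character is truncated */
	if (s[0] < 0xa1 || s[0] > 0xfe || s[1] < 0xa1 || s[1] > 0xfe)
		return -1;
	return 2;
}

/* Return the number of characters in str, or -1 if it is invalid. */
static int
sketch_verify_len(sketch_verifier verify, const char *str, int len)
{
	const unsigned char *s = (const unsigned char *) str;
	int			nchars = 0;

	while (len > 0)
	{
		int			l;

		if (*s == '\0')
			return -1;			/* embedded NUL is never acceptable */

		l = verify(s, len);
		if (l < 0)
			return -1;

		s += l;
		len -= l;
		nchars++;
	}
	return nchars;
}
```

/*
 * The loop checks for NUL itself before calling the verifier, which is what
 * allows a verifier to assume *s != '\0' on entry.
 */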
|
2001-09-23 12:59:45 +02:00
|
|
|
|
2009-01-29 20:23:42 +01:00
|
|
|
/*
|
|
|
|
* check_encoding_conversion_args: check arguments of a conversion function
|
|
|
|
*
|
|
|
|
* "expected" arguments can be either an encoding ID or -1 to indicate that
|
|
|
|
* the caller will check whether it accepts the ID.
|
|
|
|
*
|
|
|
|
* Note: the errors here are not really user-facing, so elog instead of
|
|
|
|
* ereport seems sufficient. Also, we trust that the "expected" encoding
|
|
|
|
* arguments are valid encoding IDs, but we don't trust the actuals.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
check_encoding_conversion_args(int src_encoding,
|
|
|
|
int dest_encoding,
|
|
|
|
int len,
|
|
|
|
int expected_src_encoding,
|
|
|
|
int expected_dest_encoding)
|
|
|
|
{
|
|
|
|
if (!PG_VALID_ENCODING(src_encoding))
|
|
|
|
elog(ERROR, "invalid source encoding ID: %d", src_encoding);
|
|
|
|
if (src_encoding != expected_src_encoding && expected_src_encoding >= 0)
|
|
|
|
elog(ERROR, "expected source encoding \"%s\", but got \"%s\"",
|
|
|
|
pg_enc2name_tbl[expected_src_encoding].name,
|
|
|
|
pg_enc2name_tbl[src_encoding].name);
|
|
|
|
if (!PG_VALID_ENCODING(dest_encoding))
|
|
|
|
elog(ERROR, "invalid destination encoding ID: %d", dest_encoding);
|
|
|
|
if (dest_encoding != expected_dest_encoding && expected_dest_encoding >= 0)
|
|
|
|
elog(ERROR, "expected destination encoding \"%s\", but got \"%s\"",
|
|
|
|
pg_enc2name_tbl[expected_dest_encoding].name,
|
|
|
|
pg_enc2name_tbl[dest_encoding].name);
|
|
|
|
if (len < 0)
|
|
|
|
elog(ERROR, "encoding conversion length must not be negative");
|
|
|
|
}
|
|
|
|
|
2001-09-23 12:59:45 +02:00
|
|
|
/*
|
2006-05-21 22:05:21 +02:00
|
|
|
* report_invalid_encoding: complain about invalid multibyte character
|
|
|
|
*
|
|
|
|
* note: len is remaining length of string, not length of character;
|
|
|
|
* len must be greater than zero, as we always examine the first byte.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
report_invalid_encoding(int encoding, const char *mbstr, int len)
|
|
|
|
{
|
|
|
|
int l = pg_encoding_mblen(encoding, mbstr);
|
2011-09-05 22:36:06 +02:00
|
|
|
char buf[8 * 5 + 1];
|
2006-05-21 22:05:21 +02:00
|
|
|
char *p = buf;
|
|
|
|
int j,
|
|
|
|
jlimit;
|
|
|
|
|
|
|
|
jlimit = Min(l, len);
|
|
|
|
jlimit = Min(jlimit, 8); /* prevent buffer overrun */
|
|
|
|
|
|
|
|
for (j = 0; j < jlimit; j++)
|
2011-09-05 22:36:06 +02:00
|
|
|
{
|
|
|
|
p += sprintf(p, "0x%02x", (unsigned char) mbstr[j]);
|
|
|
|
if (j < jlimit - 1)
|
|
|
|
p += sprintf(p, " ");
|
|
|
|
}
|
2006-05-21 22:05:21 +02:00
|
|
|
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
|
2011-09-05 22:36:06 +02:00
|
|
|
errmsg("invalid byte sequence for encoding \"%s\": %s",
|
2006-05-21 22:05:21 +02:00
|
|
|
pg_enc2name_tbl[encoding].name,
|
2010-01-04 21:38:31 +01:00
|
|
|
buf)));
|
2006-05-21 22:05:21 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* report_untranslatable_char: complain about untranslatable character
|
|
|
|
*
|
|
|
|
* note: len is remaining length of string, not length of character;
|
|
|
|
* len must be greater than zero, as we always examine the first byte.
|
2001-09-23 12:59:45 +02:00
|
|
|
*/
|
2006-05-21 22:05:21 +02:00
|
|
|
void
|
|
|
|
report_untranslatable_char(int src_encoding, int dest_encoding,
|
|
|
|
const char *mbstr, int len)
|
2001-09-23 12:59:45 +02:00
|
|
|
{
|
2006-05-21 22:05:21 +02:00
|
|
|
int l = pg_encoding_mblen(src_encoding, mbstr);
|
2011-09-05 22:36:06 +02:00
|
|
|
char buf[8 * 5 + 1];
|
2006-05-21 22:05:21 +02:00
|
|
|
char *p = buf;
|
|
|
|
int j,
|
|
|
|
jlimit;
|
|
|
|
|
|
|
|
jlimit = Min(l, len);
|
|
|
|
jlimit = Min(jlimit, 8); /* prevent buffer overrun */
|
|
|
|
|
|
|
|
for (j = 0; j < jlimit; j++)
|
2011-09-05 22:36:06 +02:00
|
|
|
{
|
|
|
|
p += sprintf(p, "0x%02x", (unsigned char) mbstr[j]);
|
|
|
|
if (j < jlimit - 1)
|
|
|
|
p += sprintf(p, " ");
|
|
|
|
}
|
2006-05-21 22:05:21 +02:00
|
|
|
|
2009-03-02 22:18:43 +01:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
|
2011-09-05 22:36:06 +02:00
|
|
|
errmsg("character with byte sequence %s in encoding \"%s\" has no equivalent in encoding \"%s\"",
|
2009-06-11 16:49:15 +02:00
|
|
|
buf,
|
|
|
|
pg_enc2name_tbl[src_encoding].name,
|
|
|
|
pg_enc2name_tbl[dest_encoding].name)));
|
2001-09-23 12:59:45 +02:00
|
|
|
}
|
2001-10-28 07:26:15 +01:00
|
|
|
|
2001-09-11 06:50:36 +02:00
|
|
|
#endif
|