Commit Patrice's patches except:
> - corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
> characters (characters with values >= 0x10000, which are encoded on
> four bytes).
Also, update mb/expected/unicode.out. This is necessary since the
patches affect the result of queries using UTF-8.
---------------------------------------------------------------
Hi,
I should have sent the patch earlier, but got delayed by other stuff.
Anyway, here is the patch:
- most of the functionality is only activated when MULTIBYTE is
defined,
- check valid UTF-8 characters, client-side only yet, and only on
output, you still can send invalid UTF-8 to the server (so, it's
only partly compliant to Unicode 3.1, but that's better than
nothing).
- formats with the correct number of columns (that's why I made it in
the first place after all), but only for UNICODE. However, the code
allows to plug-in routines for other encodings, as Tatsuo did for
the other multibyte functions.
- corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
characters (characters with values >= 0x10000, which are encoded on
four bytes).
- doesn't depend on the locale capabilities of the glibc (useful for
remote telnet).
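As an illustration of the four-byte case mentioned in the list above, here is a minimal sketch of encoding a single code point as UTF-8. The helper name `sketch_utf8_encode` is hypothetical and is not part of the patch; it only shows why values >= 0x10000 need four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: encode one code point as
 * UTF-8 into out[] and return the number of bytes written. Returns 0 for
 * surrogates (0xD800..0xDFFF) and for values above 0x10FFFF.
 */
static size_t
sketch_utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80)
	{
		/* one byte: 0xxxxxxx */
		out[0] = (unsigned char) cp;
		return 1;
	}
	if (cp < 0x800)
	{
		/* two bytes: 110xxxxx 10xxxxxx */
		out[0] = (unsigned char) (0xc0 | (cp >> 6));
		out[1] = (unsigned char) (0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000)
	{
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0;			/* surrogates are not valid scalar values */
		/* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char) (0xe0 | (cp >> 12));
		out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
		out[2] = (unsigned char) (0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp <= 0x10ffff)
	{
		/* four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char) (0xf0 | (cp >> 18));
		out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
		out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
		out[3] = (unsigned char) (0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;					/* beyond the Unicode range */
}
```

With this sketch, U+20AC (the euro sign) comes out as the three bytes E2 82 AC, and U+10000 as the four bytes F0 90 80 80.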
I would like somebody to check it closely, as it is my first patch to
pgsql. Also, I created dummy .orig files, so that the two files I
created are included, I hope that's the right way.
Now, a lot of functionality is NOT included here, but I will keep that
for 7.3 :) That includes all string checking on the server side (which
will have to be a bit more optimised ;) ), and the input checking on
the client side for UTF-8, though that should not be difficult. It's
just to send the strings through mbvalidate() before sending them to
the server. Strong checking on UTF-8 strings is mandatory to be
compliant with Unicode 3.1+ .
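The kind of check described above could look roughly like the following sketch. It is not the patch's mbvalidate(); the name `sketch_utf8_valid` and its signature are assumptions made for illustration. It rejects bad lead bytes, truncated sequences, bad continuation bytes, overlong forms, surrogates, and values above 0x10FFFF, but it does not reject the noncharacter ranges that a stricter validator might also refuse.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical validator, not the patch's mbvalidate(): returns true if
 * the len bytes at s form structurally valid UTF-8.
 */
static bool
sketch_utf8_valid(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	while (i < len)
	{
		unsigned char b = s[i];
		size_t		n;			/* expected sequence length */
		size_t		k;
		uint32_t	cp;			/* decoded code point */
		uint32_t	min;		/* smallest value allowed for this length */

		if (b < 0x80)
		{
			i++;				/* plain ASCII */
			continue;
		}
		else if ((b & 0xe0) == 0xc0)
		{
			n = 2; cp = b & 0x1f; min = 0x80;
		}
		else if ((b & 0xf0) == 0xe0)
		{
			n = 3; cp = b & 0x0f; min = 0x800;
		}
		else if ((b & 0xf8) == 0xf0)
		{
			n = 4; cp = b & 0x07; min = 0x10000;
		}
		else
			return false;		/* stray continuation or invalid lead byte */

		if (len - i < n)
			return false;		/* sequence truncated at end of input */

		for (k = 1; k < n; k++)
		{
			if ((s[i + k] & 0xc0) != 0x80)
				return false;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3f);
		}

		/* reject overlong forms, out-of-range values, and surrogates */
		if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return false;

		i += n;
	}
	return true;
}
```

A client could call such a function on each string before sending it and refuse to send input for which it returns false.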
Do I have time to look for a patch to include iso-8859-15 for 7.2 ?
The euro is coming on 1 January 2002 (before 7.3 !) and over 280
million people in Europe will need the euro sign, and only iso-8859-15
and iso-8859-16 have it (and unfortunately, I don't think all Unices
will switch to Unicode in the meantime)....
err... yes, I know that this is not every single person in Europe that
uses PostgreSql, so it's not exactly 280m, but it's just a matter of
time ! ;)
I'll come back (on pgsql-hackers) later to ask a few questions
regarding the full unicode support (normalisation, collation,
regexes,...) on the server side :)
Here is the patch !
Patrice.
--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
-- What would _you_ call the creation of the universe ?
-- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
2001-10-15 03:25:10 +02:00
/*
 * psql - the PostgreSQL interactive terminal
 *
 * Copyright (c) 2000-2010, PostgreSQL Global Development Group
 *
 * $PostgreSQL: pgsql/src/bin/psql/mbprint.c,v 1.39 2010/08/16 00:06:18 tgl Exp $
 *
 * XXX this file does not really belong in psql/. Perhaps move to libpq?
 * It also seems that the mbvalidate function is redundant with existing
 * functionality.
 */

#include "postgres_fe.h"
#include "mbprint.h"
#include "libpq-fe.h"
#ifndef PGSCRIPTS
#include "settings.h"
#endif

/*
 * To avoid version-skew problems, this file must not use declarations
 * from pg_wchar.h: the encoding IDs we are dealing with are determined
 * by the libpq.so we are linked with, and that might not match the
 * numbers we see at compile time. (If this file were inside libpq,
 * the problem would go away...)
 *
 * Hence, we have our own definition of pg_wchar, and we get the values
 * of any needed encoding IDs on-the-fly.
 */

typedef unsigned int pg_wchar;

static int
pg_get_utf8_id(void)
{
	static int	utf8_id = -1;

	if (utf8_id < 0)
		utf8_id = pg_char_to_encoding("utf8");
	return utf8_id;
}

#define PG_UTF8		pg_get_utf8_id()


static pg_wchar
utf2ucs(const unsigned char *c)
{
	/*
	 * one char version of pg_utf2wchar_with_len. no control here, c must
	 * point to a large enough string
	 */
	if ((*c & 0x80) == 0)
		return (pg_wchar) c[0];
	else if ((*c & 0xe0) == 0xc0)
		return (pg_wchar) (((c[0] & 0x1f) << 6) |
						   (c[1] & 0x3f));
	else if ((*c & 0xf0) == 0xe0)
		return (pg_wchar) (((c[0] & 0x0f) << 12) |
						   ((c[1] & 0x3f) << 6) |
						   (c[2] & 0x3f));
	else if ((*c & 0xf8) == 0xf0)
		return (pg_wchar) (((c[0] & 0x07) << 18) |
						   ((c[1] & 0x3f) << 12) |
						   ((c[2] & 0x3f) << 6) |
						   (c[3] & 0x3f));
	else
		/* that is an invalid code on purpose */
		return 0xffffffff;
}
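
To make the bit arithmetic of the decoder above concrete, here is a standalone sketch that applies the same masks and shifts to a few known byte sequences. `sketch_decode` is a hypothetical stand-in written for this example (the original function is file-local); like the original, it performs no validation and assumes the buffer is long enough for the sequence announced by the lead byte.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the decoder above: same masks and shifts,
 * returning 0xffffffff when the first byte is not a valid lead byte.
 */
static uint32_t
sketch_decode(const unsigned char *c)
{
	if ((c[0] & 0x80) == 0)
		return c[0];			/* 0xxxxxxx: 7 payload bits */
	if ((c[0] & 0xe0) == 0xc0)	/* 110xxxxx: 5 + 6 payload bits */
		return ((uint32_t) (c[0] & 0x1f) << 6) |
			(c[1] & 0x3f);
	if ((c[0] & 0xf0) == 0xe0)	/* 1110xxxx: 4 + 6 + 6 payload bits */
		return ((uint32_t) (c[0] & 0x0f) << 12) |
			((uint32_t) (c[1] & 0x3f) << 6) |
			(c[2] & 0x3f);
	if ((c[0] & 0xf8) == 0xf0)	/* 11110xxx: 3 + 6 + 6 + 6 payload bits */
		return ((uint32_t) (c[0] & 0x07) << 18) |
			((uint32_t) (c[1] & 0x3f) << 12) |
			((uint32_t) (c[2] & 0x3f) << 6) |
			(c[3] & 0x3f);
	return 0xffffffff;			/* deliberately not a valid code point */
}
```

For example, the bytes E2 82 AC decode as (0x2 << 12) | (0x02 << 6) | 0x2c = 0x20AC, and F0 90 8D 88 decode as (0x0 << 18) | (0x10 << 12) | (0x0d << 6) | 0x08 = 0x10348. A lone continuation byte such as 0x80 matches none of the lead-byte patterns and yields the 0xffffffff sentinel.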


/*
 * Unicode 3.1 compliant validation: for each category, it checks the
 * combination of each byte to make sure it maps to a valid range. It also
 * returns -1 for the following UCS values:
 *		ucs > 0x10ffff
 *		ucs & 0xfffe = 0xfffe
 *		0xfdd0 < ucs < 0xfdef
 *		ucs & 0xdb00 = 0xd800 (surrogates)
 */
static int
utf_charcheck(const unsigned char *c)
{
	if ((*c & 0x80) == 0)
		return 1;
	else if ((*c & 0xe0) == 0xc0)
	{
        /* two-byte char */
        if (((c[1] & 0xc0) == 0x80) && ((c[0] & 0x1f) > 0x01))
            return 2;
        return -1;
    }
    else if ((*c & 0xf0) == 0xe0)
    {
        /* three-byte char */
        if (((c[1] & 0xc0) == 0x80) &&
            (((c[0] & 0x0f) != 0x00) || ((c[1] & 0x20) == 0x20)) &&
            ((c[2] & 0xc0) == 0x80))
        {
            int z = c[0] & 0x0f;
            /* low 12 bits come from the two continuation bytes c[1] and c[2] */
            int yx = ((c[1] & 0x3f) << 6) | (c[2] & 0x3f);
            int lx = yx & 0x7f;

            /* check 0xfffe/0xffff, 0xfdd0..0xfedf range, surrogates */
            if (((z == 0x0f) &&
                 (((yx & 0xffe) == 0xffe) ||
                  (((yx & 0xf80) == 0xd80) && (lx >= 0x30) && (lx <= 0x4f)))) ||
                ((z == 0x0d) && ((yx & 0xb00) == 0x800)))
                return -1;
            return 3;
        }
        return -1;
    }
    else if ((*c & 0xf8) == 0xf0)
    {
        int u = ((c[0] & 0x07) << 2) | ((c[1] & 0x30) >> 4);

        /* four-byte char */
        if (((c[1] & 0xc0) == 0x80) &&
            (u > 0x00) && (u <= 0x10) &&
            ((c[2] & 0xc0) == 0x80) && ((c[3] & 0xc0) == 0x80))
        {
            /* test for 0xzzzzfffe/0xzzzzffff */
            if (((c[1] & 0x0f) == 0x0f) && ((c[2] & 0x3f) == 0x3f) &&
                ((c[3] & 0x3e) == 0x3e))
                return -1;
            return 4;
        }
        return -1;
    }
    return -1;
}
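
The checks above follow the UTF-8 bit layout, where code points at or above 0x10000 occupy four bytes. As a rough illustration only, the sketch below encodes a code point into UTF-8 so the byte patterns tested by the checker can be seen concretely. The function name `sketch_utf8_encode` is hypothetical and is not part of this file or of PostgreSQL; it is a minimal sketch assuming the standard UTF-8 layout and the U+0000..U+10FFFF range.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the file above): encode one code point
 * as UTF-8 into out[], which must have room for 4 bytes.
 * Returns the number of bytes written, or -1 for surrogates and for
 * values above 0x10FFFF.
 */
static int
sketch_utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80)
    {
        /* one byte: 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        /* surrogates are not valid scalar values */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return -1;
        /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        /* four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return -1;
}
```

For example, U+10000 encodes to the bytes F0 90 80 80. In the four-byte branch above, that gives u = 1, which satisfies the `(u > 0x00) && (u <= 0x10)` test. The euro sign U+20AC encodes to the three bytes E2 82 AC.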

static void
mb_utf_validate(unsigned char *pwcs)
{
    unsigned char *p = pwcs;

    while (*pwcs)
    {
        int len;

        if ((len = utf_charcheck(pwcs)) > 0)
        {
            if (p != pwcs)
            {
                int i;

                for (i = 0; i < len; i++)
                    *p++ = *pwcs++;
            }
            else
            {
                pwcs += len;
                p += len;
			}
		}
		else
			/* we skip the char */
			pwcs++;
	}
	if (p != pwcs)
		*p = '\0';
}

/*
 * public functions : wcswidth and mbvalidate
 */

/*
 * pg_wcswidth is the dumb width function. It assumes that everything will
 * only appear on one line. OTOH it is easier to use if this applies to you.
 */
int
pg_wcswidth(const unsigned char *pwcs, size_t len, int encoding)
{
	int			width = 0;

	while (len > 0)
	{
		int			chlen,
					chwidth;

		chlen = PQmblen((const char *) pwcs, encoding);
		if (len < (size_t) chlen)
			break;				/* Invalid string */

		chwidth = PQdsplen((const char *) pwcs, encoding);

		if (chwidth > 0)
			width += chwidth;
		pwcs += chlen;
		len -= chlen;			/* consume the bytes, else we read past the end */
	}
	return width;
}
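
The loop in pg_wcswidth walks a byte buffer one character at a time and must stop when a character would run past the remaining length. A minimal, self-contained sketch of that bounded walk follows. It is not the psql code: `sketch_utf8_len` and `sketch_char_count` are hypothetical names, the length is taken from the UTF-8 lead byte instead of `PQmblen`, and every character is counted as one column, which real display-width logic does not do.

```c
#include <stddef.h>

/*
 * Hypothetical helper: byte length of a UTF-8 sequence, judged from its
 * lead byte only. Bytes that cannot start a sequence are treated as
 * length 1 so the caller always makes progress.
 */
static int
sketch_utf8_len(unsigned char lead)
{
	if ((lead & 0x80) == 0x00)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
	return 1;
}

/*
 * Count the characters in the first len bytes of s, stopping at a
 * truncated sequence. Assumption: one column per character.
 */
static int
sketch_char_count(const unsigned char *s, size_t len)
{
	int			count = 0;

	while (len > 0)
	{
		int			chlen = sketch_utf8_len(*s);

		if ((size_t) chlen > len)
			break;				/* sequence runs past the buffer */
		count++;
		s += chlen;
		len -= chlen;			/* shrink the remaining length in step */
	}
	return count;
}
```

The pointer and the remaining length are advanced together, so the truncation check always compares against the bytes that are really left.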

/*
 * pg_wcssize takes the given string in the given encoding and returns three
 * values:
 *	  result_width: Width in display characters of the longest line in string
 *	  result_height: Number of lines in display output
 *	  result_format_size: Number of bytes required to store formatted
 *		representation of string
 *
 * This MUST be kept in sync with pg_wcsformat!
 */
void
pg_wcssize(unsigned char *pwcs, size_t len, int encoding,
		   int *result_width, int *result_height, int *result_format_size)
{
	int			w,
				chlen = 0,
				linewidth = 0;
	int			width = 0;
	int			height = 1;
	int			format_size = 0;

	for (; *pwcs && len > 0; pwcs += chlen)
	{
		chlen = PQmblen((char *) pwcs, encoding);
		if (len < (size_t) chlen)
			break;
		w = PQdsplen((char *) pwcs, encoding);

		if (chlen == 1)			/* single-byte char */
		{
			if (*pwcs == '\n')	/* Newline */
			{
				if (linewidth > width)
					width = linewidth;
				linewidth = 0;
				height += 1;
				format_size += 1;		/* For NUL char */
			}
			else if (*pwcs == '\r')		/* Carriage return */
			{
				linewidth += 2;
				format_size += 2;
			}
			else if (*pwcs == '\t')		/* Tab */
			{
				do
				{
					linewidth++;
					format_size++;
				} while (linewidth % 8 != 0);
			}
			else if (w < 0)		/* Other control char */
			{
				linewidth += 4;
				format_size += 4;
			}
			else				/* Output it as-is */
			{
				linewidth += w;
				format_size += 1;
			}
		}
		else if (w < 0)			/* Non-ascii control char */
		{
			linewidth += 6;		/* \u0000 */
			format_size += 6;
		}
		else					/* All other chars */
		{
			linewidth += w;
			format_size += chlen;
		}
		len -= chlen;
	}
	if (linewidth > width)
		width = linewidth;
	format_size += 1;			/* For NUL char */

	/* Set results */
	if (result_width)
		*result_width = width;
	if (result_height)
		*result_height = height;
	if (result_format_size)
		*result_format_size = format_size;
}
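
The sizing rules in pg_wcssize (newline ends a line, carriage return costs two columns, tab pads to the next multiple of eight, other control characters cost four) can be tried in isolation. The following is a rough ASCII-only sketch under stated assumptions: `sketch_size` is a hypothetical name, it does not call `PQmblen` or `PQdsplen`, it treats every byte as one character, and unlike the real function it requires all three output pointers to be non-NULL.

```c
#include <stddef.h>

/*
 * Hypothetical, ASCII-only sketch of the sizing pass. Bytes below 0x20
 * (other than \n, \r, \t) and 0x7F are assumed to be "control chars".
 */
static void
sketch_size(const char *s, int *width, int *height, int *format_size)
{
	int			linewidth = 0;

	*width = 0;
	*height = 1;
	*format_size = 0;

	for (; *s; s++)
	{
		unsigned char c = (unsigned char) *s;

		if (c == '\n')
		{
			if (linewidth > *width)
				*width = linewidth;
			linewidth = 0;
			*height += 1;
			*format_size += 1;	/* NUL that ends this line */
		}
		else if (c == '\r')
		{
			linewidth += 2;		/* displayed as \r */
			*format_size += 2;
		}
		else if (c == '\t')
		{
			do
			{
				linewidth++;
				(*format_size)++;
			} while (linewidth % 8 != 0);
		}
		else if (c < 0x20 || c == 0x7F)
		{
			linewidth += 4;		/* displayed as \xNN */
			*format_size += 4;
		}
		else
		{
			linewidth += 1;
			*format_size += 1;
		}
	}
	if (linewidth > *width)
		*width = linewidth;
	*format_size += 1;			/* final NUL */
}
```

For the input "ab", tab, "c", newline, "xyz", the tab pads the first line from column 2 to column 8, so the first line is 9 columns wide and the total buffer need is 14 bytes including both terminators.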

/*
 * Format a string into one or more "struct lineptr" lines.
 * lines[i].ptr == NULL indicates the end of the array.
 *
 * This MUST be kept in sync with pg_wcssize!
 */
void
pg_wcsformat(unsigned char *pwcs, size_t len, int encoding,
			 struct lineptr * lines, int count)
{
	int			w,
				chlen = 0;
	int			linewidth = 0;
	unsigned char *ptr = lines->ptr;	/* Pointer to data area */

	for (; *pwcs && len > 0; pwcs += chlen)
	{
		chlen = PQmblen((char *) pwcs, encoding);
		if (len < (size_t) chlen)
			break;
		w = PQdsplen((char *) pwcs, encoding);

		if (chlen == 1)			/* single-byte char */
		{
			if (*pwcs == '\n')	/* Newline */
			{
				*ptr++ = '\0';
				lines->width = linewidth;
				linewidth = 0;
				lines++;
				count--;
				if (count <= 0)
					exit(1);	/* Screwup */

				/* make next line point to remaining memory */
				lines->ptr = ptr;
			}
			else if (*pwcs == '\r')		/* Carriage return */
			{
				strcpy((char *) ptr, "\\r");
				linewidth += 2;
				ptr += 2;
			}
			else if (*pwcs == '\t')		/* Tab */
			{
				do
				{
					*ptr++ = ' ';
					linewidth++;
				} while (linewidth % 8 != 0);
			}
			else if (w < 0)		/* Other control char */
			{
				sprintf((char *) ptr, "\\x%02X", *pwcs);
				linewidth += 4;
				ptr += 4;
			}
			else				/* Output it as-is */
			{
				linewidth += w;
				*ptr++ = *pwcs;
			}
		}
		else if (w < 0)			/* Non-ascii control char */
		{
			if (encoding == PG_UTF8)
				sprintf((char *) ptr, "\\u%04X", utf2ucs(pwcs));
			else
			{
				/*
				 * This case cannot happen in the current code because only
				 * UTF-8 signals multibyte control characters. But we may need
				 * to support it at some stage
				 */
				sprintf((char *) ptr, "\\u????");
			}
			ptr += 6;
			linewidth += 6;
		}
		else					/* All other chars */
		{
			int			i;

			for (i = 0; i < chlen; i++)
				*ptr++ = pwcs[i];
			linewidth += w;
		}
		len -= chlen;
	}
	lines->width = linewidth;
	*ptr++ = '\0';				/* Terminate formatted string */

	if (count <= 0)
		exit(1);				/* Screwup */

	(lines + 1)->ptr = NULL;	/* terminate line array */
}
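
pg_wcsformat writes into a buffer that the caller is expected to have sized beforehand, and it uses unchecked `strcpy` and `sprintf`. As an illustration only, here is a simplified, ASCII-only sketch of the same escaping rules with explicit bounds checks. `sketch_format_line` is a hypothetical name, it formats just the first line of its input, and it does not use `struct lineptr`, `PQmblen`, or `PQdsplen`.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical sketch: write the display form of the first line of src
 * into dst (capacity cap bytes, including the terminating NUL).
 * Returns the display width, or -1 if dst is too small.
 */
static int
sketch_format_line(const char *src, char *dst, size_t cap)
{
	size_t		used = 0;
	int			linewidth = 0;

	for (; *src && *src != '\n'; src++)
	{
		unsigned char c = (unsigned char) *src;

		if (c == '\t')
		{
			/* pad with spaces up to the next multiple of eight */
			do
			{
				if (used + 1 >= cap)
					return -1;
				dst[used++] = ' ';
				linewidth++;
			} while (linewidth % 8 != 0);
		}
		else if (c == '\r')
		{
			if (used + 2 >= cap)
				return -1;
			dst[used++] = '\\';
			dst[used++] = 'r';
			linewidth += 2;
		}
		else if (c < 0x20 || c == 0x7F)
		{
			/* four visible bytes plus snprintf's NUL must fit */
			if (used + 4 >= cap)
				return -1;
			snprintf(dst + used, cap - used, "\\x%02X", c);
			used += 4;
			linewidth += 4;
		}
		else
		{
			if (used + 1 >= cap)
				return -1;
			dst[used++] = (char) c;
			linewidth++;
		}
	}
	if (used >= cap)
		return -1;
	dst[used] = '\0';
	return linewidth;
}
```

Each branch checks that the bytes it is about to write still leave room for the terminator, which is the guarantee the real code obtains by calling pg_wcssize first.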

unsigned char *
mbvalidate(unsigned char *pwcs, int encoding)
{
	if (encoding == PG_UTF8)
		mb_utf_validate((unsigned char *) pwcs);
	else
	{
		/*
		 * other encodings needing validation should add their own routines
		 * here
		 */
	}

	return pwcs;
}
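
mbvalidate cleans a string in place: valid sequences are kept, invalid bytes are skipped, and the result is re-terminated. A minimal sketch of that in-place compaction follows, under loud assumptions: `sketch_utf8_check` and `sketch_validate` are hypothetical names, and the check only looks at the lead byte and the continuation bytes. It does not reject overlong forms or surrogates, so it is weaker than a conforming UTF-8 validator.

```c
#include <stddef.h>

/*
 * Hypothetical check: length of the UTF-8 sequence starting at s, or -1
 * if the lead byte is invalid or a continuation byte is missing.
 */
static int
sketch_utf8_check(const unsigned char *s)
{
	int			len,
				i;

	if (*s < 0x80)
		return 1;
	else if ((*s & 0xE0) == 0xC0)
		len = 2;
	else if ((*s & 0xF0) == 0xE0)
		len = 3;
	else if ((*s & 0xF8) == 0xF0)
		len = 4;
	else
		return -1;

	for (i = 1; i < len; i++)
		if ((s[i] & 0xC0) != 0x80)
			return -1;			/* also stops at the terminating NUL */
	return len;
}

/*
 * Drop invalid bytes from the NUL-terminated string s, in place.
 * dst never gets ahead of src, so the copy cannot clobber unread input.
 */
static unsigned char *
sketch_validate(unsigned char *s)
{
	unsigned char *src = s;
	unsigned char *dst = s;

	while (*src)
	{
		int			len = sketch_utf8_check(src);

		if (len > 0)
		{
			while (len-- > 0)
				*dst++ = *src++;
		}
		else
			src++;				/* skip the invalid byte */
	}
	*dst = '\0';
	return s;
}
```

Because the output is never longer than the input, no second buffer is needed, which is why the function can return the pointer it was given.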