encoding conversion functions. These are not can't-happen cases because
it's possible to create a conversion with the wrong conversion function
for the specified encoding pair. That would lead to an Assert crash in
an Assert-enabled build, or incorrect conversion otherwise, neither of
which is desirable. This would be a DOS issue if production databases
were customarily built with asserts enabled, but fortunately that's not so.
Per an observation by Heikki.
Back-patch to all supported branches.
except the caller can specify the encoding to work in; this will be needed
for pg_stat_statements. In passing, do some marginal efficiency hacking
and clean up some comments. Also, prevent the single-byte-encoding code
path from fetching one byte past the stated length of the string (this
last is a bug that might need to be back-patched at some point).
recursion when we are unable to convert a localized error message to the
client's encoding. We've been over this ground before, but as reported by
Ibrar Ahmed, it still didn't work in the case of conversion failures for
the conversion-failure message itself :-(. Fix by installing a "circuit
breaker" that disables attempts to localize this message once we get into
recursion trouble.
Patch all supported branches, because it is in fact broken in all of them;
though I had to add some missing translations to the older branches in
order to expose the failure in the particular test case I was using.
clear to me why I'd not seen this message before --- on F-9 it seems to
only happen if Asserts are disabled, which ought to be irrelevant.
Maybe that affects a decision whether to inline get_ten(), which would
be needed to expose the warning condition to the compiler? Anyway,
the fix is clear.
This is required on Windows due to the special locale
handling for UTF8 that doesn't change the full environment.
Fixes crash with translated error messages per bugs 4180
and 4196.
Tom Lane
going through DatumGetPointer or some other "official" conversion macro.
Not actually a bug, since Datum the same size as pointer is the only
supported case at the moment, but good cleanup for the future.
Gavin Sherry
modules are built. Foremost, it creates a solid distinction between these two
types of targets based on what had already been implemented and duplicated in
ad hoc ways before. Specifically,
- Dynamically loadable modules no longer get a soname. The numbers previously
set in the makefiles were dummy numbers anyway, and the presence of a soname
upset a few packaging tools, so it is nicer not to have one.
- The cumbersome detour taken on installation (build a libfoo.so.0.0.0 and
then override the rule to install foo.so instead) is removed.
- Lots of duplicated code simplified.
ISO_8859-5 <-> MULE_INTERNAL conversion tables.
This was discovered when trying to convert a string containing those characters
from ISO_8859-5 to Windows-1251, because we use MULE_INTERNAL/KOI8R as an
intermediate encoding between those two.
While the missing "Yo" was just an omission in the conversion tables, there are
a few other characters like the "Numero" sign ("No" as a single character) that
exists in all the other cyrillic encodings (win1251, ISO_8859-5 and cp866), but
not in KOI8R. Added comments about that.
Patch by Sergey Burladyan. Back-patch to 7.4.
errors in any commands, including in various clean targets that have so far
been handled inconsistently. make -i is available to ignore all errors in
a consistent and official way.
renumbering of encoding IDs done between 8.2 and 8.3 turns out to break 8.2
initdb and psql if they are run with an 8.3beta1 libpq.so. For the moment
we can rearrange the order of enum pg_enc to keep the same number for
everything except PG_JOHAB, which isn't a problem since there are no direct
references to it in the 8.2 programs anyway. (This does force initdb
unfortunately.)
Going forward, we want to fix things so that encoding IDs can be changed
without an ABI break, and this commit includes the changes needed to allow
libpq's encoding IDs to be treated as fully independent of the backend's.
The main issue is that libpq clients should not include pg_wchar.h or
otherwise assume they know the specific values of libpq's encoding IDs,
since they might encounter version skew between pg_wchar.h and the libpq.so
they are using. To fix, have libpq officially export functions needed for
encoding name<=>ID conversion and validity checking; it was doing this
anyway unofficially.
It's still the case that we can't renumber backend encoding IDs until the
next bump in libpq's major version number, since doing so will break the
8.2-era client programs. However the code is now prepared to avoid this
type of problem in future.
Note that initdb is no longer a libpq client: we just pull in the two
source files we need directly. The patch also fixes a few places that
were being sloppy about checking for an unrecognized encoding name.
database via builtin functions, as recently discussed on -hackers.
chr() now returns a character in the database encoding. For UTF8 encoded databases
the argument is treated as a Unicode code point. For other multi-byte encodings
the argument must designate a strict ascii character, or an error is raised,
as is also the case if the argument is 0.
ascii() is adjusted so that it remains the inverse of chr().
The two argument form of convert() is gone, and the three argument form now
takes a bytea first argument and returns a bytea. To cover this loss three new
functions are introduced:
. convert_from(bytea, name) returns text - converts the first argument from the
named encoding to the database encoding
. convert_to(text, name) returns bytea - converts the first argument from the
database encoding to the named encoding
. length(bytea, name) returns int - gives the length of the first argument in
characters in the named encoding
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names. Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught. In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
page about the maximum UTF8 sequence length we support (4 bytes since 8.1,
3 before that). pg_utf2wchar_with_len never got updated to support 4-byte
characters at all, and in any case had a buffer-overrun risk in that it
could produce multiple pg_wchars from what mblen claims to be just one UTF8
character. The only reason we don't have a major security hole is that most
callers allocate worst-case output buffers; the sole exception in released
versions appears to be pre-8.2 iwchareq() (ie, ILIKE), which can be crashed
due to zeroing out its return address --- but AFAICS that can't be exploited
for anything more than a crash, due to inability to control what gets written
there. Per report from James Russell and Michael Fuhr.
Pre-8.1 the risk is much less, but I still think pg_utf2wchar_with_len's
behavior given an incomplete final character risks buffer overrun, so
back-patch that logic change anyway.
This patch also makes sure that UTF8 sequences exceeding the supported
length (whichever it is) are consistently treated as error cases, rather
than being treated like a valid shorter sequence in some places.
bletcherous and unsafe manipulation of global encoding setting.
Clean up libxml reporting mechanism a bit (it still looks like a
dangling-pointer crash waiting to happen, though, not to mention
being far less than sane from a localization standpoint).
o remove many WIN32_CLIENT_ONLY defines
o add WIN32_ONLY_COMPILER define
o add 3rd argument to open() for portability
o add include/port/win32_msvc directory for
system includes
Magnus Hagander
characters in all cases. Formerly we mostly just threw warnings for invalid
input, and failed to detect it at all if no encoding conversion was required.
The tighter check is needed to defend against SQL-injection attacks as per
CVE-2006-2313 (further details will be published after release). Embedded
zero (null) bytes will be rejected as well. The checks are applied during
input to the backend (receipt from client or COPY IN), so it no longer seems
necessary to check in textin() and related routines; any string arriving at
those functions will already have been validated. Conversion failure
reporting (for characters with no equivalent in the destination encoding)
has been cleaned up and made consistent while at it.
Also, fix a few longstanding errors in little-used encoding conversion
routines: win1251_to_iso, win866_to_iso, euc_tw_to_big5, euc_tw_to_mic,
mic_to_euc_tw were all broken to varying extents.
Patches by Tatsuo Ishii and Tom Lane. Thanks to Akio Ishida and Yasuo Ohgaki
for identifying the security issues.
the script is not executable as UCS_to_most.pl is in CVS. It also won't
pick up any custom setting of the perl version/location to use. This
patch calls perl scripts like $(PERL) $(srcdir)/script.pl.
Kris Jurka
up a bunch of the support utilities.
In src/backend/utils/mb/Unicode remove nearly duplicate copies of the
UCS_to_XXX perl script and replace with one version to handle all generic
files. Update the Makefile so that it knows about all the map files.
This produces a slight difference in some of the map files, using a
uniform naming convention and not mapping the null character.
In src/backend/utils/mb/conversion_procs create a master utf8<->win
codepage function like the ISO 8859 versions instead of having a separate
handler for each conversion.
There is an externally visible change in the name of the win1258 to utf8
conversion. According to the documentation notes, it was named
incorrectly and this changes it to a standard name.
Running the Unicode mapping perl scripts has shown some additional mapping
changes in koi8r and iso8859-7.
id (CVE-2006-0553). Also fix related bug in SET SESSION AUTHORIZATION that
allows unprivileged users to crash the server, if it has been compiled with
Asserts enabled. The escalation-of-privilege risk exists only in 8.1.0-8.1.2.
However, the Assert-crash risk exists in all releases back to 7.3.
Thanks to Akio Ishida for reporting this problem.
If the second output column value is 'a\nb', the 'b' should appear
in the second display column, rather than the first column as it
does now.
Change libpq's PQdsplen() to return more useful values.
> Note: this changes the PQdsplen function, it can now return zero or
> minus one which was not possible before. It doesn't appear anyone is
> actually using the functions other than psql but it is a change. The
> functions are not actually documentated anywhere so it's not like we're
> breaking a defined interface. The new semantics follow the Unicode
> standard.
BACKWARD COMPATIBLE CHANGE.
The only user-visible change I saw in the regression tests is that a
SELECT * on a table where all the columns have been dropped doesn't
return a blank line like before. This seems like a step forward.
Martijn van Oosterhout
fmgr_info(), in the TopMemoryContext. I couldn't see that the code
actually leaked, but in general I think it's fragile to assume that
pfree'ing an FmgrInfo along with its fn_extra field is enough to
reclaim all the resources allocated by fmgr_info(). I changed the
code to do its allocations in a new child context of
TopMemoryContext, MbProcContext. When we want to release the
allocations we can just reset the context, which is cleaner.
rather than "return expr;" -- the latter style is used in most of the
tree. I kept the parentheses when they were necessary or useful because
the return expression was complex.
#define HIGHBIT (0x80)
#define IS_HIGHBIT_SET(ch) ((unsigned char)(ch) & HIGHBIT)
and removed CSIGNBIT and mapped it uses to HIGHBIT. I have also added
uses for IS_HIGHBIT_SET where appropriate. This change is
purely for code clarity.
See:
Subject: [HACKERS] bugs with certain Asian multibyte charsets
From: Tatsuo Ishii <ishii@sraoss.co.jp>
To: pgsql-hackers@postgresql.org
Date: Sat, 24 Dec 2005 18:25:33 +0900 (JST)
for more details/
Also make the code more robust by searching for target encoding
in the internal charset map.
Problem reported by Sagi Bashari on 2005/12/21.
See "[BUGS] BUG #2120: Crash when doing UTF8<->ISO_8859_8 encoding conversion"
on pgsql-bugs list for more details.
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
optional arguments as text input functions, ie, typioparam OID and
atttypmod. Make all the datatypes that use typmod enforce it the same
way in typreceive as they do in typinput. This fixes a problem with
failure to enforce length restrictions during COPY FROM BINARY.
output area as INTERNAL not CSTRING. This is to prevent people from
calling the functions by hand. This is a permanent solution for the
back branches but I hope it is just a stopgap for HEAD.
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
- remove another senseless "extern" keyword that was applied to a
function definition
- change a foo more function signatures from "some_type foo()" to
"some_type foo(void)"
- rewrite another K&R style function definition
- make the type of the "action" function pointer in the KeyWord struct
in src/backend/utils/adt/formatting.c more precise
- replace some function signatures of the form "some_type foo()" with
"some_type foo(void)"
- replace a few instances of a literal 0 being used as a NULL pointer;
there are more instances of this in the code, but I just fixed a few
- in src/backend/utils/mb/wstrncmp.c, replace K&R style function
declarations with ANSI style, remove use of 'register' keyword
- remove an "extern" modifier that was applied to a function definition
(rather than a declaration)
Win32 port is now called 'win32' rather than 'win'
add -lwsock32 on Win32
make gethostname() be only used when kerberos4 is enabled
use /port/getopt.c
new /port/opendir.c routines
disable GUC unix_socket_group on Win32
convert some keywords.c symbols to KEYWORD_P to prevent conflict
create new FCNTL_NONBLOCK macro to turn off socket blocking
create new /include/port.h file that has /port prototypes, move
out of c.h
new /include/port/win32_include dir to hold missing include files
work around ERROR being defined in Win32 includes
in all cases, leaked TopMemoryContext memory in others. Make the
interaction between SetClientEncoding and InitializeClientEncoding
cleaner and better documented. I suspect these changes should be
back-patched into 7.3, but will wait on Tatsuo's verification.
correctly. See following thread for more details.
Subject: [HACKERS] client_encoding directive is ignored in postgresql.conf
From: Tatsuo Ishii <t-ishii@sra.co.jp>
Date: Wed, 29 Jan 2003 22:24:04 +0900 (JST)
cleaning up locale names and nothing else. Since all the locale names
are in plain ASCII I think it will be safe to use ASCII-only lower-case
conversion.
Nicolai Tufar
where it's safe to do database access. Along the way, fix core dump
for 'DEFAULT' parameters to CREATE DATABASE. initdb forced due to
change in pg_proc entry.
Eliminate the mysterious games that the Cygwin build plays with the linker
flag variables. DLLLIBS is gone, use SHLIB_LINK like everyone else.
Detect cygipc in configure, after the linker flags are set up, otherwise
configure might not work at all.
Make sure everything is covered by make clean.
Fix the build of the new conversion procedure modules.
Add new DLLIMPORT markers where required.
Finally, the compiler complains if we use an explicit
-I/usr/local/include, so don't do that. Curiously, -L/usr/local/lib is
still necessary.
'make install'; there are enough of 'em that this slowed down the make
noticeably. Ensure that 'all' is the default make target in all these
directories (defaulting to 'make install' is surprising and dangerous
IMHO). Fix a couple small typos.
with OPAQUE, as per recent pghackers discussion. I still want to do some
more work on the 'cstring' pseudo-type, but I'm going to commit the bulk
of the changes now before the tree starts shifting under me ...
conversion procs and conversions are added in initdb. Currently
supported conversions are:
UTF-8(UNICODE) <--> SQL_ASCII, ISO-8859-1 to 16, EUC_JP, EUC_KR,
EUC_CN, EUC_TW, SJIS, BIG5, GBK, GB18030, UHC,
JOHAB, TCVN
EUC_JP <--> SJIS
EUC_TW <--> BIG5
MULE_INTERNAL <--> EUC_JP, SJIS, EUC_TW, BIG5
Note that initial contents of pg_conversion system catalog are created
in the initdb process. So doing initdb required is ideal, it's
possible to add them to your databases by hand, however. To accomplish
this:
psql -f your_postgresql_install_path/share/conversion_create.sql your_database
So I did not bump up the version in cataversion.h.
TODO:
Add more conversion procs
Add [CASCADE|RESTRICT] to DROP CONVERSION
Add tuples to pg_depend
Add regression tests
Write docs
Add SQL99 CONVERT command?
--
Tatsuo Ishii
o Change all current CVS messages of NOTICE to WARNING. We were going
to do this just before 7.3 beta but it has to be done now, as you will
see below.
o Change current INFO messages that should be controlled by
client_min_messages to NOTICE.
o Force remaining INFO messages, like from EXPLAIN, VACUUM VERBOSE, etc.
to always go to the client.
o Remove INFO from the client_min_messages options and add NOTICE.
Seems we do need three non-ERROR elog levels to handle the various
behaviors we need for these messages.
Regression passed.
> > > > It was made to cope with encoding such as an Asian bloc in 7.2Beta2.
> > > >
> > > > Added ServerEncoding
> > > > Korean (JOHAB), Thai (WIN874),
> > > > Vietnamese (TCVN), Arabic (WIN1256)
> > > >
> > > > Added ClientEncoding
> > > > Simplified Chinese (GBK), Korean (UHC)
> > > >
> > > >
> > > >
> http://www.sankyo-unyu.co.jp/Pool/postgresql-7.2b2.newencoding.diff.tar.gz
> > > > (608K)
> > >
> > > Looks good. I need some people to review this for me.
> >
> > For me they look good too. The only missing part is a
> > documentation. I will ask him to write it up. If he couldn't, I will
> > do it for him.
> > > The diff is 3mb
> > > but appears to address only additions to multibyte. I have attached a
> > > list of files it modifies. Also, look at the sizes of the mb/
> > > directory. It is getting large:
> > >
> > > 4 ./CVS
> > > 6 ./Unicode/CVS
> > > 3433 ./Unicode
> > > 6197 .
> >
> > Yes. We definitely need the on-the-fly encoding addition capability:
> > i.e. CREATE CHRACTER SET in the future...
> > --
> > Tatsuo Ishii
> >
> >
Address chainge.
http://www.sankyo-unyu.co.jp/Pool/postgresql-7.2.newencoding.diff.gz
Add PsqlODBC and document ...etc patch.
Eiji Tokuya
side encoding name. This is necessary for client API's such as JDBC
to perform correct encoding conversions. See my email "[HACKERS]
pg_client_encoding" 10 Sep 2001.
> > ClientEncoding to SQL_ASCII (like default DatabaseEncoding). Bruce, can
> > you change it? It's one line change. Again thanks.
Forget it! A default client encoding must be set by actual database encoding...
Please apply the small attached patch that solve it better.
Karel Zak
-------------------------------------------------------------------
Subject: Re: [PATCHES] encoding names
From: Karel Zak <zakkr@zf.jcu.cz>
To: Peter Eisentraut <peter_e@gmx.net>
Cc: pgsql-patches <pgsql-patches@postgresql.org>
Date: Fri, 31 Aug 2001 17:24:38 +0200
On Thu, Aug 30, 2001 at 01:30:40AM +0200, Peter Eisentraut wrote:
> > - convert encoding 'name' to 'id'
>
> I thought we decided not to add functions returning "new" names until we
> know exactly what the new names should be, and pending schema
Ok, the patch not to add functions.
> better
>
> ...(): encoding name too long
Fixed.
I found new bug in command/variable.c in parse_client_encoding(), nobody
probably never see this error:
if (pg_set_client_encoding(encoding))
{
elog(ERROR, "Conversion between %s and %s is not supported",
value, GetDatabaseEncodingName());
}
because pg_set_client_encoding() returns -1 for error and 0 as true.
It's fixed too.
IMHO it can be apply.
Karel
PS:
* following files are renamed:
src/utils/mb/Unicode/KOI8_to_utf8.map -->
src/utils/mb/Unicode/koi8r_to_utf8.map
src/utils/mb/Unicode/WIN_to_utf8.map -->
src/utils/mb/Unicode/win1251_to_utf8.map
src/utils/mb/Unicode/utf8_to_KOI8.map -->
src/utils/mb/Unicode/utf8_to_koi8r.map
src/utils/mb/Unicode/utf8_to_WIN.map -->
src/utils/mb/Unicode/utf8_to_win1251.map
* new file:
src/utils/mb/encname.c
* removed file:
src/utils/mb/common.c
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
of a counted input string. Marinos Yannikos' recent crash report turns
out to be due to applying pg_ascii2wchar_with_len to a TEXT object that
is smack up against the end of memory. This is the second just-barely-
reproducible bug report I have seen that traces to some bit of code
fetching one more byte than it is allowed to. Let's be more careful
out there, boys and girls.
While at it, I changed the code to not risk a similar crash when there
is a truncated multibyte character at the end of an input string. The
output in this case might not be the most reasonable output possible;
if anyone wants to improve it further, step right up...
are now separate files "postgres.h" and "postgres_fe.h", which are meant
to be the primary include files for backend .c files and frontend .c files
respectively. By default, only include files meant for frontend use are
installed into the installation include directory. There is a new make
target 'make install-all-headers' that adds the whole content of the
src/include tree to the installed fileset, for use by people who want to
develop server-side code without keeping the complete source tree on hand.
Cleaned up a whole lot of crufty and inconsistent header inclusions.
$(CC) $(CFLAGS) $(LDFLAGS) <object files> <extra-libraries> $(LIBS) -o $@
This form seemed to be the most portable, readable, and logical, but in any
case it's better than having a dozen different ones in the tree.
--------------------------------------------------
Subject: Bug in unicode conversion ...
From: Jan Varga <varga@utcru.sk>
To: t-ishii@sra.co.jp
Date: Sat, 18 Nov 2000 17:41:20 +0100 (CET)
Hi,
I tried this new feature in PostgreSQL. I found one bug.
Script UCS_to_8859.pl skips input lines which
1. code <0x80 or
2. ucs <0x100
I think second one is not good idea because some codes in ISO8859-2
have ucs <0x100 (e.g. 0xE9 - 0x00E9)
--------------------------------------------------
cloned, rather than always cloning template1. Modify initdb to generate
two identical databases rather than one, template0 and template1.
Connections to template0 are disallowed, so that it will always remain
in its virgin as-initdb'd state. pg_dumpall now dumps databases with
restore commands that say CREATE DATABASE foo WITH TEMPLATE = template0.
This allows proper behavior when there is user-added data in template1.
initdb forced!
MULTIBYTE support is not compiled (you just can't set them to anything
but SQL_ASCII). This should reduce interoperability problems between
MB-enabled clients and non-MB-enabled servers.
source directory. This involves mostly makefiles using $(srcdir) when they
might have used ".". (Regression tests don't work with this, yet.)
Sort out usage of CPPFLAGS, CFLAGS (and CXXFLAGS). Add "override" keyword
in most places, to preserve necessary flags even when the user overrode the
flags.
for details). It doesn't really do that much yet, since there are no
short-term memory contexts in the executor, but the infrastructure is
in place and long-term contexts are handled reasonably. A few long-
standing bugs have been fixed, such as 'VACUUM; anything' in a single
query string crashing. Also, out-of-memory is now considered a
recoverable ERROR, not FATAL.
Eliminate a large amount of crufty, now-dead code in and around
memory management.
Fix problem with holding off SIGTRAP, SIGSEGV, etc in postmaster and
backend startup.
it failed to cover the case where high bits of char are 100 or 101.
Not sure if fix is right, but it agrees with pg_utf_mblen ... and it
doesn't lock up ...