Commit Graph

87 Commits

Author SHA1 Message Date
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Andrew Dunstan 3717f0837b Tidy up from frontend Assert change.
Quiet compiler warnings noted by Peter Eisentraut.
2012-12-16 12:22:57 -05:00
Tom Lane 60e9c224a1 Fix ASCII case in pg_wchar2mule_with_len.
Also some cosmetic improvements for wchar-to-mblen patch.
2012-07-10 15:59:39 -04:00
Robert Haas f6a05fd973 Fix failure of new wchar->mb functions to advance from pointer.
Bug spotted by Tom Lane.
2012-07-05 23:47:53 -04:00
Robert Haas 72dd6291f2 Add wchar -> mb conversion routines.
This is infrastructure for Alexander Korotkov's work on indexing regular
expression searches.

Alexander Korotkov, with a bit of further hackery on the MULE conversion
by me
2012-07-04 17:10:10 -04:00
Tom Lane 09022de1f5 Improve documentation about MULE encoding.
This commit improves the comments in pg_wchar.h and creates #define symbols
for some formerly hard-coded values.  No substantive code changes.

Tatsuo Ishii and Tom Lane
2012-07-04 00:29:57 -04:00
Bruce Momjian 927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Robert Haas 5d4b60f2f2 Lots of doc corrections.
Josh Kupershmidt
2012-04-23 22:43:09 -04:00
Tom Lane eb5834d5af Further improvement of make_greater_string.
Make sure that it considers all the possibilities that the old code did,
instead of trying only one possibility per character position.  To keep the
runtime in bounds, instead tweak the character incrementers to not try
every possible multibyte character code.  Remove unnecessary logic to
restore the old character value on failure.  Additional comment and
formatting cleanup.
2011-10-30 12:22:11 -04:00
Robert Haas 78d523b633 Improve make_greater_string() with encoding-specific incrementers.
This infrastructure doesn't in any way guarantee that the character
we produce will sort before the one we incremented; but it does at least
make it much more likely that we'll end up with something that is a valid
character, which improves our chances.

Kyotaro Horiguchi, with various adjustments by me.
2011-10-29 14:22:20 -04:00
Peter Eisentraut a2a5ce6826 Improve "invalid byte sequence for encoding" message
It used to say

ERROR:  invalid byte sequence for encoding "UTF8": 0xdb24

Change this to

ERROR:  invalid byte sequence for encoding "UTF8": 0xdb 0x24

to make it clear that this is a byte sequence and not a code point.

Also fix the adjacent "character has no equivalent" message that has
the same issue.
2011-09-05 23:38:27 +03:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Tom Lane 2d8314bd43 Rename utf2ucs() to utf8_to_unicode(), and export it so it can be used
elsewhere.

Similarly rename the version in mbprint.c, not because this affects anything
but just to keep the two copies in exact sync.  There was some discussion of
having only one copy in src/port/ instead, but this function is so small
and unlikely to change that that seems like overkill.

Slightly editorialized version of a patch by Joseph Adams.  (The bug-fix
aspect of his patch was applied separately, and back-patched.)
2010-08-18 19:54:01 +00:00
Andrew Dunstan fc09fb7bcf Remove sometimes inaccurate error hint about source of wrongly encoded data. 2010-01-04 20:38:31 +00:00
Bruce Momjian d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Tom Lane fd9e2accef When we are in error recursion trouble, arrange to suppress translation and
encoding conversion of any elog/ereport message being sent to the frontend.
This generalizes a patch that I put in last October, which suppressed
translation of only specific messages known to be associated with recursive
can't-translate-the-message behavior.  As shown in bug #4680, we need a more
general answer in order to have some hope of coping with broken encoding
conversion setups.  This approach seems a good deal less klugy anyway.

Patch in all supported branches.
2009-03-02 21:18:43 +00:00
Peter Eisentraut 8b9dd6b5fd Support for KOI8U encoding 2009-02-10 19:29:39 +00:00
Peter Eisentraut 1cb54c2860 Remove the encoding *numbers* from the comments. They are useless, and
make maintenance harder.
2009-02-10 16:44:44 +00:00
Tom Lane 0d65eea3da Replace argument-checking Asserts with regular test-and-elog checks in all
encoding conversion functions.  These are not can't-happen cases because
it's possible to create a conversion with the wrong conversion function
for the specified encoding pair.  That would lead to an Assert crash in
an Assert-enabled build, or incorrect conversion otherwise, neither of
which is desirable.  This would be a DOS issue if production databases
were customarily built with asserts enabled, but fortunately that's not so.
Per an observation by Heikki.

Back-patch to all supported branches.
2009-01-29 19:23:42 +00:00
Peter Eisentraut 06735e3256 Unicode escapes in strings and identifiers 2008-10-29 08:04:54 +00:00
Tom Lane b0169bb124 Install a more robust solution for the problem of infinite error-processing
recursion when we are unable to convert a localized error message to the
client's encoding.  We've been over this ground before, but as reported by
Ibrar Ahmed, it still didn't work in the case of conversion failures for
the conversion-failure message itself :-(.  Fix by installing a "circuit
breaker" that disables attempts to localize this message once we get into
recursion trouble.

Patch all supported branches, because it is in fact broken in all of them;
though I had to add some missing translations to the older branches in
order to expose the failure in the particular test case I was using.
2008-10-27 19:37:22 +00:00
Bruce Momjian fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane febd60bf5d Fix pg_wchar_table[] to match revised ordering of the encoding ID enum.
Add some comments so hopefully the next poor sod doesn't fall into the
same trap.  (Wrong comments are worse than none at all...)
2007-10-15 22:46:27 +00:00
Andrew Dunstan 55613bf9cd Close previously open holes for invalidly encoded data to enter the
database via builtin functions, as recently discussed on -hackers.

chr() now returns a character in the database encoding. For UTF8 encoded databases
the argument is treated as a Unicode code point. For other multi-byte encodings
the argument must designate a strict ascii character, or an error is raised,
as is also the case if the argument is 0.

ascii() is adjusted so that it remains the inverse of chr().

The two argument form of convert() is gone, and the three argument form now
takes a bytea first argument and returns a bytea. To cover this loss three new
functions are introduced:
. convert_from(bytea, name) returns text - converts the first argument from the
  named encoding to the database encoding
. convert_to(text, name) returns bytea - converts the first argument from the
  database encoding to the named encoding
. length(bytea, name) returns int - gives the length of the first argument in
  characters in the named encoding
2007-09-18 17:41:17 +00:00
Tom Lane 4dbbef2845 Suppress an integer-overflow warning. 2007-07-12 21:17:09 +00:00
Tatsuo Ishii 6041b92238 Make JOHAB client only encoding per discussions in pgsql-hackers
"Server-side support of all encodings" around 2007/3/26.
initdb required.
2007-04-15 10:56:30 +00:00
Tatsuo Ishii a6fbd2f12a Fix pg_wchar_table's maxmblen field of EUC_CN, EUC_TW, MULE_INTERNAL
and GB18030. patches from ITAGAKI Takahiro.
2007-03-26 11:15:13 +00:00
Tatsuo Ishii 75c6519ff6 Add new encoding EUC_JIS_2004 and SHIFT_JIS_2004,
along with new conversions among EUC_JIS_2004, SHIFT_JIS_2004 and UTF-8.
catalog version has been bump up.
2007-03-25 11:56:04 +00:00
Tom Lane 0887fa1117 Get pg_utf_mblen(), pg_utf2wchar_with_len(), and utf2ucs() all on the same
page about the maximum UTF8 sequence length we support (4 bytes since 8.1,
3 before that).  pg_utf2wchar_with_len never got updated to support 4-byte
characters at all, and in any case had a buffer-overrun risk in that it
could produce multiple pg_wchars from what mblen claims to be just one UTF8
character.  The only reason we don't have a major security hole is that most
callers allocate worst-case output buffers; the sole exception in released
versions appears to be pre-8.2 iwchareq() (ie, ILIKE), which can be crashed
due to zeroing out its return address --- but AFAICS that can't be exploited
for anything more than a crash, due to inability to control what gets written
there.  Per report from James Russell and Michael Fuhr.

Pre-8.1 the risk is much less, but I still think pg_utf2wchar_with_len's
behavior given an incomplete final character risks buffer overrun, so
back-patch that logic change anyway.

This patch also makes sure that UTF8 sequences exceeding the supported
length (whichever it is) are consistently treated as error cases, rather
than being treated like a valid shorter sequence in some places.
2007-01-24 17:12:17 +00:00
Bruce Momjian f99a569a2e pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
Bruce Momjian a3132359fd In new "invalid byte sequence" error hint, call it "error", not
"failure".
2006-08-22 12:11:28 +00:00
Bruce Momjian e11cab650c Add hint for "invalid byte sequence for encoding" error message,
suggesting review of client_encoding.
2006-08-22 03:30:20 +00:00
Tom Lane c61a2f5841 Change the backend to reject strings containing invalidly-encoded multibyte
characters in all cases.  Formerly we mostly just threw warnings for invalid
input, and failed to detect it at all if no encoding conversion was required.
The tighter check is needed to defend against SQL-injection attacks as per
CVE-2006-2313 (further details will be published after release).  Embedded
zero (null) bytes will be rejected as well.  The checks are applied during
input to the backend (receipt from client or COPY IN), so it no longer seems
necessary to check in textin() and related routines; any string arriving at
those functions will already have been validated.  Conversion failure
reporting (for characters with no equivalent in the destination encoding)
has been cleaned up and made consistent while at it.

Also, fix a few longstanding errors in little-used encoding conversion
routines: win1251_to_iso, win866_to_iso, euc_tw_to_big5, euc_tw_to_mic,
mic_to_euc_tw were all broken to varying extents.

Patches by Tatsuo Ishii and Tom Lane.  Thanks to Akio Ishida and Yasuo Ohgaki
for identifying the security issues.
2006-05-21 20:05:21 +00:00
Peter Eisentraut 1b658473ea Add support for Windows codepages 1253, 1254, 1255, and 1257 and clean
up a bunch of the support utilities.

In src/backend/utils/mb/Unicode remove nearly duplicate copies of the
UCS_to_XXX perl script and replace with one version to handle all generic
files.  Update the Makefile so that it knows about all the map files.
This produces a slight difference in some of the map files, using a
uniform naming convention and not mapping the null character.

In src/backend/utils/mb/conversion_procs create a master utf8<->win
codepage function like the ISO 8859 versions instead of having a separate
handler for each conversion.

There is an externally visible change in the name of the win1258 to utf8
conversion.  According to the documentation notes, it was named
incorrectly and this changes it to a standard name.

Running the Unicode mapping perl scripts has shown some additional mapping
changes in koi8r and iso8859-7.
2006-02-18 16:15:23 +00:00
Bruce Momjian c01999a557 Allow psql multi-line column values to align in the proper columns
If the second output column value is 'a\nb', the 'b' should appear
  in the second display column, rather than the first column as it
  does now.

Change libpq's PQdsplen() to return more useful values.

> Note: this changes the PQdsplen function, it can now return zero or
> minus one which was not possible before. It doesn't appear anyone is
> actually using the functions other than psql but it is a change. The
> functions are not actually documentated anywhere so it's not like we're
> breaking a defined interface. The new semantics follow the Unicode
> standard.

BACKWARD COMPATIBLE CHANGE.

The only user-visible change I saw in the regression tests is that a
SELECT * on a table where all the columns have been dropped doesn't
return a blank line like before.  This seems like a step forward.

Martijn van Oosterhout
2006-02-10 00:39:04 +00:00
Bruce Momjian a2384d008a More uses of IS_HIGHBIT_SET() macro. 2005-12-26 19:30:45 +00:00
Bruce Momjian 261114a23f I have added these macros to c.h:
#define HIGHBIT                 (0x80)
        #define IS_HIGHBIT_SET(ch)      ((unsigned char)(ch) & HIGHBIT)

and removed CSIGNBIT and mapped it uses to HIGHBIT.  I have also added
uses for IS_HIGHBIT_SET where appropriate.  This change is
purely for code clarity.
2005-12-25 02:14:19 +00:00
Bruce Momjian d8a8183456 Formatting cleanups. 2005-12-24 17:19:40 +00:00
Bruce Momjian 0658a6a634 Formatting cleanup. 2005-12-24 16:49:48 +00:00
Tatsuo Ishii 804f6b8fc9 Fix long standing Asian multibyte charsets bug.
See:

Subject: [HACKERS] bugs with certain Asian multibyte charsets
From: Tatsuo Ishii <ishii@sraoss.co.jp>
To: pgsql-hackers@postgresql.org
Date: Sat, 24 Dec 2005 18:25:33 +0900 (JST)

for more details/
2005-12-24 09:35:36 +00:00
Peter Eisentraut 07bb9f086b Message corrections 2005-10-29 00:31:52 +00:00
Bruce Momjian 1dc3498251 Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
Tom Lane 8889685555 Suppress signed-vs-unsigned-char warnings. 2005-09-24 17:53:28 +00:00
Bruce Momjian 5955945828 Support 3 and 4-byte unicode characters.
John Hansen
2005-06-15 00:15:08 +00:00
Bruce Momjian e7fb9f18bf Add support for Win1252 encoding.
Roland Volkmann
2005-03-14 18:31:25 +00:00
Bruce Momjian 41e2a80f57 Update comments for new encoding names. 2005-03-14 00:19:13 +00:00
Bruce Momjian e3d7de6b99 Rename canonical encodings, per Peter:
UNICODE => UTF8
	ALT => WIN866
	WIN => WIN1251
	TCVN => WIN1258

The old codes continue to work.
2005-03-07 04:30:55 +00:00
Bruce Momjian 08e0b34bad Back out fix for Unicode characters above 0x10000 2004-12-03 01:20:33 +00:00
Bruce Momjian 4ea4f8bd06 Fix for Unicode characters above 0x10000.
John Hansen
2004-12-02 22:37:14 +00:00
Peter Eisentraut 152a101f2b Allow WIN1250 as server encoding. 2004-09-17 21:59:57 +00:00