[TCLCORE] Fwd: CFV Warning: TIP 389: Full support for Unicode 10.0 and later

From: Kevin Kenny <kevi...@gmail.com>
Wed, 18 Apr 2018 12:13:55 -0400
Sorry, meant to send this to the list:

---------- Forwarded message ----------
From: Kevin Kenny <kevi...@gmail.com>
Date: Wed, Apr 18, 2018 at 11:41 AM
Subject: Re: [TCLCORE] CFV Warning: TIP 389: Full support for Unicode 10.0
and later
To: Jan Nijtmans <jan....@gmail.com>


On Wed, Apr 18, 2018 at 11:14 AM, Jan Nijtmans <jan....@gmail.com>
wrote:

> Packages which don't bother to handle characters higher than
> U+0FFFF don't need to do anything. The 'state' is only used
> when handling surrogates, so any operation without surrogates
> will continue to work fine. I hope that the change in tdbcodbc
> shows how easy it is to make an extension handle surrogate
> pairs correctly: Just initialize some variables. Any extension
> not doing the initialization, runs the risk of an incoming
> low surrogate as first character of a string pairing with random
> data to form a partially random 4-byte UTF-8 character. That's all.

In dealing with ODBC, Windows APIs, and streams of 16-bit characters
coming from external media, that's not entirely true. An application today
can deal with malformed UTF-16 simply by ignoring UTF-16 and treating the
data as if it were UCS-2.
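
A minimal Python sketch of that UCS-2 view, for illustration only: each
16-bit unit is carried through as an independent value, so an unpaired
surrogate survives a round trip untouched. The helper names `ucs2_units`
and `ucs2_bytes` are hypothetical, and little-endian byte order is an
assumption made for the example.

```python
import struct

def ucs2_units(data: bytes) -> list:
    """View little-endian 16-bit data as independent code units (UCS-2 style).

    No surrogate pairing is attempted, so malformed UTF-16 is never an error.
    """
    if len(data) % 2:
        raise ValueError("odd number of bytes in 16-bit data")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def ucs2_bytes(units: list) -> bytes:
    """Inverse of ucs2_units: write each 16-bit unit back unchanged."""
    return struct.pack("<%dH" % len(units), *units)

# 'A', an unpaired high surrogate U+D800, 'B'
raw = bytes([0x41, 0x00, 0x00, 0xD8, 0x42, 0x00])
units = ucs2_units(raw)
```

Once surrogate pairs are interpreted, the same input is malformed, which is
the case the rest of this message is concerned with.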

For tdbc::odbc, applications must at least be able to process malformed
data in an existing SQL Server or Oracle database. I'm convinced that the
right way to handle this will involve having some sort of specific
representation for the bad data. There's no really good way to accomplish
this. I'm thinking that the least-worst approach will be to represent
unpaired surrogates with private use characters in the range
U+FD800-U+FDFFF and flag this usage by prefixing the private use character
with the noncharacter U+FDEF. (The noncharacters U+FDD0-U+FDEF would be
flagged similarly.)

That is, on input from a UTF-16 medium:

Unpaired surrogate U+Dxxx (in the range D800-DFFF) ->  U+FDEF U+FDxxx
Noncharacter U+FDxx (in the range FDD0-FDEF) -> U+FDEF U+FDxx

with the same transformation applied in reverse on output.
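
The mapping above could be sketched roughly as follows. This is an
illustration in Python, not code from Tcl or tdbc::odbc: it works on lists
of integers (16-bit code units in, code points out) to sidestep any
language-specific string representation, and the names `decode_units`,
`encode_units`, `ESCAPE` and `PUA_OFFSET` are hypothetical. It assumes that
well-formed surrogate pairs are combined into supplementary code points in
the usual way.

```python
ESCAPE = 0xFDEF        # noncharacter used as the flag prefix
PUA_OFFSET = 0xF0000   # U+Dxxx + 0xF0000 == U+FDxxx (plane 15 private use)

def is_high(u): return 0xD800 <= u <= 0xDBFF
def is_low(u): return 0xDC00 <= u <= 0xDFFF
def is_nonchar(u): return 0xFDD0 <= u <= 0xFDEF

def decode_units(units):
    """UTF-16 code units -> code points, escaping malformed input."""
    out, i, n = [], 0, len(units)
    while i < n:
        u = units[i]
        if is_high(u) and i + 1 < n and is_low(units[i + 1]):
            # Well-formed pair: combine into one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        if is_high(u) or is_low(u):
            out.extend((ESCAPE, PUA_OFFSET + u))   # unpaired surrogate
        elif is_nonchar(u):
            out.extend((ESCAPE, u))                # flagged noncharacter
        else:
            out.append(u)
        i += 1
    return out

def encode_units(cps):
    """Code points -> UTF-16 code units, undoing the escapes."""
    out, i, n = [], 0, len(cps)
    while i < n:
        cp = cps[i]
        if cp == ESCAPE and i + 1 < n:
            nxt = cps[i + 1]
            if 0xFD800 <= nxt <= 0xFDFFF:
                out.append(nxt - PUA_OFFSET)       # restore the surrogate
                i += 2
                continue
            if is_nonchar(nxt):
                out.append(nxt)                    # restore the noncharacter
                i += 2
                continue
        if cp > 0xFFFF:
            cp -= 0x10000
            out.extend((0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
        else:
            out.append(cp)
        i += 1
    return out
```

Because every U+FDEF in the input is itself escaped, a U+FDEF in the decoded
form only ever appears as a flag, which is what lets the reverse mapping
tell an escaped surrogate apart from a genuine U+FDxxx private use character.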

As I observed earlier, we'll need to do something similar with the Windows
file system, which offers no protection against the inclusion of unpaired
surrogates or noncharacters in file names.

This scheme could be extended to representations of malformed UTF-8 as
well, but there's much less of a need. Malformed UTF-8 can always be
processed as a byte array. But we don't have the equivalent concept for a
16-bit-codepoint array.
