Sorry, meant to send this to the list:
---------- Forwarded message ----------
From: Kevin Kenny <kevi...@gmail.com>
Date: Wed, Apr 18, 2018 at 11:41 AM
Subject: Re: [TCLCORE] CFV Warning: TIP 389: Full support for Unicode 10.0
and later
To: Jan Nijtmans <jan....@gmail.com>
On Wed, Apr 18, 2018 at 11:14 AM, Jan Nijtmans <jan....@gmail.com>
wrote:
> Packages which don't bother to handle characters higher than
> U+0FFFF don't need to do anything. The 'state' is only used
> when handling surrogates, so any operation without surrogates
> will continue to work fine. I hope that the change in tdbcodbc
> shows how easy it is to make an extension handle surrogate
> pairs correctly: Just initialize some variables. Any extension
> not doing the initialization, runs the risk of an incoming
> low surrogate as first character of a string pairing with random
> data to form a partially random 4-byte UTF-8 character. That's all.
In dealing with ODBC, Windows APIs, and streams of 16-bit characters
coming from external media, that's not entirely true. An application today
can deal with malformed UTF-16 simply by ignoring UTF-16 and treating the
data as if it were UCS-2.
For tdbc::odbc, applications must at least be able to process malformed
data in an existing SQL Server or Oracle database. I'm convinced that the
right way to handle this will involve having some sort of specific
representation for the bad data. There's no really good way to accomplish
this. I'm thinking that the least-worst approach will be to represent
unpaired surrogates with private use characters in the range
U+FD800-U+FDFFF and flag this usage by prefixing the private use character
with the noncharacter U+FDEF. (The noncharacters U+FDD0-U+FDEF would be
flagged similarly.)
That is, on input from a UTF-16 medium:
Unpaired surrogate U+Dxxx (in the range D800-DFFF) -> U+FDEF U+FDxxx
Noncharacter U+FDxx (in the range FDD0-FDEF) -> U+FDEF U+FDxx
with the same transformation applied in reverse on output.
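To make the mapping above concrete, here is a minimal sketch in Python (not Tcl or C, and not anything that exists in tdbc::odbc). The function names decode_units and encode_units are hypothetical, and the sketch assumes the input is a list of 16-bit code units and the internal form is a list of code points. How to treat a stray U+FDEF that is not followed by an escaped value on output is not specified above; the sketch simply passes it through, which is an assumption.

```python
ESCAPE = 0xFDEF        # noncharacter used as the flag prefix
PUA_OFFSET = 0xF0000   # maps U+D800..U+DFFF onto U+FD800..U+FDFFF


def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def is_nonchar(u):
    return 0xFDD0 <= u <= 0xFDEF


def decode_units(units):
    """Input side: 16-bit code units -> code points, escaping bad data."""
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_high(u) and i + 1 < n and is_low(units[i + 1]):
            # Well-formed surrogate pair: combine into one code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        if is_high(u) or is_low(u):
            # Unpaired surrogate U+Dxxx -> U+FDEF U+FDxxx
            out.extend([ESCAPE, PUA_OFFSET + u])
        elif is_nonchar(u):
            # Noncharacter U+FDxx -> U+FDEF U+FDxx
            out.extend([ESCAPE, u])
        else:
            out.append(u)
        i += 1
    return out


def encode_units(cps):
    """Output side: code points -> 16-bit code units, undoing the escapes."""
    out = []
    i, n = 0, len(cps)
    while i < n:
        c = cps[i]
        if c == ESCAPE and i + 1 < n:
            nxt = cps[i + 1]
            if 0xFD800 <= nxt <= 0xFDFFF:
                out.append(nxt - PUA_OFFSET)   # restore the unpaired surrogate
                i += 2
                continue
            if is_nonchar(nxt):
                out.append(nxt)                # restore the noncharacter
                i += 2
                continue
        if c >= 0x10000:
            # Ordinary supplementary character: emit a surrogate pair.
            c -= 0x10000
            out.append(0xD800 + (c >> 10))
            out.append(0xDC00 + (c & 0x3FF))
        else:
            # Assumption: a lone U+FDEF with no escaped follower passes through.
            out.append(c)
        i += 1
    return out
```

Because every U+FDEF arriving from the medium is itself escaped, an unescaped private use character such as U+FD800 that arrived as a valid surrogate pair stays distinguishable from an escaped unpaired surrogate, so encode_units(decode_units(x)) should give back x for any sequence of 16-bit units.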
As I observed earlier, we'll need to do something similar with the Windows
file system, which offers no protection against the inclusion of unpaired
surrogates or noncharacters in file names.
This scheme could be extended to representations of malformed UTF-8 as
well, but there's much less of a need. Malformed UTF-8 can always be
processed as a byte array. But we don't have the equivalent concept for a
16-bit-codepoint array.