[TCLCORE] Fwd: CFV Warning: TIP 389: Full support for Unicode 10.0 and later

From: Kevin Kenny <kevi...@gmail.com>

Wed, 18 Apr 2018 12:13:55 -0400

Sorry, meant to send this to the list:

---------- Forwarded message ----------
From: Kevin Kenny <kevi...@gmail.com>
Date: Wed, Apr 18, 2018 at 11:41 AM
Subject: Re: [TCLCORE] CFV Warning: TIP 389: Full support for Unicode 10.0
and later
To: Jan Nijtmans <jan....@gmail.com>

On Wed, Apr 18, 2018 at 11:14 AM, Jan Nijtmans <jan....@gmail.com>
wrote:

> Packages which don't bother to handle characters higher than
> U+0FFFF don't need to do anything. The 'state' is only used
> when handling surrogates, so any operation without surrogates
> will continue to work fine. I hope that the change in tdbcodbc
> shows how easy it is to make an extension handle surrogate
> pairs correctly: Just initialize some variables. Any extension
> not doing the initialization, runs the risk of an incoming
> low surrogate as first character of a string pairing with random
> data to form a partially random 4-byte UTF-8 character. That's all.

In dealing with ODBC, Windows APIs, and streams of 16-bit characters
coming from external media, that's not entirely true. An application today
can deal with malformed UTF-16 simply by ignoring UTF-16 and treating the
data as if it were UCS-2.
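To make the UCS-2 point concrete, here is a small Python sketch (not Tcl code; the helper names `as_ucs2` and `strict_utf16_ok` are made up for illustration). A strict UTF-16 decoder rejects a buffer containing a lone surrogate, while the same buffer viewed as a flat array of 16-bit units passes through untouched:

```python
import struct

# 'A', a lone high surrogate, 'B' as little-endian 16-bit units.
raw = struct.pack("<3H", 0x0041, 0xD800, 0x0042)


def as_ucs2(data):
    """View little-endian bytes as a flat list of 16-bit code units."""
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def strict_utf16_ok(data):
    """Return True if Python's strict UTF-16 decoder accepts the bytes."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


print([hex(u) for u in as_ucs2(raw)])
print(strict_utf16_ok(raw))
```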

For tdbc::odbc, applications must at least be able to process malformed
data in an existing SQL Server or Oracle database. I'm convinced that the
right way to handle this will involve having some sort of specific
representation for the bad data. There's no really good way to accomplish
this. I'm thinking that the least-worst approach will be to represent
unpaired surrogates with private use characters in the range
U+FD800-U+FDFFF and flag this usage by prefixing the private use character
with the noncharacter U+FDEF. (The noncharacters U+FDD0-U+FDEF would be
flagged similarly.)

That is, on input from a UTF-16 medium:

Unpaired surrogate U+Dxxx (in the range D800-DFFF) ->  U+FDEF U+FDxxx
Noncharacter U+FDxx (in the range FDD0-FDEF) -> U+FDEF U+FDxx

with the same transformation applied in reverse on output.
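A rough sketch of that mapping, in Python rather than Tcl's C, just to check that it round-trips. The function names and the list-of-integers representation are assumptions for illustration, not anything in Tcl or tdbc::odbc:

```python
ESCAPE = 0xFDEF      # noncharacter used as the flag
PUA_BASE = 0xF0000   # U+Dxxx is carried as U+FDxxx = PUA_BASE + 0xDxxx


def decode_utf16_units(units):
    """16-bit code units -> code points, escaping malformed input."""
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Well-formed surrogate pair: combine into one code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        if 0xD800 <= u <= 0xDFFF:
            out += [ESCAPE, PUA_BASE + u]   # unpaired surrogate
        elif 0xFDD0 <= u <= 0xFDEF:
            out += [ESCAPE, u]              # noncharacter, incl. U+FDEF itself
        else:
            out.append(u)
        i += 1
    return out


def encode_utf16_units(cps):
    """Code points -> 16-bit code units, undoing the escapes."""
    out = []
    i, n = 0, len(cps)
    while i < n:
        cp = cps[i]
        if cp == ESCAPE and i + 1 < n:
            nxt = cps[i + 1]
            if 0xFD800 <= nxt <= 0xFDFFF:
                out.append(nxt - PUA_BASE)  # restore the unpaired surrogate
                i += 2
                continue
            if 0xFDD0 <= nxt <= 0xFDEF:
                out.append(nxt)             # restore the noncharacter
                i += 2
                continue
        if cp >= 0x10000:
            cp -= 0x10000
            out += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            out.append(cp)
        i += 1
    return out


units = [0x41, 0xD800, 0x42, 0xD83D, 0xDE00, 0xFDD0, 0xDC00]
assert encode_utf16_units(decode_utf16_units(units)) == units
```

Because a U+FDEF in the input is itself escaped as U+FDEF U+FDEF, a genuine private use character in U+FD800-U+FDFFF that happens to follow a noncharacter is not mistaken for an escaped surrogate.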

As I observed earlier, we'll need to do something similar with the Windows
file system, which offers no protection against the inclusion of unpaired
surrogates or noncharacters in file names.

This scheme could be extended to representations of malformed UTF-8 as
well, but there's much less of a need. Malformed UTF-8 can always be
processed as a byte array. But we don't have the equivalent concept for a
16-bit-codepoint array.


_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core