
Re: [perl #125221] [PATCH] Use "UTF-8" consistently in perldelta

From: Karl Williamson <pub...@khwilliamson.com>
Thu, 21 May 2015 21:50:55 -0600
Top posting to cut to the chase:

The use of capitalization and presence or absence of a dash to indicate 
whether we accept malformed utf8 or not was wrong.  Subtle distinctions, 
especially ones like these that can be easily overlooked, shouldn't have 
such severe consequences.  The same argument applies to whether we 
accept well-formed utf8 that is in one of the 3 problematic Unicode 
classes (surrogates, non-characters, and above-Unicode code points). 
That API should be fixed before more damage is done.
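
To make the strict versus lax distinction concrete, here is a minimal sketch. It is written in Python purely as an illustration and does not use Perl's Encode API. It assumes that Python's built-in "utf-8" codec can stand in for the strict spelling and that the "surrogatepass" error handler can stand in for the lax one, for surrogates only. The helper names are hypothetical.

```python
# Illustration only: Python stand-ins for "strict" and "lax" decoding.
# These helpers are hypothetical and are not part of Perl or Encode.

# Byte sequence that follows the UTF-8 bit layout for U+D800, a surrogate.
SURROGATE_BYTES = b"\xed\xa0\x80"


def decode_strict(data: bytes) -> str:
    """Decode with the default handler, which rejects encoded surrogates."""
    return data.decode("utf-8")


def decode_lax(data: bytes) -> str:
    """Decode while letting encoded surrogates through."""
    return data.decode("utf-8", errors="surrogatepass")


def accepts(decoder, data: bytes) -> bool:
    """Return True if the decoder takes the input without raising."""
    try:
        decoder(data)
    except UnicodeDecodeError:
        return False
    return True


print("strict accepts surrogate:", accepts(decode_strict, SURROGATE_BYTES))
print("lax accepts surrogate:   ", accepts(decode_lax, SURROGATE_BYTES))
```

The same bytes pass one decoder and fail the other, and the only thing that separates the two calls is a keyword argument. That is at least visible at the call site, which a difference in capitalization or a dash is not.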

I think we should apply "UTF-8" to everything, and forget about the 
distinctions.  I wouldn't object to uniformly getting rid of the dash. 
I don't believe we would get sued for doing these things.

The fact that these different spellings have shown up in our 
documentation is proof that even perl porters don't pay much attention 
to the distinctions, so the average programmer is not going to notice at 
all.

I have no doubt that if Unicode ran out of code points, they would 
simply increase the number available, with all previous protestations to 
the contrary becoming null and void.  Discussions about doing this do 
come up on the Unicode mailing list from time to time. 
But that's not going to happen anytime soon.  They've assigned about a 
quarter of the 2**21 so far allocated, in more than 20 years.  At that 
rate, it would be more than 60 more years before they would fill up, 
even ignoring the fact that the rate of assignment has been decreasing. 
There could always be some new technology that would gobble up code 
points: emoji might end up doing that, but Unicode is hoping to get out 
of the new emoji business, and there are prospects of new technology 
allowing that to happen.
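
The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only a sketch: it takes the figures as stated (a quarter of 2**21 assigned over 20 years) and assumes the past average assignment rate simply continues. The function name is hypothetical.

```python
from fractions import Fraction


def years_until_full(assigned_fraction: Fraction, years_elapsed: int) -> Fraction:
    """Years left to fill the space if the past average rate continues."""
    fraction_per_year = assigned_fraction / years_elapsed
    remaining_fraction = 1 - assigned_fraction
    return remaining_fraction / fraction_per_year


# Figures as given in the text: about a quarter of 2**21, in about 20 years.
TOTAL_CODE_POINTS = 2 ** 21
assigned = TOTAL_CODE_POINTS * Fraction(1, 4)
estimate = years_until_full(Fraction(1, 4), 20)

print(f"assigned so far: about {int(assigned):,} of {TOTAL_CODE_POINTS:,}")
print(f"years until full at the same rate: {estimate}")
```

With exactly 20 years the result is 60. Since the quarter actually took more than 20 years and the rate has been falling, the real figure would be larger, which matches "more than 60 more years".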

The bottom line IMO is that saying utf8 to mean one thing and UTF-8 to 
mean a more restricted thing is outrageously wrong, although try as I 
might, I can't quite blame World War I on this decision ;)

On 05/20/2015 11:39 PM, demerphq wrote:
> On 21 May 2015 at 04:01, Karl Williamson <pub...@khwilliamson.com> wrote:
>> On 05/20/2015 06:44 PM, Tony Cook wrote:
>>>
>>> Oops, forgot to push the revert, I'll hold off on it for now.
>>>
>>> On Wed, May 20, 2015 at 07:16:39PM +0200, demerphq wrote:
>>>>
>>>> Sorry about that. I somehow feel like a party pooper for bringing this
>>>> up.
>>>>
>>>
>>> I may have overreacted, sorry.
>>>
>>> Here's the way I think about it:
>>>
>>> - unless we need to specifically distinguish between them (as Encode
>>>     does), calling perl's internal encoding UTF-8 is no big deal, since
>>>     its intent is to represent Unicode.  If we do need to distinguish
>>>     between them in perldelta then something like "perl's extended
>>>     UTF-8" is more useful to most readers than "utf8".
>>
>>
>> +1
>
> I don't agree really. This is a long held distinction.
>
>>
>>>
>>> - the name of the flag is SVf_UTF8, but it can be described as the
>>>     "UTF-8 flag", consider the comment in the source:
>>>
>>> #define SVf_UTF8        0x20000000  /* SvPV is UTF-8 encoded
>>>                                         This is also set on RVs whose overloaded
>>>                                         stringification is UTF-8. This might
>>>                                         only happen as a side effect of SvPV() */
>>>
>>>     Using "the UTF8 flag" seems silly to me - name it or describe it,
>>>     not something half-way between.
>
> I would argue the comment is wrong and should  be changed to "utf8".
>
>>>
>>> Here's the chunks and my rationale:
>>>
>>> -=head2 Better heuristics on older platforms for determining locale UTF8ness
>>> +=head2 Better heuristics on older platforms for determining locale UTF-8ness
>>>
>>>    On platforms that implement neither the C99 standard nor the POSIX 2001
>>> -standard, determining if the current locale is UTF8 or not depends on
>>> +standard, determining if the current locale is UTF-8 or not depends on
>>>    heuristics.  These are improved in this release.
>>>
>>> In this case we're talking about whether the locales support UTF-8 or
>>> not.  This has nothing to do with perl's internal SVf_UTF8 flag or
>>> internal encoding.
>>>
>>> I think it belongs.
>>
>>
>> +1
>
> No argument on this one.
>
>>
>>>
>>>    (D deprecated) The C<< /\C/ >> character class was deprecated in v5.20, and
>>>    now emits a warning. It is intended that it will become an error in v5.24.
>>>    This character class matches a single byte even if it appears within a
>>> -multi-byte character, breaks encapsulation, and can corrupt utf8
>>> +multi-byte character, breaks encapsulation, and can corrupt UTF-8
>>>    strings.
>>>
>>> This is probably a mistake if perldelta needs to distinguish utf8 vs
>>> UTF-8.
>>
>>
>> I don't think perldelta needs to so distinguish.  And in particular, the
>> above should be "UTF-8"
>
> I disagree.
>
>>
>>>
>>>    (W locale) While in a single-byte locale (I<i.e.>, a non-UTF-8
>>>    one), a multi-byte character was encountered.   Perl considers this
>>> -character to be the specified Unicode code point.  Combining non-UTF8
>>> +character to be the specified Unicode code point.  Combining non-UTF-8
>>>    locales and Unicode is dangerous.  Almost certainly some characters
>>>    will have two different representations.  For example, in the ISO 8859-7
>>>    (Greek) locale, the code point 0xC3 represents a Capital Gamma.  But so
>>> @@ -2133,7 +2133,7 @@ David Mitchell for future work on vtables.
>>>
>>> We're talking about whether locales are UTF-8 or not again, and the
>>> paragraph is inconsistent.
>>>
>>> I think it belongs.
>>
>>
>> +1
>
> I have no objection to this.
>
>>
>>>
>>> -Pad names are now always UTF8.  The C<PadnameUTF8> macro always returns
>>> +Pad names are now always UTF-8.  The C<PadnameUTF8> macro always returns
>>>    true.  Previously, this was effectively the case already, but any support
>>>    for two different internal representations of pad names has now been
>>>    removed.
>>>
>>> This might need to be "utf8" instead of "UTF8" under the canon
>>> according to Encode, but I think "UTF-8" is better.
>>
>>
>> UTF-8 is better.
>
> If we enforce that varnames must be valid UTF-8 (and I think we
> should) then fine. If we don't then not fine.
>
> For the record (Karl I know you know this), UTF-8 is both an encoding,
> and also a specification of which codepoints are legal. Not all utf8
> sequences are valid UTF-8. I think the distinction is important.
>
>>
>>>
>>> -In Perl 5.20.0, C<$^N> accidentally had the internal UTF8 flag turned off
>>> +In Perl 5.20.0, C<$^N> accidentally had the internal UTF-8 flag turned off
>>>
>>> Per my attitude above, I think this change is correct.
>>
>>
>> +1
>
> Disagree. It should be "utf8" unless $^N is guaranteed to contain valid UTF-8.
>>>    Or be "had the
>>>
>>> C<SVf_UTF8> flag turned off".
>>>
>>>    if accessed from a code block within a regular expression, effectively
>>> -UTF8-encoding the value.  This has been fixed.
>>> +UTF-8-encoding the value.  This has been fixed.
>>>    L<[perl #123135]|https://rt.perl.org/Ticket/Display.html?id=123135>.
>>>
>>> This would need to be "utf8-encoding".
>>
>>
>> I hate the sentence anyway.  It doesn't make intuitive sense that turning
>> off a flag is the same thing as 'encoding'.  To me 'encoding' and 'decoding'
>> have arbitrary non-intuitive meanings which I always have to look up.  It's
>> better to not use the terms, but say something that makes sense to most of
>> the readers who I don't believe have the definitions ingrained.
>
> I dont mind using a more descriptive sentence. I do mind conflating
> UTF-8 and utf8. I can just see someone saying "why does perl let me
> put surrogate pair code points in a UTF-8 string?".
>
>>
>>>
>>>    On some systems, such as VMS, C<crypt> can return a non-ASCII string.  If a
>>> -scalar assigned to had contained a UTF8 string previously, then C<crypt>
>>> -would not turn off the UTF8 flag, thus corrupting the return value.  This
>>> +scalar assigned to had contained a UTF-8 string previously, then C<crypt>
>>> +would not turn off the UTF-8 flag, thus corrupting the return value.  This
>>>    would happen with C<$lexical = crypt ...>.
>>>
>>> Under canon the first UTF8 was wrong and the second was correct.  I
>>> think they should both be "UTF-8".
>>
>>
>> +1
>
> Disagree.
>
>>
>>>
>>> -C<< s///e >> on tainted utf8 strings corrupted C<< pos() >>. This bug,
>>> +C<< s///e >> on tainted UTF-8 strings corrupted C<< pos() >>. This bug,
>>>    introduced in 5.20, is now fixed.
>>>    L<[perl #122148]|https://rt.perl.org/Ticket/Display.html?id=122148>.
>>>
>>> Correct under canon.
>>
>>
>> I prefer UTF-8.
>
> Disagree.
>
>>
>>>
>>> -Loading UTF8 tables during a regular expression match could cause assertion
>>> +Loading UTF-8 tables during a regular expression match could cause assertion
>>>    failures under debugging builds if the previous match used the very same
>>>    regular expression.
>>>    L<[perl #122747]|https://rt.perl.org/Ticket/Display.html?id=122747>
>>>
>>> This one may have been just plain incorrect.  If I understand
>>> correctly we load tables that map unicode code points to properties,
>>> not UTF-8 or perl-UTF-8 to properties.
>>>
>>> So this should refer to "Loading Unicode tables".
>>
>>
>> Yes
>>
>>
>> The bottom line is I think we should say UTF-8 in almost every circumstance.
>> The whole Encode thing was a big mistake that should be corrected in 5.24.
>> We now know the perils of not checking input UTF-8 for well-formedness,
>
> When you say that do you mean that the sequences are wellformed, or
> that the codepoints that they map to are properly validated?
>
> I  agree about sequence well formedness, i dont agree about
> validation. I consider the following  to be a perfectly valid program:
>
> perl -wle'my $s=chr(0x10000);'
>
> However $s will not contain UTF-8, but instead contain utf8.
>
>> and
>> at the time those decisions were made, those perils were not understood.  To
>> put it in terms currently in the news, we should issue a safety recall on
>> the Encode API in this regard.
>
> I think this is really going too far, and goes against *years* of
> practice in the Perl community. This is the age old argument about
> whether strings are arrays of Unicode Codepoints, or are they packed
> arrays of integers which happen to us the same encoding rules as that
> of UTF-8. I don't think we will ever settle that argument. And I don't
> think you can throw away a decade or more of this distinction just
> like that.
>
> In short, as long as we allow UTF-8 forbidden codepoints (eg,
> surrogate pairs and codepoints higher than Unicode allows) in our
> strings then I don't think we should call it UTF-8.
>
> cheers,
> Yves
>
