Re: [perl #122853] Guarantee 0-9, A-Z, a-z character classes

From: Karl Williamson <pub...@khwilliamson.com>

Wed, 29 Oct 2014 22:43:36 -0600

On 10/02/2014 01:30 AM, demerphq wrote:
> On 2 October 2014 05:41, Karl Williamson <pub...@khwilliamson.com> wrote:
>
>     On 09/29/2014 12:26 PM, demerphq wrote:
>
>              Any subset of the ranges [a-z] and [A-Z] is (and has been)
>              specially handled to match on EBCDIC platforms the same
>              equivalent characters it matches on ASCII platforms.  Hence
>              qr/[i-j]/i, matches [ijIJ] on both ASCII and EBCDIC platforms.
>
>         I think this is the problem. Why does this apply to [a-z] and [A-Z]
>         only? Why not to all literals?
>
>              The special handling is only valid if both ends of the range
>              are literals.  In EBCDIC, \xC9 is 'I' and \xD1 is 'J'.  If you
>              specify any of [\xC9-J], [I-\xD1] , or [\xC9-\xD1], you get all
>              the code points C9, CA, CB, CC, CD, CE, CF, and D1.  This is how
>              it has worked since apparently 5.005_03, and is how I think it
>              should continue to work.  In other words, I think we got the
>              design right.
>
>         For ranges involving non-literals I agree. But I don't think this
>         design is sane for literals.
>
>         In other words, I think a rule that said that "literals in character
>         classes will be interpreted according to the Unicode specification"
>         is a better rule than what you described.
>
>         I don't suppose we can change it now but the current rules seem
>         unnecessarily confusing.
>
>     I'm not sure I understand your point here.  [%] matches an ASCII
>     percent on an ASCII platform, and an EBCDIC percent on an EBCDIC
>     platform.  The code is perfectly portable.  All literal characters
>     match properly on both platforms, and would continue to do so if
>     Perl were ever ported to yet another platform.  (The odds of that
>     happening are infinitesimal, I realize.)
>
>     But there are only three cases where it is obvious what should be in
>     a range of literals.  Those are any subsets of A-Z, a-z, and 0-9.
>     Perl takes special action to handle those as DWIM.
>
>     The only other ASCII literal characters are punctuation and space.
>     There is no natural language intrinsic ordering of them, and hence
>     ranges with these as end points are obfuscations of what is really
>     happening.
>
> Whether or not they are an obfuscation is a personal aesthetic opinion.
> And since there are many natural language ordering of characters in A-Z
> I dont feel you are particularly firm ground suggesting there is
> something intrinsically more sensible about A-Z than %-{.
>
>     Perl need not take special efforts to handle obfuscated code.
>
> I think this is a terrible justification for the language not being well
> defined.
>
> I mean, this case is rather different from "The CPU does math in a
> different endianness than your code expects" type undefined behaviour
> that cannot be avoided.  With character class ranges the damage is self
> inflicted. I think that is sad an unnecessary.
>
>     I doubt that there is anybody on this list who knows immediately
>     what [%-{] matches, or [|-&].
>
> I dont think whether people offhand know how many characters are in the
> unicode character set [%-{] is relevant. The point is that once you
> looked it up you should be able to rely on it everywhere Perl runs. And
> if you took this kind of argument to the extreme it would lead to
> seriously bizarre consequences.
>
> Heck, Im not sure that many people could tell you how many characters
> there are between "P" and "W" off the top of their head, and I bet a lot
> of people from non-english backgrounds would *disagree* on the subject.
>
> IOW, I think the position you take differentiating between A-Z and %-{
> is rooted in the fact that you and ASCII share a common cultural
> background. If you were Icelandic you would expect to find "á" after
> "a", but ASCII doesn't do that. In fact strictly speaking ASCII can't
> even represent "á".
>
> So I think you are manufacturing a distinction between A-Z and %-{ that
> is not really there, and to the extent that it does exist, is culturally
> specific.
>
> I think that is a pretty terrible basis to decide that one part of a
> regex pattern is well defined and others are not.
>
>     These match differently on EBCDIC than ASCII.
>
> Yes, well that is the problem right? They are only poorly defined
> *because* they are different on EBCDIC and ASCII.
>
>     It would be too late to change this behavior, nor do I think it
>     would be desirable to do so.
>
> Yes, I suspect you are right. Sadly.
>
> On the other hand what would we do if we targeted a different platform
> that also used a different native character set? IMO we would be *nuts*
> to repeat this design decision for said hypothetical platform.
>
>     This from the docs you quoted is right: "A sound principle is to use
>     only ranges that begin from and end at either alphabetics of equal
>     case ([a-e], [A-E]), or digits ([0-9])"  Perl should support doing
>     that, but no more, at least in the ASCII range.
>
> In an ideal world we would delete that sentence and replace it with
> "character class ranges composed of literals are always interpreted
> according to the unicode standard, so [%-{] will always match 88
> characters regardless of native encoding, although the actual codepoints
> matched may differ from unicode where appropriate".
>
> IOW, the problem here is that when we ported the regex engine to EBCDIC
> we did not properly separate out "code points in the pattern as
> expressed as literals" and "native representation of those code
> points".  Which I suppose is natural given our EBCDIC port predates
> Unicode, but it is still unfortunate.
>
> I do not think we should have any platform specific behaviour other than
> that which is forced upon us.
>
> And I do not think it is good that a *scripting* language like Perl has
> portability issues which are not forced upon us.
>
> Yves
> --
> perl -Mre=debug -e "/just|another|perl|hacker/"

I agree that it would be nice to be able to portably specify ranges. 
But before I get to that, I have a couple of points to make, moot as 
they might be.

If one has to look up exactly what's in a range when coding, then that 
person is unfairly burdening whoever takes up the maintenance of that 
code in the future.

You may very well be right about my cultural bias about what's in A-Z. 
I've tried to imagine what I would think if my first language had had 
other characters, but I can't really.

But your idealized solution effectively says to people on EBCDIC that 
they have to use a foreign character set, and that is just as 
chauvinistic as my A-Z bias.  There are people who code solely on and 
for EBCDIC, and Perl should accommodate their native way of thinking. 
So \x04 has to mean the character whose code point is natively 4 on 
whatever platform the code is being run on.  If you want to specify the 
character whose *Unicode* code point is 4, you can use \N{U+04}.
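
As a minimal sketch of that distinction (the variable names are only for 
illustration), the following prints the native and Unicode code points of 
both spellings.  It uses utf8::native_to_unicode() to do the mapping; on an 
ASCII platform the two lines come out the same, and only on EBCDIC would 
they be expected to differ.

```perl
use strict;
use warnings;

# "\x04" is the character whose *native* code point is 4.
my $native = "\x04";

# "\N{U+04}" is the character whose *Unicode* code point is 4.
my $unicode = "\N{U+04}";

# utf8::native_to_unicode() maps a native code point to the corresponding
# Unicode code point; on ASCII platforms it changes nothing here.
printf "\\x04     -> native %#04x, Unicode U+%04X\n",
    ord($native), utf8::native_to_unicode(ord $native);
printf "\\N{U+04} -> native %#04x, Unicode U+%04X\n",
    ord($unicode), utf8::native_to_unicode(ord $unicode);
```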

But then what about this range?

	[\N{U+04}-\N{U+09}]

It seems obvious to me that what the coder meant is

	[\N{U+04}\N{U+05}\N{U+06}\N{U+07}\N{U+08}\N{U+09}]

But on EBCDIC it currently doesn't mean that; it is an error because 
\N{U+04} is 0x37 and \N{U+09} is 0x05, so we have a range whose first 
value is larger than the second value, which is not allowed.  I think 
this is a bug, and I propose to fix it.  The fix is not hard.  The 
principle is that a range specified, on any platform, in terms of 
Unicode end-points should follow Unicode rules.  That gives portability 
across all platforms.
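
A rough way to check the intended behavior, sketched here on the assumption 
that the range compiles (which, as described above, it would not on an 
EBCDIC perl without the fix): walk the Unicode code points 0 to 0xFF, 
convert each to its native equivalent with utf8::unicode_to_native(), and 
compare what the range matches with what the spelled-out list matches.

```perl
use strict;
use warnings;

my $range    = qr/\A[\N{U+04}-\N{U+09}]\z/;
my $explicit = qr/\A[\N{U+04}\N{U+05}\N{U+06}\N{U+07}\N{U+08}\N{U+09}]\z/;

# Walk the first 256 Unicode code points, convert each to the platform's
# native code point, and record which ones each class matches.
my (@by_range, @by_list);
for my $u (0 .. 0xFF) {
    my $char = chr utf8::unicode_to_native($u);
    push @by_range, $u if $char =~ $range;
    push @by_list,  $u if $char =~ $explicit;
}

print "range:    @by_range\n";
print "explicit: @by_list\n";
```

Under the rule proposed here, both lines should list the same code points 
on every platform.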

By extension, I think that the Unicode name syntax should behave 
identically to the U+ syntax.  The above range could be specified using 
that syntax as

	[\N{EOT}-\N{HT}]

and should include EOT (4 on ASCII), HT (9 on ASCII), plus U+05..U+08 
(ENQ, ACK, BEL and BS; 5, 6, 7 and 8 respectively in ASCII).

So, by specifying a range in Unicode terminology, one could get the 
portability Yves wants.  [\N{PERCENT SIGN}-\N{LEFT CURLY BRACKET}] would 
match the same characters on all platforms that [%-{] does on ASCII.
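
A sketch of how one might test that claim: build the expected set from the 
Unicode code points U+0025 through U+007B, converted to native characters, 
and confirm the named range matches every one of them.  The explicit 
"use charnames" line is an assumption for older perls that do not load 
names automatically.

```perl
use strict;
use warnings;
use charnames ':full';   # only needed for \N{NAME} on older perls

my $named = qr/\A[\N{PERCENT SIGN}-\N{LEFT CURLY BRACKET}]\z/;

# Unicode code points U+0025 (%) through U+007B ({), converted to native.
my @expected = map { chr utf8::unicode_to_native($_) } 0x25 .. 0x7B;

# Any expected character the named range fails to match.
my @missed = grep { $_ !~ $named } @expected;

printf "%d of %d expected characters matched\n",
    @expected - @missed, scalar @expected;
```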

The remaining question I have is what happens if only one end of the 
range is a Unicode construct?

	[\N{U+04}-\x{09}]
	[\x{04}-\N{U+09}]

I think this should be deprecated and, in the meantime, the non-Unicode 
endpoint should be treated as a Unicode code point.  There are currently 
no such usages on CPAN.  In fact, only 2 modules use \N{} in ranges, and 
both appear to want the behavior I'm proposing here.

http://grep.cpan.me/?q=\[.*\\N{[^}]*}-+-file%3A%22\.pod%24%22
http://grep.cpan.me/?q=-\\N{[^}]*}+-file%3A%22\.pod%24%22
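
For checking one's own code for such mixed ranges, something like the 
following crude scanner might do.  It is only a sketch: mixed_ranges() is a 
hypothetical helper, it looks at pattern source text without confirming 
that a range really sits inside a bracketed class, and it is not what the 
CPAN searches above ran.

```perl
use strict;
use warnings;

# One range endpoint: \N{...}, \x{...}, \xHH, any other escape, or a
# single ordinary character.
my $endpoint = qr/\\N\{[^}]*\}|\\x\{[^}]*\}|\\x[0-9A-Fa-f]{1,2}|\\.|[^\\\]-]/;

# Hypothetical helper: returns the ranges in $source where exactly one
# endpoint is written with \N{...}.
sub mixed_ranges {
    my ($source) = @_;
    my @found;
    while ($source =~ /($endpoint)-($endpoint)/g) {
        my ($lo, $hi) = ($1, $2);
        my $lo_unicode = $lo =~ /\A\\N\{/ ? 1 : 0;
        my $hi_unicode = $hi =~ /\A\\N\{/ ? 1 : 0;
        push @found, "$lo-$hi" if $lo_unicode != $hi_unicode;
    }
    return @found;
}

print "$_\n"
    for mixed_ranges('[\N{U+04}-\x{09}] [\x{04}-\N{U+09}] [\N{U+04}-\N{U+09}] [a-z]');
```

Only the first two classes in the sample string are reported; the third has 
Unicode endpoints on both sides and the fourth has none.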
