[perl #123469] Bug in split function, with utf8 strings

From: Rostislav via RT <perl...@perl.org>
Sun, 21 Dec 2014 03:16:20 -0800
It seems the message was too large and is not shown in the web interface. Here is the main information; the perl -V output is in the previous message.

I have observed buggy behaviour of the built-in 'split' function under certain conditions. It is triggered when the PATTERN contains UTF8 characters from the Latin-1 Supplement block and EXPR is a non-UTF8 (ASCII-only) string. After that, subsequent calls to 'split' produce erroneous results.

In the example below, the first and the last iterations of the 'for' loop are supposed to produce the same result, but the last result differs once 'split' has been called as described above.

In addition, I have observed that in Perl 5.14 and earlier versions the buggy behaviour is also triggered when the PATTERN contains any UTF8 characters and EXPR is an ASCII-only string.

[12:46] u...@debian7 ~/test/split $ cat split.pl
# this file is encoded in UTF8, obviously
use strict;
use warnings;
use utf8;

use Data::Dumper;

sub main {
    my $split_chr = 'ä';
    my $good = "a${split_chr}b";
    my $bad = 'aab';
    for my $str ($good, $bad, $good) {
        print "Splitting: $str by pattern $split_chr; is_utf8: "
        . utf8::is_utf8($str) . "\n";
        my @sp = split /$split_chr/, $str;
        print Dumper(\@sp);
    }
}

binmode STDOUT, ':utf8';
main;

[12:45] u...@debian7 ~/test/split $ perl5.20.1 split.pl
Splitting: aäb by pattern ä; is_utf8: 1
$VAR1 = [
          'a',
          'b'
        ];
Splitting: aab by pattern ä; is_utf8:
$VAR1 = [
          'aab'
        ];
Splitting: aäb by pattern ä; is_utf8: 1
$VAR1 = [
          "a\x{e4}b"
        ];
[12:41] u...@debian7 ~/test/split $ perlbrew exec perl split.pl
perl-5.14.4
==========
Splitting: aäb by pattern ä; is_utf8: 1
$VAR1 = [
          'a',
          'b'
        ];
Splitting: aab by pattern ä; is_utf8:
$VAR1 = [
          'aab'
        ];
Splitting: aäb by pattern ä; is_utf8: 1
$VAR1 = [
          "a\x{e4}b"
        ];


perl-5.21.6
==========
Splitting: aäb by pattern ä; is_utf8: 1
$VAR1 = [
          'a',
          'b'
        ];
Splitting: aab by pattern ä; is_utf8:
$VAR1 = [
          'aab'
        ];
Splitting: aäb by pattern ä; is_utf8: 1
$VAR1 = [
          "a\x{e4}b"
        ];


---
via perlbug:  queue: perl5 status: new
https://rt.perl.org/Ticket/Display.html?id=123469
