
[perl #66852] regexp very slow on UTF8 (over 100.000 times slower than without UTF8)

From: Victor Efimov via RT <perl...@perl.org>
Mon, 20 Oct 2014 06:21:47 -0700
I tested, and it seems the issue is fixed in 5.20. Should it be backported to 5.18?

On Wed Jun 24 15:17:08 2009, p...@nevcal.com wrote:
> On approximately 6/24/2009 12:28 PM, came the following characters from
> the keyboard of demerphq:
> > 2009/6/24 Glenn Linderman <p...@nevcal.com>:
> >> On approximately 6/24/2009 9:28 AM, came the following characters from
> >> the keyboard of demerphq:
> >>> 2009/6/24 Andreas Hansson <Andr...@axis.com>:
> >>>> testing with utf8=ON...
> >>>> Matching REx " (?:.*)------" against "Abra ka dabra%n ----- /var/log/messages -----%n"
> >>>> UTF-8 string...
> >>>>  4 <Abra> < ka dabra>      |  1:EXACT < >(3)
> >>>>  5 <Abra > <ka dabra%n>    |  3:STAR(5)
> >>>>                                 SANY can match 40 times out of 2147483647...
> >>>> [.....]
> >>>> Match failed
> >>>> Time 0 seconds
> >>>>
> >>>> When the data is non-UTF8, the match is immediately rejected by the
> >>>> optimizer.
> >>>>
> >>>> When the data is in UTF8, the match is not rejected by the optimizer
> >>>> but actually tested by the engine.
> >>>>
> >>> Just thinking aloud for the sake of the ticket, but I think one might
> >>> argue that this is a bug.
> >>>
> >>> Why should utf8 affect case-insensitive matching? All one has to do is
> >>> (up|down)grade the strings first and then use their literal bytes as
> >>> the string to search for, and it would be fast.
> >>>
> >>> So, er, why don't we do that, I wonder...
> >>
> >> I've wondered that for a long time.  The arguments I've heard include:
> >>
> >> 1) Compatibility... upgrading the user's string might break something
> >> else that expects it to stay non-utf8, and memory usage might go up
> >> significantly.
> >>
> >> 2) Performance... upgrading the user's string to a temporary (to avoid
> >> #1) is slow for large strings, and may not help the memory usage
> >> problem (which may be coincident).
> >>
> >> 3) Not all strings are text; some may even be combinations of binary
> >> and text; upgrading may destroy invariants for other parts of the
> >> program that don't properly deal with strings as strings, but rather
> >> expect them to be octets (which should be considered a bug these days,
> >> but wasn't always, and the compatibility police are concerned).
> >
> > Actually, my point was slightly different here. The strings I'm talking
> > about are the strings stored by the regex engine: the ones used to find
> > the "must match" strings so we fail fast. These are generally short
> > snippets, they are not user-exposed, and the string being matched
> > against would not change.
> >
> > So for instance the pattern in the two cases we are discussing here is
> > the same; it is only the utf8ness of the string being searched that
> > differs, and it actually turns out that the constant string we are
> > searching for will be represented the same as it is in utf8, so in this
> > case there is actually no reason to disable that optimisation.
> >
> > Or am I missing a subtlety here that you covered?
>
> No, I overlooked the subtlety that you were discussing!
>
> For internal strings stored and used by the regex engine in the compiled
> form, it seems to me that there would be no user reason not to store
> whatever is useful.  I suppose it would be possible to create extremely
> large match strings, but I'm not sure why one would do that... I've never
> seen an application read several hundred KB of text and use it as part of
> a regex match string :)
>
> So under the assumption that the strings are short, and are internal,
> then it seems there are these cases:
>
> 1) match against non-utf8 strings
> 2) match against utf8 strings
> 3) match against both
> 4) (for completeness only) match against neither
>
> So then there are two sub-cases:
>
> a) the internal string has the same representation in utf8 and non-utf8
> b) the internal string has different representations in utf8 and non-utf8
>
> regex was initially designed for 1a only, I presume.  When utf8 came
> along, so did all the other cases.
>
> Seems like for cases 1a, 2a, and 3a, the match should be done as octets.
> This is unconditional because it is subcase a.
>
> Handling the b subcases is more complex.  Either there has to be logic to
> convert the internal string(s) to the same representation as the match
> string as a preprocessing step, or both representations could be saved,
> and the appropriate one chosen based on the representation of the match
> string.
>
> Of course, handling things like case-insensitivity couldn't be done that
> way, but this issue doesn't involve that.
>
> Should I guess that the current reason is that the distance in characters
> between the two exact matches cannot, in the utf8 match string case, be
> easily computed from the length in bytes, for setting $+[0] and $-[0],
> so the optimizer rejects the shortcut?




---
via perlbug:  queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=66852

Recent Messages in this Thread
Victor Efimov via RT Oct 20, 2014 01:21 pm
Dave Mitchell Oct 20, 2014 03:15 pm