Problem timing out XML::LibXML parse_html_string call

From: Sam Tregar <s...@tregar.com>
Tue, 3 Feb 2009 14:44:02 -0500
Hello all.  I'm using XML::LibXML to parse some HTML.  Mostly it's working great - fast, with very useful XPath support.  My problem is that it's choking on some very bad HTML in a very bad way - it sits on the CPU until killed manually.  I expected that some HTML wouldn't parse, so that in itself isn't such a tragedy.  The big problem is that my attempts to work around this with alarm() aren't working!

Here's my code:

use strict;
use warnings;
use XML::LibXML;

my $html = do { local $/; <> };

my $libxml = XML::LibXML->new();
#$libxml->recover(2);

eval {
    local $SIG{ALRM} = sub { die "TIMEOUT\n" };
    alarm(10);
    $libxml->parse_html_string($html);
    alarm(0);
};
if ($@ and $@ eq "TIMEOUT\n") {
    warn "Timed out ok.\n";
} elsif ($@) {
    die $@;
}

If I replace the parse call with sleep(20) then it works as expected - the alarm triggers and the timeout is caught.  If I run it as-is with my sample HTML then it never stops until killed.  If you want to play along at home here's the test file:

http://sam.tregar.com/libxml-fail.html

BEWARE: that's some really bad HTML and it not only breaks XML::LibXML but it also crashed Firefox on me.  You probably don't want to load it in your browser.
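
For comparison, here is a minimal sketch of the sleep() variant described above, where the same eval/alarm pattern does fire.  The timings are shortened so it finishes quickly, and the helper name run_with_timeout is made up for illustration - it is not part of XML::LibXML or of my original script.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): run a code ref
# under alarm() and report whether it finished or timed out.
sub run_with_timeout {
    my ($seconds, $code) = @_;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "TIMEOUT\n" };
        alarm($seconds);
        $code->();
        alarm(0);
        1;
    };
    my $err = $@;    # save the error before doing anything else
    alarm(0);        # make sure no alarm is left pending
    return 'ok'      if $ok;
    return 'timeout' if $err eq "TIMEOUT\n";
    die $err;        # some other failure: re-throw it
}

# sleep() stands in for the parse call here; the alarm interrupts it.
my $result = run_with_timeout(1, sub { sleep(3) });
print "Result: $result\n";
```

Swapping the sleep(3) body for the parse_html_string call on the test file is the case that never returns for me.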

I've never had alarm() fail like this.  Is there an alternative I can try?  Any other ideas about how to handle this?

Thanks!
-sam
