
Re: Problem timing out XML::LibXML parse_html_string call

From: Aaron Crane <p...@aaroncrane.co.uk>
Tue, 3 Feb 2009 21:11:18 +0000
Sam Tregar writes:
> I'm using XML::LibXML to parse some HTML.  Mostly it's working great
> - fast and very useful XPath support.  My problem is that it's
> choking on some very bad HTML in a very bad way - it's sitting on
> the CPU until killed manually.  I expected some HTML wouldn't parse,
> so this isn't such a tragedy.  What is a big problem is that my
> attempt to work around this with alarm() aren't working!

The problem with handling signals in Perl is that they happen
asynchronously.  If a signal is delivered while the Perl interpreter
is executing an op, the code in the Perl-level signal handler might
attempt to modify interpreter state in a way that will cause later
crashes.

Perl 5.8 introduced "safe signals" to alleviate this problem.  The
approach is to have the OS-level signal handler merely set a flag
indicating that the signal has been received.  Then the interpreter
checks the flags at safe points (between ops, effectively), and
invokes your Perl-level handler at that point, when it's known to be
safe.

The only problem with this scheme is that if an op goes into an
infinite loop, the Perl-level signal handler never gets invoked.
That's very unlikely for regular ops in stable releases of Perl, but
a call to an XS function -- a single op -- might ultimately fall into
an infinite loop.  And that's what's happening here; libxml2 (or
perhaps the XS component of XML::LibXML) has an infinite-loop bug, so
your signal handler never gets invoked.
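
For reference, the usual alarm-based timeout idiom looks roughly like
the sketch below.  The helper name with_timeout is made up for
illustration, and the busy loop is only a stand-in for slow Perl-level
code.  Because that loop runs many ops, the deferred handler gets a
chance to run between them; a single XS call that never returns would
give it no such chance.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code, giving up after $seconds.
# Returns (1, $result) on success, or (0, $error) on timeout or failure.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    my $err = $@;
    alarm 0;    # make sure no alarm is left pending if $code died
    return $ok ? (1, $result) : (0, $err);
}

# A Perl-level busy loop: many ops, so the safe-signal check is reached.
my ($ok, $err) = with_timeout(1, sub { my $n = 0; $n++ while 1 });
print $ok ? "finished\n" : "gave up: $err";
```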

You can switch back to the pre-5.8 signal-handling behaviour by
setting the environment variable PERL_SIGNALS to 'unsafe'.  The
variable must already be set when Perl starts executing; setting it
from inside your code has no effect on the running interpreter.  For
example, using env(1):

    $ env PERL_SIGNALS=unsafe perl your_program.pl

If it's not possible for you to put an appropriate wrapper round your
program, something along these lines might help, if placed suitably
early in your code:

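    BEGIN {
        # PERL_SIGNALS is only read at interpreter startup, so set it
        # and re-run this script under the same perl binary.
        if (!$ENV{PERL_SIGNALS} || $ENV{PERL_SIGNALS} ne 'unsafe') {
            $ENV{PERL_SIGNALS} = 'unsafe';
            exec $^X, $0, @ARGV
                or die "Cannot re-exec $^X: $!\n";
        }
    }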

See also `perldoc perlipc` and search for "safe signals".
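
That section of perlipc also describes a narrower alternative, which
may be worth a look: a handler installed through POSIX::sigaction is
invoked immediately rather than deferred, so only that one signal
becomes "unsafe".  The following is a minimal sketch under that
assumption.  The helper name with_immediate_timeout is hypothetical,
and sleep is only a stand-in for the call that hangs.  The same caveat
applies as for PERL_SIGNALS=unsafe: the handler can run in the middle
of an op, so it may leave the interpreter in a bad state.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper: run $code with an ALRM handler installed via
# POSIX::sigaction, which is called straight away instead of waiting
# for the next safe point.  The handler stays installed afterwards.
sub with_immediate_timeout {
    my ($seconds, $code) = @_;
    POSIX::sigaction(SIGALRM,
                     POSIX::SigAction->new(sub { die "timeout\n" }))
        or die "Error setting SIGALRM handler: $!\n";
    my $ok = eval {
        alarm $seconds;
        $code->();
        alarm 0;
        1;
    };
    my $err = $@;
    alarm 0;    # make sure no alarm is left pending if $code died
    return $ok ? (1) : (0, $err);
}

# sleep stands in for the hanging call.
my ($done, $why) = with_immediate_timeout(1, sub { sleep 5 });
print $done ? "finished\n" : "gave up: $why";
```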

-- 
Aaron Crane ** http://aaroncrane.co.uk/
