
Re: libxml and (X)HTML documents

From: Christian Glahn <chri...@uibk.ac.at>
Thu, 11 Jul 2002 10:48:38 +0200
On Wed, Jul 10, 2002 at 11:50:47PM -0400, Aaron Straup Cope wrote:
> Hi all,
>
> Can someone help me understand how exactly libxml deals with HTML file
> and, more specifically, XHTML files?

libxml2 has a special parser that is able to deal with HTML tags that
are never closed. this parser differs from the standard parser by
detecting special tags such as <p> and <img> that are commonly just
opened. XML::LibXML provides a special interface to this parser
extension through the parse_html_* functions.
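
a minimal sketch of that interface, assuming XML::LibXML is installed. the
sample markup and the variable names are made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# tag soup: neither <p> nor <img> is ever closed
my $html = '<html><body><p>first<p>second<img src="a.png"></body></html>';

# the HTML parser closes the open elements on its own
my $doc = $parser->parse_html_string($html);

# count the paragraphs the parser ended up with
my @paragraphs = $doc->findnodes('//p');
print scalar(@paragraphs), " paragraphs found\n";
```

feeding the same string to parse_string() should fail instead, since it is
not well-formed XML.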

XHTML is a bit different, since it is XML with a language binding.
because of that you should parse XHTML with the common parse_*
functions, so that well-formedness errors are reported. because of the
nature of XHTML, libxml2 does not provide any special interface to parse
those files (as far as i know).

so basically libxml2 uses the same parser for XML and HTML data, where
the HTML parser is just a special case of the more general implementation
of the XML parser.

> I can understand treating HTML files as "special" but it appears that
> XHTML files are lumped in with the bad apples even though there isn't any
> reason for them to be.

to parse XHTML files, strings, file handles ... use XML::LibXML's parse_file(),
parse_string() or parse_fh() instead of their parse_html_* relatives.
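
a small sketch of what that buys you, assuming XML::LibXML is installed. the
two XHTML strings are invented samples; the second one leaves <p> open on
purpose:

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

my $good = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok</p></body></html>';
my $bad  = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>oops</body></html>';

# parse_string() dies on input that is not well-formed, so wrap it in eval
my $good_doc = eval { $parser->parse_string($good) };
my $bad_doc  = eval { $parser->parse_string($bad) };
my $error    = $@;

print defined $good_doc ? "first document parsed\n" : "first document failed\n";
print defined $bad_doc  ? "second document parsed\n" : "second document failed: $error\n";
```

the HTML parser would silently repair the second string, which is exactly
why it is the wrong tool for checking XHTML.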
 
> If it's just another thing on the 'to-do' list then I can deal. But, I've
> had to jump through all kinds of hoops (see below) to get all the widgets
> used by, and including, XML::Filter::XSLT to munge one XHTML document into
> another in a SAX context.

from what i can see you do a bit too much work, but see below :)

> It's done so I'm happy enough but it seems completely nuts to have to go
> these lengths.
>
> Thanks,
>
> # in package Aaron::XML::Filter::XSLT
>
> sub end_document {
>     my $self = shift;
>
>     # because "IMA" XML::Filter::XSLT so calling
>     # SUPER would make bad things happen
>
>     my $dom = $self->XML::LibXML::SAX::Builder::end_document(@_);

ok, you get an XML::LibXML::Document here.

>
>     # Gah! In a plain old XML::LibXSLT situation I can
>     # call parse_html_file, but since ::SAX::Builder calls
>     # $obj->createDocument() there doesn't seem to be anything
>     # else but to do the following...
>
>     my $parser = XML::LibXML->new();

since you already have a document, the following line is useless.
the different document types XML_DOCUMENT_NODE and XML_HTML_DOCUMENT_NODE
basically only matter for data output. i assume you don't really need
a separate parse step here.

*IMPORTANT* 
i think this extra parse causes your headaches. so try to avoid it.
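
to illustrate the two document types, here is a rough sketch, assuming
XML::LibXML is installed. the markup is an invented sample that happens to
be acceptable to both parsers:

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);   # imports the XML_*_NODE constants

my $parser = XML::LibXML->new();
my $markup = '<html><body><p>hello</p></body></html>';

# same input, once through the XML parser and once through the HTML parser
my $xml_doc  = $parser->parse_string($markup);
my $html_doc = $parser->parse_html_string($markup);

# the document node type records which parser built the tree
print "XML parser gave an XML document node\n"
    if $xml_doc->nodeType == XML_DOCUMENT_NODE;
print "HTML parser gave an HTML document node\n"
    if $html_doc->nodeType == XML_HTML_DOCUMENT_NODE;
```
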

>     $dom = $parser->parse_html_string($dom->toString());
>
>     my $xslt       = XML::LibXSLT->new();
>     my $stylesheet = $xslt->parse_stylesheet($self->{StylesheetDOM});

remember that string values passed as XSLT params have to be quoted for
XSLT, but you may already have done so.
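
a short sketch of the quoting, assuming XML::LibXML and XML::LibXSLT are
installed. the stylesheet, the "title" param and its value are hypothetical
and only there to have something to run:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# hypothetical stylesheet that just prints the value of its "title" param
my $style_doc = XML::LibXML->new->parse_string(<<'XSL');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="title"/>
  <xsl:template match="/">
    <xsl:value-of select="$title"/>
  </xsl:template>
</xsl:stylesheet>
XSL

my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style_doc);
my $source     = XML::LibXML->new->parse_string('<doc/>');

# param values are evaluated as XPath expressions, so a plain string
# needs quotes around it; xpath_to_string() adds them
my %params  = XML::LibXSLT::xpath_to_string(title => 'My Page');
my $results = $stylesheet->transform($source, %params);

my $output = $stylesheet->output_string($results);
print $output, "\n";
```

passing title => 'My Page' without the quoting would make libxslt try to
evaluate My Page as an XPath expression rather than treat it as a string.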

>     my $results = $stylesheet->transform($dom,((ref($self->{'__params'})
> eq "ARRAY") ? @{$self->{'__params'}} : ()));
>
>     # see earlier note to list on same subject [1]
>     # this subclass basically does the following :
>     # "You say HTML_DOCUMENT, I say X(HT)ML_DOCUMENT"

the node type of the result document depends on the output type of the
XSL stylesheet itself. in their core document structure the two types don't
differ, so you don't need to bother, especially since the current version
of libxslt doesn't support the output type xhtml.

>     my $parser = Aaron::XML::LibXML::SAX::Parser->new(%$self);

if you use the following function as shipped by XML::LibXML, it should
work with XML documents as well as with HTML documents, so the SAX
generation should work fine.

>     $parser->generate($results);
> }

i hope this helps you a bit.

christian
