
Sometimes you want to work with the form of an XML document rather than with the structural information it contains: for instance, to change a batch of entity references or element names. And sometimes you have slightly malformed XML that a traditional parser will choke on. In either case you want an XML lexer, or "shallow parser". This recipe is a Python implementation of one.

Python, 100 lines
import re

class recollector:
    def __init__(self):
        self.res = {}
    def add(self, name, reg):
        re.compile(reg)  # check that the expression is valid
        self.res[name] = reg % self.res

collector = recollector()
a = collector.add

a("TextSE" , "[^<]+")
a("UntilHyphen" , "[^-]*-")
a("Until2Hyphens" , "%(UntilHyphen)s(?:[^-]%(UntilHyphen)s)*-")
a("CommentCE" , "%(Until2Hyphens)s>?") 
a("UntilRSBs" , "[^\\]]*](?:[^\\]]+])*]+")
a("CDATA_CE" , "%(UntilRSBs)s(?:[^\\]>]%(UntilRSBs)s)*>" )
a("S" , "[ \\n\\t\\r]+")
a("NameStrt" , "[A-Za-z_:]|[^\\x00-\\x7F]")
a("NameChar" , "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]")
a("Name" , "(?:%(NameStrt)s)(?:%(NameChar)s)*")
a("QuoteSE" , "\"[^\"]*\"|'[^']*'")
a("DT_IdentSE" , "%(S)s%(Name)s(?:%(S)s(?:%(Name)s|%(QuoteSE)s))*" )
a("MarkupDeclCE" , "(?:[^\\]\"'><]+|%(QuoteSE)s)*>" )
a("S1" , "[\\n\\r\\t ]")
a("UntilQMs" , "[^?]*\\?+")
a("PI_Tail" , "\\?>|%(S1)s%(UntilQMs)s(?:[^>?]%(UntilQMs)s)*>" )
a("DT_ItemSE" ,
    "<(?:!(?:--%(Until2Hyphens)s>|[^-]%(MarkupDeclCE)s)|\\?%(Name)s(?:%(PI_Tail)s))|%%%(Name)s;|%(S)s"
)
a("DocTypeCE" ,
"%(DT_IdentSE)s(?:%(S)s)?(?:\\[(?:%(DT_ItemSE)s)*](?:%(S)s)?)?>?" )
a("DeclCE" ,
    "--(?:%(CommentCE)s)?|\\[CDATA\\[(?:%(CDATA_CE)s)?|DOCTYPE(?:%(DocTypeCE)s)?")
a("PI_CE" , "%(Name)s(?:%(PI_Tail)s)?")
a("EndTagCE" , "%(Name)s(?:%(S)s)?>?")
a("AttValSE" , "\"[^<\"]*\"|'[^<']*'")
a("ElemTagCE" ,
    "%(Name)s(?:%(S)s%(Name)s(?:%(S)s)?=(?:%(S)s)?(?:%(AttValSE)s))*(?:%(S)s)?/?>?")

a("MarkupSPE" ,
    "<(?:!(?:%(DeclCE)s)?|\\?(?:%(PI_CE)s)?|/(?:%(EndTagCE)s)?|(?:%(ElemTagCE)s)?)")
a("XML_SPE" , "%(TextSE)s|%(MarkupSPE)s")
a("XML_MARKUP_ONLY_SPE" , "%(MarkupSPE)s")


def lexxml(data, markuponly=0):
    if markuponly:
        reg = "XML_MARKUP_ONLY_SPE"
    else:
        reg = "XML_SPE"
    regex = re.compile(collector.res[reg])
    return regex.findall(data)

def assertlex(data, numtokens, markuponly=0):
    tokens = lexxml(data, markuponly)
    assert len(tokens) == numtokens, \
        "data = '%s', numtokens = '%s'" % (data, numtokens)
    if not markuponly:
        assert "".join(tokens) == data
    walktokens(tokens)

def walktokens(tokens):
    print()
    for token in tokens:
        if token.startswith("<"):
            if token.startswith("<!"):
                print("declaration:", token)
            elif token.startswith("<?xml"):
                print("xml declaration:", token)
            elif token.startswith("<?"):
                print("processing instruction:", token)
            elif token.startswith("</"):
                print("end-tag:", token)
            elif token.endswith("/>"):
                print("empty-tag:", token)
            elif token.endswith(">"):
                print("start-tag:", token)
            else:
                print("error:", token)
        else:
            print("text:", token)

def testlexer():
    # this test suite could be larger!
    assertlex("<abc/>", 1)
    assertlex("<abc><def/></abc>", 3)
    assertlex("<abc>Blah</abc>", 3)
    assertlex("<abc>Blah</abc>", 2, markuponly=1)
    assertlex("<?xml version='1.0'?><abc>Blah</abc>", 3, markuponly=1)
    assertlex("<abc>Blah&foo;Blah</abc>", 3)
    assertlex("<abc>Blah&foo;Blah</abc>", 2, markuponly=1)
    assertlex("<abc><abc>", 2)
    assertlex("</abc></abc>", 2)
    assertlex("<abc></def></abc>", 3)

if __name__=="__main__":
    testlexer()

A traditional XML parser does a few tasks at once:

  1. it breaks up the stream of text into logical components (tags, text, processing instructions, etc.)

  2. it ensures that these structures are used in accordance with the XML spec.

  3. it throws away "extra" characters and reports only the significant content. For instance, it reports tag names but not the less-than and greater-than signs around them.

This "shallow parser" does only the first task. It just breaks up the document and presumes that you know how to deal with the broken up bits yourself. That makes it very efficient and very "forgiving" of errors in the document.
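The forgiving behavior is easy to see even with a drastically simplified shallow lexer. The sketch below is not the recipe's pattern (it ignores comments, CDATA sections, and DOCTYPE declarations, which the real XML_SPE handles); it just shows the lex-and-rejoin idea on malformed input:

```python
import re

# A toy one-pattern shallow lexer: either a run of text, or a tag
# whose closing ">" may be missing. A simplification of XML_SPE.
SIMPLE_SPE = re.compile(r"[^<]+|<[^>]*>?")

def simple_lex(data):
    return SIMPLE_SPE.findall(data)

# Malformed input: the end-tag is never closed. A validating parser
# would raise an error; the lexer just hands back the pieces, and
# joining them reproduces the input exactly.
tokens = simple_lex("<abc>Blah</abc")
print(tokens)  # ['<abc>', 'Blah', '</abc']
assert "".join(tokens) == "<abc>Blah</abc"
```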

The lexxml function is the entry point. Just call lexxml(data) to get back a list of "tokens" (bits of the document).

The lexer also makes it very easy to get back the original content of the document exactly. Unless there is a bug, the following code should always succeed:

tokens = lexxml(data)
data2 = "".join(tokens)
assert data == data2

If you find any bugs that disallow this, please report them.

There is a second, optional argument to lexxml that allows you to only get back markup and ignore the text of a document. This is useful as a performance optimization when you only care about tags.
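The shape of markup-only output can be sketched like this (again with a simplified pattern standing in for the recipe's XML_MARKUP_ONLY_SPE; note the real option skips text in the regex itself rather than filtering afterwards, which is where the performance win comes from):

```python
import re

# Simplified stand-in for lexxml(data, markuponly=1): keep only the
# tokens that are markup, dropping the character data between tags.
def markup_only(data):
    return [t for t in re.findall(r"[^<]+|<[^>]*>", data)
            if t.startswith("<")]

print(markup_only("<a>some text</a><b/>"))  # ['<a>', '</a>', '<b/>']
```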

The walktokens function shows how to walk over the tokens and work with them.
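The element-renaming task mentioned in the introduction comes down to the same kind of token walk: rewrite the start- and end-tag tokens, pass everything else through, and join the list back together. A minimal sketch, using a simplified tokenizer in place of lexxml and a hypothetical rename_element helper:

```python
import re

def simple_lex(data):
    # Stand-in for lexxml(); returns the same kind of token list.
    return re.findall(r"[^<]+|<[^>]*>", data)

def rename_element(tokens, old, new):
    # Rewrite tokens beginning "<old" or "</old"; text and other tags
    # pass through untouched. (A real version should also check that
    # the name ends there, so "abcd" is not caught when renaming "abc".)
    out = []
    for tok in tokens:
        if tok.startswith("<" + old) or tok.startswith("</" + old):
            tok = tok.replace(old, new, 1)
        out.append(tok)
    return out

doc = "<abc>Blah<def/></abc>"
print("".join(rename_element(simple_lex(doc), "abc", "xyz")))
# <xyz>Blah<def/></xyz>
```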

All of this work is based upon this paper:

Robert D. Cameron. REX: XML Shallow Parsing with Regular Expressions. Markup Languages: Theory and Applications, Summer 1999, pp. 61-88. http://www.cs.sfu.ca/~cameron/REX.html

The regular expressions in the recipe were translated from Perl to Python.

Created by Paul Prescod on Tue, 12 Jun 2001 (PSF)