Credit: Paul Prescod
It’s not uncommon to want to work with the form of an XML document rather than with the structural information it contains (e.g., to change a bunch of entity references or element names). The XML may be slightly incorrect, enough to choke a traditional parser. In such cases, you need an XML lexer, also known as a shallow parser.
You might be tempted to hack together a regular expression or two to do some simple parsing of XML (or another structured text format) rather than using the appropriate library module. Don't: getting the regular expressions right is not a trivial task! Fortunately, the hard work has already been done for you in Example 12-1, which contains already-debugged regular expressions and supporting functions for shallow-parsing tasks on XML data, and, more importantly, on data that is almost, but not quite, correct XML, the kind that makes a real XML parser seize up with error diagnostics when you try to parse it.
A traditional XML parser does a few tasks:
It breaks up the stream of text into logical components (tags, text, processing instructions, etc.).
It ensures that these components comply with the XML specification.
It throws away extra characters and reports the significant data. For instance, it would report tag names but not the less-than and greater-than signs around them.
The shallow parser in Example 12-1 performs only the first task. It breaks up the document and presumes that you know how to deal with the fragments yourself. That makes it efficient and forgiving of errors in the document.
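To make this concrete, here is a small sketch (the sample document is made up) of what the shallow parser hands back, assuming the lexxml function from Example 12-1 below is available in the current namespace:

doc = "<a href='x'>Hello &amp; goodbye</a>"
print lexxml(doc)
# A traditional parser would report the tag name 'a', the attribute
# href='x', and the text with the entity reference resolved. The
# shallow parser just returns the raw fragments, markup characters
# and all:
# ["<a href='x'>", 'Hello &amp; goodbye', '</a>']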
The lexxml function is the code's entry point. Call lexxml(data) to get back a list of tokens (strings that are bits of the document). This lexer also makes it easy to get back the exact original content of the document. Unless there is a bug in the recipe, the following code should always succeed:

tokens = lexxml(data)
data2 = "".join(tokens)
assert data == data2

If you find any bugs that disallow this, please report them! There is a second, optional argument to lexxml that allows you to get back only markup and ignore the text of the document. This is useful as a performance optimization when you care only about tags. The walktokens function in the recipe shows how to walk over the tokens and work with them.
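For example, the form-level editing mentioned at the start of this recipe (renaming elements without otherwise disturbing the document) drops out naturally from the token list. The following rename_tag helper is a minimal sketch, not part of the recipe itself; it rewrites only the start-, end-, and empty-tag tokens for one element name and leaves every other byte of the document, including undefined entity references, exactly as it found it:

import re

def rename_tag(data, oldname, newname):
    # Rewrite <oldname ...>, </oldname>, and <oldname/> tokens only;
    # text, comments, and processing instructions pass through untouched.
    tag_re = re.compile(r"<(/?)%s([\s/>])" % re.escape(oldname))
    out = []
    for token in lexxml(data):
        if token.startswith("<") and not token.startswith(("<!", "<?")):
            token = tag_re.sub(r"<\g<1>%s\g<2>" % newname, token, count=1)
        out.append(token)
    return "".join(out)

broken = "<?xml version='1.0'?><para>one &undefined.entity; two</para>"
print rename_tag(broken, "para", "p")
# <?xml version='1.0'?><p>one &undefined.entity; two</p>

A real XML parser would refuse this document because the entity is undefined; the shallow lexer never even looks at it.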
Example 12-1. XML lexing
import re

class recollector:
    # Collects named regular expressions; each new pattern may refer to
    # earlier ones with %(name)s, which is expanded as it is added.
    def __init__(self):
        self.res = {}
    def add(self, name, reg):
        re.compile(reg)                 # Check that it is valid
        self.res[name] = reg % self.res

collector = recollector()
a = collector.add

a("TextSE", "[^<]+")
a("UntilHyphen", "[^-]*-")
a("Until2Hyphens", "%(UntilHyphen)s(?:[^-]%(UntilHyphen)s)*-")
a("CommentCE", "%(Until2Hyphens)s>?")
a("UntilRSBs", "[^\\]]*](?:[^\\]]+])*]+")
a("CDATA_CE", "%(UntilRSBs)s(?:[^\\]>]%(UntilRSBs)s)*>")
a("S", "[ \\n\\t\\r]+")
a("NameStrt", "[A-Za-z_:]|[^\\x00-\\x7F]")
a("NameChar", "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]")
a("Name", "(?:%(NameStrt)s)(?:%(NameChar)s)*")
a("QuoteSE", "\"[^\"]*\"|'[^']*'")
a("DT_IdentSE", "%(S)s%(Name)s(?:%(S)s(?:%(Name)s|%(QuoteSE)s))*")
a("MarkupDeclCE", "(?:[^\\]\"'><]+|%(QuoteSE)s)*>")
a("S1", "[\\n\\r\\t ]")
a("UntilQMs", "[^?]*\\?+")
a("PI_Tail", "\\?>|%(S1)s%(UntilQMs)s(?:[^>?]%(UntilQMs)s)*>")
a("DT_ItemSE",
  "<(?:!(?:--%(Until2Hyphens)s>|[^-]%(MarkupDeclCE)s)|\\?%(Name)s"
  "(?:%(PI_Tail)s))|%%%(Name)s;|%(S)s")
a("DocTypeCE",
  "%(DT_IdentSE)s(?:%(S)s)?(?:\\[(?:%(DT_ItemSE)s)*](?:%(S)s)?)?>?")
a("DeclCE",
  "--(?:%(CommentCE)s)?|\\[CDATA\\[(?:%(CDATA_CE)s)?|DOCTYPE"
  "(?:%(DocTypeCE)s)?")
a("PI_CE", "%(Name)s(?:%(PI_Tail)s)?")
a("EndTagCE", "%(Name)s(?:%(S)s)?>?")
a("AttValSE", "\"[^<\"]*\"|'[^<']*'")
a("ElemTagCE",
  "%(Name)s(?:%(S)s%(Name)s(?:%(S)s)?=(?:%(S)s)?(?:%(AttValSE)s))*"
  "(?:%(S)s)?/?>?")
a("MarkupSPE",
  "<(?:!(?:%(DeclCE)s)?|\\?(?:%(PI_CE)s)?|/(?:%(EndTagCE)s)?|"
  "(?:%(ElemTagCE)s)?)")
a("XML_SPE", "%(TextSE)s|%(MarkupSPE)s")
a("XML_MARKUP_ONLY_SPE", "%(MarkupSPE)s")

def lexxml(data, markuponly=0):
    # Split an XML document into a list of markup and text tokens
    if markuponly:
        reg = "XML_MARKUP_ONLY_SPE"
    else:
        reg = "XML_SPE"
    regex = re.compile(collector.res[reg])
    return regex.findall(data)

def assertlex(data, numtokens, markuponly=0):
    tokens = lexxml(data, markuponly)
    if len(tokens) != numtokens:
        assert len(lexxml(data)) == numtokens, \
            "data = '%s', numtokens = '%s'" % (data, numtokens)
    if not markuponly:
        assert "".join(tokens) == data
    walktokens(tokens)

def walktokens(tokens):
    print
    for token in tokens:
        if token.startswith("<"):
            if token.startswith("<!"):
                print "declaration:", token
            elif token.startswith("<?xml"):
                print "xml declaration:", token
            elif token.startswith("<?"):
                print "processing instruction:", token
            elif token.startswith("</"):
                print "end-tag:", token
            elif token.endswith("/>"):
                print "empty-tag:", token
            elif token.endswith(">"):
                print "start-tag:", token
            else:
                print "error:", token
        else:
            print "text:", token

def testlexer():
    # This test suite could be larger!
    assertlex("<abc/>", 1)
    assertlex("<abc><def/></abc>", 3)
    assertlex("<abc>Blah</abc>", 3)
    assertlex("<abc>Blah</abc>", 2, markuponly=1)
    assertlex("<?xml version='1.0'?><abc>Blah</abc>", 3, markuponly=1)
    assertlex("<abc>Blah&foo;Blah</abc>", 3)
    assertlex("<abc>Blah&foo;Blah</abc>", 2, markuponly=1)
    assertlex("<abc><abc>", 2)
    assertlex("</abc></abc>", 2)
    assertlex("<abc></def></abc>", 3)

if __name__ == "__main__":
    testlexer()
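The payoff of performing only the first of the three parser tasks is error tolerance. As a rough sketch (the sample document and the choice of xml.dom.minidom as the strict parser are just for illustration, with the functions of Example 12-1 assumed to be in scope), a well-formedness error that stops a real parser cold does not bother lexxml at all:

import xml.dom.minidom

bad = "<root><item>one</wrong></root>"
try:
    xml.dom.minidom.parseString(bad)
except Exception, e:
    print "strict parser choked:", e

for token in lexxml(bad):
    print repr(token)
# Every fragment comes back, including the mismatched '</wrong>'
# end-tag, so your own code can decide how to repair or ignore it.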
This recipe is based on the following article, with regular expressions translated from Perl into Python: “REX: XML Shallow Parsing with Regular Expressions”, Robert D. Cameron, Markup Languages: Theory and Applications, Summer 1999, pp. 61-88, http://www.cs.sfu.ca/~cameron/REX.html.