Credit: Luther Blissett
You have plain text that follows certain common conventions (e.g., paragraphs are separated by empty lines, text meant to be highlighted is marked up _like this_), and you need to mark it up automatically as XML.
Producing XML markup from data that is otherwise structured, including plain text that rigorously follows certain conventions, is really quite easy:
def markup(text, paragraph_tag='paragraph', inline_tags={'_':'highlight'}): # First we must escape special characters, which have special meaning in XML text = text.replace('&', "&")\ .replace('<', "<")\ .replace('"', """)\ .replace('>', ">") # paragraph markup; pass any false value as the paragraph_tag argument to disable if paragraph_tag: # Get list of lines, removing leading and trailing empty lines: lines = text.splitlines(1) while lines and lines[-1].isspace(): lines.pop( ) while lines and lines[0].isspace( ): lines.pop(0) # Insert paragraph tags on empty lines: marker = '</%s>\n\n<%s>' % (paragraph_tag, paragraph_tag) for i in range(len(lines)): if lines[i].isspace( ): lines[i] = marker # remove 'empty paragraphs': if i!=0 and lines[i-1] == marker: lines[i-1] = '' # Join text again lines.insert(0, '<%s>'%paragraph_tag) lines.append('</%s>\n'%paragraph_tag) text = ''.join(lines) # inline-tags markup; pass any false value as the inline_tags argument to disable if inline_tags: for ch, tag in inline_tags.items( ): pieces = text.split(ch) # Text should have an even count of ch, so pieces should have # odd length. But just in case there's an extra unmatched ch: if len(pieces)%2 == 0: pieces.append('') for i in range(1, len(pieces), 2): pieces[i] = '<%s>%s</%s>'%(tag, pieces[i], tag) # Join text again text = ''.join(pieces) return text if _ _name_ _ == '_ _main_ _': sample = """ Just some _important_ text, with inlike "_markup_" by convention. Oh, and paragraphs separated by empty (or all-whitespace) lines. Sometimes more than one, wantonly. I've got _lots_ of old text like that around -- don't you? """ print markup(sample)
Sometimes you have a lot of plain text that needs to be automatically marked up in a structured way—usually, these days, as XML. If the plain text you start with follows certain typical conventions, you can use them heuristically to get each text snippet into marked-up form with reasonably little effort.
In my case, the two conventions I had to work with were: paragraphs are separated by blank lines (empty or with some spaces, and sometimes several redundant blank lines for just one paragraph separation), and underlines (one before, one after) are used to indicate important text that should be highlighted. This seems to be quite far from the brave new world of:
<paragraph>blah blah</paragraph>
But in reality, it isn’t as far as all that, thanks,
of course, to Python! While you could use regular expressions for
this task, I prefer the simplicity of the
split
and splitlines
methods, with
join
to put the strings back together again.
StructuredText, the latest incarnation of which, ReStructuredText, is
part of the docutils
package (http://docutils.sourceforge.net/).
Get Python Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.