Credit: Itamar Shtull-Trauring
You have received some HTML input from a user and need to make sure that the HTML is clean. You want to allow only safe tags, ensure that tags needing closure are indeed closed, and, ideally, strip out any JavaScript that might be part of the page.
The sgmllib module helps with cleaning up the HTML tags, but we still have to fight against the JavaScript:
import sgmllib, string

class StrippingParser(sgmllib.SGMLParser):

    # These are the HTML tags that we will leave intact
    valid_tags = ('b', 'a', 'i', 'br', 'p')
    tolerate_missing_closing_tags = ('br', 'p')

    from htmlentitydefs import entitydefs # replace entitydefs from sgmllib

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []
        self.endTagList = []

    def handle_data(self, data):
        self.result.append(data)

    def handle_charref(self, name):
        self.result.append("&#%s;" % name)

    def handle_entityref(self, name):
        # Emit the trailing ';' only for entities that are actually defined
        x = ';' * self.entitydefs.has_key(name)
        self.result.append("&%s%s" % (name, x))

    def unknown_starttag(self, tag, attrs):
        """ Delete all tags except for legal ones. """
        if tag in self.valid_tags:
            self.result.append('<' + tag)
            for k, v in attrs:
                # Drop event handlers (onClick, ...) and javascript: URLs
                if string.lower(k[0:2]) != 'on' and string.lower(
                        v[0:10]) != 'javascript':
                    self.result.append(' %s="%s"' % (k, v))
            self.result.append('>')
            if tag not in self.tolerate_missing_closing_tags:
                endTag = '</%s>' % tag
                self.endTagList.insert(0, endTag)

    def unknown_endtag(self, tag):
        if tag in self.valid_tags:
            # We don't ensure proper nesting of opening/closing tags
            endTag = '</%s>' % tag
            self.result.append(endTag)
            # Tolerated tags such as <p> are never queued, so check before
            # removing to avoid a ValueError on input like "<p>text</p>"
            if endTag in self.endTagList:
                self.endTagList.remove(endTag)

    def cleanup(self):
        """ Append missing closing tags. """
        self.result.extend(self.endTagList)

def strip(s):
    """ Strip unsafe HTML tags and JavaScript from string s. """
    parser = StrippingParser()
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return ''.join(parser.result)
This recipe uses sgmllib to get rid of any HTML tags, except for those specified in the valid_tags tuple. It also tolerates missing closing tags only for those tags specified in tolerate_missing_closing_tags.
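The recipe's code targets Python 2; the sgmllib module was removed in Python 3. As a rough illustration of the same whitelist idea on Python 3, here is a minimal sketch that swaps in html.parser.HTMLParser. It is a port made for illustration, not the author's code, and the attribute name end_tags and the use of html.escape on attribute values are assumptions of this sketch.

```python
from html import escape
from html.entities import entitydefs
from html.parser import HTMLParser


class StrippingParser(HTMLParser):
    """Python 3 sketch of the recipe's whitelist approach."""

    valid_tags = ('b', 'a', 'i', 'br', 'p')
    tolerate_missing_closing_tags = ('br', 'p')

    def __init__(self):
        # convert_charrefs=False reports entity and character references
        # as separate events, so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.result = []
        self.end_tags = []  # hypothetical name for the recipe's endTagList

    def handle_data(self, data):
        self.result.append(data)

    def handle_charref(self, name):
        self.result.append('&#%s;' % name)

    def handle_entityref(self, name):
        # As in the recipe, add ';' only for entities that are defined.
        suffix = ';' if name in entitydefs else ''
        self.result.append('&%s%s' % (name, suffix))

    def handle_starttag(self, tag, attrs):
        if tag not in self.valid_tags:
            return
        self.result.append('<' + tag)
        for key, value in attrs:
            value = value or ''  # valueless attributes arrive as None
            # HTMLParser already lowercases attribute names.
            if key.startswith('on') or value.lower().startswith('javascript'):
                continue
            self.result.append(' %s="%s"' % (key, escape(value)))
        self.result.append('>')
        if tag not in self.tolerate_missing_closing_tags:
            self.end_tags.insert(0, '</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> like <br>: emit the start tag only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in self.valid_tags:
            return
        end_tag = '</%s>' % tag
        self.result.append(end_tag)
        if end_tag in self.end_tags:
            self.end_tags.remove(end_tag)


def strip(s):
    """Strip tags outside valid_tags and close any tags left open."""
    parser = StrippingParser()
    parser.feed(s)
    parser.close()
    parser.result.extend(parser.end_tags)
    return ''.join(parser.result)
```

With this sketch, strip('<b onclick="evil()">bold <i>text') drops the handler and closes both tags, giving '<b>bold <i>text</i></b>'. Like the recipe, it does not check that tags are properly nested.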
Getting rid of JavaScript is much harder. This recipe’s code handles only attribute values that start with javascript: and attributes whose names start with on, such as onClick and similar handlers. The contents of <script> tags will be printed as part of the text, and vbscript:, jscript:, and other weird URLs may be legal in some versions of IE. We could do a better job on both scores, but only at the price of substantial additional complications.
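To give a feel for what a stricter attribute check might involve, here is a hedged sketch that accepts URL schemes from an allowlist rather than rejecting known bad ones. The function name is_safe_attribute, the choice of URL-bearing attributes, and the list of safe schemes are all assumptions made for illustration; none of them come from the recipe, and the sketch does nothing about the contents of <script> tags.

```python
import re

# Assumptions for illustration only: which attributes carry URLs and
# which schemes count as safe are choices made for this sketch.
URL_ATTRIBUTES = ('href', 'src')
SAFE_SCHEMES = ('http', 'https', 'mailto', 'ftp')


def is_safe_attribute(name, value):
    """Return True if the attribute passes this sketch's checks."""
    name = name.lower()
    if name.startswith('on'):        # onClick, onLoad, and similar handlers
        return False
    if name not in URL_ATTRIBUTES:
        return True
    # Remove whitespace and control characters before comparing, since
    # input such as "java\tscript:" could otherwise hide the scheme.
    compact = re.sub(r'[\s\x00-\x1f]+', '', value).lower()
    match = re.match(r'([a-z][a-z0-9+.\-]*):', compact)
    if match is None:
        return True                  # relative URL with no scheme
    return match.group(1) in SAFE_SCHEMES
```

An allowlist rejects vbscript:, jscript:, and any scheme not listed, at the cost of also rejecting legitimate schemes that were left out.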
There is one Pythonic good habit worth noticing in the code. When you need to put together a large string result out of many small pieces, don’t keep the string as a string during the composition. All the += or equivalent operations will kill your performance (which would be O(N²), terrible for large enough values of N). Instead, keep the result as a list of strings, growing it with calls to append or extend, and make the result a string only when you’re done accumulating all the pieces, with a single invocation of ''.join on the result list. This is a much faster approach (specifically, it’s roughly O(N) when amortized over large-enough N). If you get into the habit of building strings out of pieces the Python way, you’ll never have to worry about this aspect of your program’s performance.
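The two styles can be put side by side in a small sketch. The function names build_with_join and build_with_concat are hypothetical, chosen only for this illustration; both return the same string, and only the way the pieces are accumulated differs.

```python
def build_with_join(pieces):
    """Accumulate pieces in a list and join once at the end."""
    result = []
    for piece in pieces:
        result.append(piece)
    return ''.join(result)


def build_with_concat(pieces):
    """Repeated += on a string; shown only for comparison."""
    result = ''
    for piece in pieces:
        result += piece
    return result


pieces = ['<p>', 'Hello', ', ', 'world', '</p>']
assert build_with_join(pieces) == build_with_concat(pieces)
```

Some CPython versions can optimize += on strings in simple cases, but that is an implementation detail rather than a guarantee, so the list-and-join form remains the dependable choice.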
Recipe 3.6; documentation for the standard library module sgmllib in the Library Reference; the W3C page on HTML (http://www.w3.org/MarkUp/).