Stripping Dangerous Tags and JavaScript from HTML

Credit: Itamar Shtull-Trauring

Problem

You have received some HTML input from a user and need to make sure that the HTML is clean. You want to allow only safe tags, to ensure that tags needing closure are indeed closed, and, ideally, to strip out any JavaScript that might be part of the page.

Solution

The sgmllib module (part of the Python 2 standard library; it was removed in Python 3) helps with cleaning up the HTML tags, but we still have to fight against the JavaScript:

import sgmllib, string

class StrippingParser(sgmllib.SGMLParser):
    # These are the HTML tags that we will leave intact
    valid_tags = ('b', 'a', 'i', 'br', 'p')
    tolerate_missing_closing_tags = ('br', 'p')
    from htmlentitydefs import entitydefs # replace entitydefs from sgmllib

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []
        self.endTagList = []

    def handle_data(self, data):
        self.result.append(data)

    def handle_charref(self, name):
        self.result.append("&#%s;" % name)

    def handle_entityref(self, name):
        # Emit the trailing ';' only for entity names we recognize
        x = ';' * (name in self.entitydefs)
        self.result.append("&%s%s" % (name, x))

    def unknown_starttag(self, tag, attrs):
        """ Delete all tags except for legal ones. """
        if tag in self.valid_tags:
            self.result.append('<' + tag)
            for k, v in attrs:
                # Skip event handlers (onclick, onload, ...) and
                # values that start with 'javascript' (javascript: URLs)
                if k[:2].lower() != 'on' and v[:10].lower() != 'javascript':
                    self.result.append(' %s="%s"' % (k, v))
            self.result.append('>')
            if tag not in self.tolerate_missing_closing_tags:
                endTag = '</%s>' % tag
                self.endTagList.insert(0,endTag)

    def unknown_endtag(self, tag):
        if tag in self.valid_tags:
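            # We don't ensure proper nesting of opening/closing tags
            endTag = '</%s>' % tag
            self.result.append(endTag)
            # The closer may not be pending (for example </p>, or a stray
            # closing tag), so guard against ValueError from remove()
            if endTag in self.endTagList:
                self.endTagList.remove(endTag)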

    def cleanup(self):
        """ Append missing closing tags. """
        self.result.extend(self.endTagList)

def strip(s):
    """ Strip unsafe HTML tags and Javascript from string s. """
    parser = StrippingParser()
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return ''.join(parser.result)

Discussion

This recipe uses sgmllib to get rid of any HTML tags, except for those specified in the valid_tags tuple. An allowed tag that is left unclosed gets its closing tag appended at the end of the output by cleanup, except for the tags listed in tolerate_missing_closing_tags, which may stay unclosed.
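
Because sgmllib no longer ships with Python 3, the recipe as written runs only on Python 2. What follows is a rough sketch of the same approach on Python 3, assuming html.parser.HTMLParser as a stand-in for sgmllib.SGMLParser. The names AllowlistParser and strip_html are hypothetical. The sketch also re-escapes attribute values, because HTMLParser hands them over already unescaped; the recipe does not do this. It shares the recipe's limitations and is not a vetted sanitizer.

```python
from html import escape
from html.entities import entitydefs
from html.parser import HTMLParser

VALID_TAGS = ('b', 'a', 'i', 'br', 'p')
TOLERATE_MISSING_CLOSING_TAGS = ('br', 'p')


class AllowlistParser(HTMLParser):
    """Hypothetical Python 3 counterpart of the recipe's StrippingParser."""

    def __init__(self):
        # convert_charrefs=False routes references through
        # handle_entityref/handle_charref, as in the recipe.
        super().__init__(convert_charrefs=False)
        self.result = []
        self.pending_end_tags = []

    def handle_data(self, data):
        self.result.append(data)

    def handle_charref(self, name):
        self.result.append('&#%s;' % name)

    def handle_entityref(self, name):
        # Keep the trailing ';' only for known entity names.
        self.result.append('&%s%s' % (name, ';' if name in entitydefs else ''))

    def handle_starttag(self, tag, attrs):
        if tag not in VALID_TAGS:
            return
        self.result.append('<' + tag)
        for name, value in attrs:
            if value is None:  # bare attribute with no value
                value = name
            # Same test as the recipe: drop on* handlers and values
            # that start with 'javascript'.
            if name[:2].lower() != 'on' and value[:10].lower() != 'javascript':
                # Assumption: re-escape, since the value arrives unescaped.
                self.result.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.result.append('>')
        if tag not in TOLERATE_MISSING_CLOSING_TAGS:
            self.pending_end_tags.insert(0, '</%s>' % tag)

    def handle_endtag(self, tag):
        if tag in VALID_TAGS:
            end_tag = '</%s>' % tag
            self.result.append(end_tag)
            if end_tag in self.pending_end_tags:
                self.pending_end_tags.remove(end_tag)


def strip_html(s):
    """Keep only allowlisted tags; append any closing tags still pending."""
    parser = AllowlistParser()
    parser.feed(s)
    parser.close()
    parser.result.extend(parser.pending_end_tags)
    return ''.join(parser.result)


if __name__ == '__main__':
    print(strip_html('<b>hi'))
    print(strip_html('<a href="javascript:alert(1)" onclick="x()">go</a>'))
```

In this sketch an unclosed <b> gets its </b> appended at the end, and both the javascript: link target and the onclick handler are dropped while the <a> element itself is kept.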

Getting rid of JavaScript is much harder. This recipe’s code drops only attributes whose values start with javascript (as in javascript: URLs) and attributes whose names start with on (onClick and similar handlers). The contents of <script> tags still end up in the output as plain text, and vbscript:, jscript:, and other unusual URL schemes, which some versions of IE may accept, pass through untouched. We could do a better job on both scores, but only at the price of substantial additional complications.
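
To make these gaps concrete, here is a small sketch that isolates the recipe's attribute test in a standalone function. The name recipe_keeps and the sample attributes are hypothetical and exist only for illustration. A result of True means the attribute would be copied to the output.

```python
def recipe_keeps(name, value):
    """Mirror the recipe's attribute test; True means the attribute is kept."""
    return name[:2].lower() != 'on' and value[:10].lower() != 'javascript'


# Hypothetical sample attributes, chosen to probe the edges of the test.
samples = [
    ('onclick', 'steal()'),            # event handler
    ('href', 'javascript:steal()'),    # javascript: URL
    ('href', 'JavaScript:steal()'),    # same, mixed case
    ('href', 'vbscript:steal()'),      # another scripting scheme
    ('href', ' javascript:steal()'),   # leading space before the scheme
    ('href', 'http://example.com/'),   # ordinary link
]

if __name__ == '__main__':
    for name, value in samples:
        print('%-8s %-24r kept=%s' % (name, value, recipe_keeps(name, value)))
```

The first three samples are rejected. The vbscript: value and the value with a leading space are kept, because the test looks only at the first ten characters of the value exactly as given.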

There is one Pythonic good habit worth noticing in the code. When you need to put together a large string result out of many small pieces, don’t keep the result as a string during the composition. Repeated += or equivalent operations can make the whole job O(N²), which is terrible for large enough values of N, because each concatenation may copy everything accumulated so far. Instead, keep the result as a list of strings, growing it with calls to append or extend, and turn it into a string only when you’re done accumulating all the pieces, with a single invocation of ''.join on the result list. This is a much faster approach (roughly O(N) overall, since list appends take amortized constant time). If you get into the habit of building strings out of pieces the Python way, you’ll never have to worry about this aspect of your program’s performance.
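
As a minimal sketch of the two styles side by side (the function names build_with_concat and build_with_join are hypothetical, not part of the recipe):

```python
def build_with_concat(pieces):
    """Grow a string piece by piece; may recopy the accumulated text each time."""
    result = ''
    for piece in pieces:
        result += piece
    return result


def build_with_join(pieces):
    """Collect pieces in a list and join once at the end, as the recipe does."""
    result = []
    for piece in pieces:
        result.append(piece)
    return ''.join(result)


if __name__ == '__main__':
    pieces = ['<b>', 'bold', '</b>', ' and ', '<i>', 'italic', '</i>']
    print(build_with_join(pieces))
    print(build_with_concat(pieces) == build_with_join(pieces))
```

Both functions return the same string; only the cost of getting there differs. Some CPython versions can optimize += on strings in simple loops like this one, but that is an implementation detail and not something to rely on, so the join form remains the safer habit.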

See Also

Recipe 3.6; documentation for the standard library module sgmllib in the Library Reference; the W3C page on HTML (http://www.w3.org/MarkUp/).
