Search the Catalog
Python Standard Library

Python Standard Library

By Fredrik Lundh
May 2001
0-596-00096-0, Order Number: 0960
300 pages, $29.95

Chapter 5
File Formats

Contents:

Overview
The xmllib Module
The xml.parsers.expat Module
The sgmllib Module
The htmllib Module
The htmlentitydefs Module
The formatter Module
The ConfigParser Module
The netrc Module
The shlex Module
The zipfile Module
The gzip Module

Overview

This chapter describes a number of modules that are used to parse different file formats.

Markup Languages

Python comes with extensive support for the Extensible Markup Language (XML) and Hypertext Markup Language (HTML) file formats. Python also provides basic support for Standard Generalized Markup Language (SGML).

All these formats share the same basic structure because both HTML and XML are derived from SGML. Each document contains a mix of start tags, end tags, plain text (also called character data), and entity references, as shown in the following:

<document name="sample.xml">
    <header>This is a header</header>
    <body>This is the body text.  The text can contain
    plain text (&quot;character data&quot;), tags, and
    entities.
    </body>
</document>

In the previous example, <document>, <header>, and <body> are start tags. For each start tag, there's a corresponding end tag that looks similar, but has a slash before the tag name. The start tag can also contain one or more attributes, like the name attribute in this example.

Everything between a start tag and its matching end tag is called an element. In the previous example, the document element contains two other elements: header and body.

Finally, &quot; is a character entity. It is used to represent reserved characters in the text sections. In this case, it's an ampersand (&), which is used to start the entity itself. Other common entities include &lt; for "less than" (<), and &gt; for "greater than" (>).

While XML, HTML, and SGML all share the same building blocks, there are important differences between them. In XML, all elements must have both start tags and end tags, and the tags must be properly nested (if they are, the document is said to be well-formed). In addition, XML is case-sensitive, so <document> and <Document> are two different element types.

HTML, in contrast, is much more flexible. The HTML parser can often fill in missing tags; for example, if you open a new paragraph in HTML using the <P> tag without closing the previous paragraph, the parser automatically adds a </P> end tag. HTML is also case-insensitive. On the other hand, XML allows you to define your own elements, while HTML uses a fixed element set, as defined by the HTML specifications.

SGML is even more flexible. In its full incarnation, you can use a custom declaration to define how to translate the source text into an element structure, and a document type description (DTD) to validate the structure and fill in missing tags. Technically, both HTML and XML are SGML applications; they both have their own SGML declaration, and HTML also has a standard DTD.

Python comes with parsers for all markup flavors. While SGML is the most flexible of the formats, Python's sgmllib parser is actually pretty simple. It avoids most of the problems by only understanding enough of the SGML standard to be able to deal with HTML. It doesn't handle DTDs either; instead, you can customize the parser via subclassing.

Python's HTML support is built on the SGML parser. The htmllib parser delegates the actual rendering to a formatter object. The formatter module contains a couple of standard formatters.

Python's XML support is most complex. In Python 1.5.2, the built-in support was limited to the xmllib parser, which is pretty similar to the sgmllib module (with one important difference; xmllib actually tries to support the entire XML standard). Python 2.0 comes with more advanced XML tools, based on the optional expat parser.

Configuration Files

The ConfigParser module reads and writes a simple configuration file format, similar to Windows INI files.

The netrc file reads .netrc configuration files, and the shlex module can be used to read any configuration file using a shell script-like syntax.

Archive Formats

Python's standard library provides support for the popular GZIP and ZIP (2.0 only) formats. The gzip module reads and writes GZIP files, and the zipfile reads and writes ZIP files. Both modules depend on the zlib data compression module.

The xmllib Module

The xmlib module provides a simple XML parser, using regular expressions to pull the XML data apart, as shown in Example 5-1. The parser does basic checks on the document, such as a check to see that there is only one top-level element and a check to see that all tags are balanced.

You feed XML data to this parser piece by piece (as data arrives over a network, for example). The parser calls methods in itself for start tags, data sections, end tags, and entities, among other things.

If you're only interested in a few tags, you can define special start_tag and end_tag methods, where tag is the tag name. The start functions are called with the attributes given as a dictionary.

Example 5.1. Using the xmllib Module to Extract Information from an Element

File: xmllib-example-1.py

import xmllib

class Parser(xmllib.XMLParser):
    # get quotation number

    def __init__(self, file=None):
        xmllib.XMLParser.__init__(self)
        if file:
            self.load(file)

    def load(self, file):
        while 1:
            s = file.read(512)
            if not s:
                break
            self.feed(s)
        self.close()

    def start_quotation(self, attrs):
        print "id =>", attrs.get("id")
        raise EOFError

try:
    c = Parser()
    c.load(open("samples/sample.xml"))
except EOFError:
    pass

id => 031

Example 5-2 contains a simple (and incomplete) rendering engine. The parser maintains an element stack (__tags), which it passes to the renderer, together with text fragments. The renderer looks up the current tag hierarchy in a style dictionary, and if it isn't already there, it creates a new style descriptor by combining bits and pieces from the stylesheet.

Example 5.2. Using the xmllib Module

File: xmllib-example-2.py

import xmllib
import string, sys

STYLESHEET = {
    # each element can contribute one or more style elements
    "quotation": {"style": "italic"},
    "lang": {"weight": "bold"},
    "name": {"weight": "medium"},
}

class Parser(xmllib.XMLParser):
    # a simple styling engine

    def __init__(self, renderer):
        xmllib.XMLParser.__init__(self)
        self.__data = []
        self.__tags = []
        self.__renderer = renderer

    def load(self, file):
        while 1:
            s = file.read(8192)
            if not s:
                break
            self.feed(s)
        self.close()

    def handle_data(self, data):
        self.__data.append(data)

    def unknown_starttag(self, tag, attrs):
        if self.__data:
            text = string.join(self.__data, "")
            self.__renderer.text(self.__tags, text)
        self.__tags.append(tag)
        self.__data = []

    def unknown_endtag(self, tag):
        self.__tags.pop()
        if self.__data:
            text = string.join(self.__data, "")
            self.__renderer.text(self.__tags, text)
        self.__data = []

class DumbRenderer:

    def __init__(self):
        self.cache = {}

    def text(self, tags, text):
        # render text in the style given by the tag stack
        tags = tuple(tags)
        style = self.cache.get(tags)
        if style is None:
            # figure out a combined style
            style = {}
            for tag in tags:
                s = STYLESHEET.get(tag)
                if s:
                    style.update(s)
            self.cache[tags] = style # update cache
        # write to standard output
        sys.stdout.write("%s =>\n" % style)
        sys.stdout.write("  " + repr(text) + "\n")

#
# try it out

r = DumbRenderer()
c = Parser(r)
c.load(open("samples/sample.xml"))

{'style': 'italic'} =>
  'I\'ve had a lot of developers come up to me and\012say,
  "I haven\'t had this much fun in a long time. It sure
  beats\012writing '
{'style': 'italic', 'weight': 'bold'} =>
  'Cobol'
{'style': 'italic'} =>
  '" -- '
{'style': 'italic', 'weight': 'medium'} =>
  'James Gosling'
{'style': 'italic'} =>
  ', on\012'
{'weight': 'bold'} =>
  'Java'
{'style': 'italic'} =>
  '.'

The xml.parsers.expat Module

(Optional) The xml.parsers.expat module is an interface to James Clark's Expat XML parser. Example 5-3 demonstrates this full-featured and fast parser, which is an excellent choice for production use.

Example 5.3. Using the xml.parsers.expat Module

File: xml-parsers-expat-example-1.py

from xml.parsers import expat

class Parser:

    def __init__(self):
        self._parser = expat.ParserCreate()
        self._parser.StartElementHandler = self.start
        self._parser.EndElementHandler = self.end
        self._parser.CharacterDataHandler = self.data

    def feed(self, data):
        self._parser.Parse(data, 0)

    def close(self):
        self._parser.Parse("", 1) # end of data
        del self._parser # get rid of circular references

    def start(self, tag, attrs):
        print "START", repr(tag), attrs

    def end(self, tag):
        print "END", repr(tag)

    def data(self, data):
        print "DATA", repr(data)

p = Parser()
p.feed("<tag>data</tag>")
p.close()

START u'tag' {}
DATA u'data'
END u'tag'

Note that the parser returns Unicode strings, even if you pass it ordinary text. By default, the parser interprets the source text as UTF-8 (as per the XML standard). To use other encodings, make sure the XML file contains an encoding directive. Example 5-4 shows how to read ISO Latin-1 text using xml.parsers.expat.

Example 5.4. Using the xml.parsers.expat Module to Read ISO Latin-1 Text

File: xml-parsers-expat-example-2.py

from xml.parsers import expat

class Parser:

    def __init__(self):
        self._parser = expat.ParserCreate()
        self._parser.StartElementHandler = self.start
        self._parser.EndElementHandler = self.end
        self._parser.CharacterDataHandler = self.data

    def feed(self, data):
        self._parser.Parse(data, 0)

    def close(self):
        self._parser.Parse("", 1) # end of data
        del self._parser # get rid of circular references

    def start(self, tag, attrs):
        print "START", repr(tag), attrs

    def end(self, tag):
        print "END", repr(tag)

    def data(self, data):
        print "DATA", repr(data)

p = Parser()
p.feed("""\
<?xml version='1.0' encoding='iso-8859-1'?>
<author>
<name>fredrik lundh</name>
<city>linköping</city>
</author>
"""
)
p.close()

START u'author' {}
DATA u'\012'
START u'name' {}
DATA u'fredrik lundh'
END u'name'
DATA u'\012'
START u'city' {}
DATA u'link\366ping'
END u'city'
DATA u'\012'
END u'author'

The sgmllib Module

The sgmllib module, shown in Example 5-5, provides a basic SGML parser. It works pretty much the same as the xmllib parser, but is less restrictive (and less complete).

Like in xmllib, this parser calls methods in itself to deal with things like start tags, data sections, end tags, and entities. If you're only interested in a few tags, you can define special start and end methods.

Example 5.5. Using the sgmllib Module to Extract the Title Element

File: sgmllib-example-1.py

import sgmllib
import string

class FoundTitle(Exception):
    pass

class ExtractTitle(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        self.data = []

    def end_title(self):
        self.title = string.join(self.data, "")
        raise FoundTitle # abort parsing!

def extract(file):
    # extract title from an HTML/SGML stream
    p = ExtractTitle()
    try:
        while 1:
            # read small chunks
            s = file.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundTitle:
        return p.title
    return None

#
# try it out

print "html", "=>", extract(open("samples/sample.htm"))
print "sgml", "=>", extract(open("samples/sample.sgm"))

html => A Title.
sgml => Quotations

To handle all tags, overload the unknown_starttag and unknown_endtag methods instead, as Example 5-6 demonstrates.

Example 5.6. Using the sgmllib Module to Format an SGML Document

File: sgmllib-example-2.py

import sgmllib
import cgi, sys

class PrettyPrinter(sgmllib.SGMLParser):
    # A simple SGML pretty printer

    def __init__(self):
        # initialize base class
        sgmllib.SGMLParser.__init__(self)
        self.flag = 0

    def newline(self):
        # force newline, if necessary
        if self.flag:
            sys.stdout.write("\n")
        self.flag = 0

    def unknown_starttag(self, tag, attrs):
        # called for each start tag

        # the attrs argument is a list of (attr, value)
        # tuples. convert it to a string.
        text = ""
        for attr, value in attrs:
            text = text + " %s='%s'" % (attr, cgi.escape(value))

        self.newline()
        sys.stdout.write("<%s%s>\n" % (tag, text))

    def handle_data(self, text):
        # called for each text section
        sys.stdout.write(text)
        self.flag = (text[-1:] != "\n")

    def handle_entityref(self, text):
        # called for each entity
        sys.stdout.write("&%s;" % text)
        
    def unknown_endtag(self, tag):
        # called for each end tag
        self.newline()
        sys.stdout.write("<%s>" % tag)

#
# try it out

file = open("samples/sample.sgm")

p = PrettyPrinter()
p.feed(file.read())
p.close()

<chapter>
<title>
Quotations
<title>
<epigraph>
<attribution>
eff-bot, June 1997
<attribution>
<para>
<quote>
Nobody expects the Spanish Inquisition! Amongst
our weaponry are such diverse elements as fear, surprise,
ruthless efficiency, and an almost fanatical devotion to
Guido, and nice red uniforms &mdash; oh, damn!
<quote>
<para>
<epigraph>
<chapter>

Example 5-7 checks if an SGML document is "well-formed", in the XML sense. In a well-formed document, all elements are properly nested, with one end tag for each start tag.

To check this, we simply keep a list of open tags, and check that each end tag closes a matching start tag and that there are no open tags when we reach the end of the document.

Example 5.7. Using the sgmllib Module to Check if an SGML Document Is Well-Formed

File: sgmllib-example-3.py

import sgmllib

class WellFormednessChecker(sgmllib.SGMLParser):
    # check that an SGML document is 'well-formed'
    # (in the XML sense).

    def __init__(self, file=None):
        sgmllib.SGMLParser.__init__(self)
        self.tags = []
        if file:
            self.load(file)

    def load(self, file):
        while 1:
            s = file.read(8192)
            if not s:
                break
            self.feed(s)
        self.close()

    def close(self):
        sgmllib.SGMLParser.close(self)
        if self.tags:
            raise SyntaxError, "start tag %s not closed" % self.tags[-1]

    def unknown_starttag(self, start, attrs):
        self.tags.append(start)

    def unknown_endtag(self, end):
        start = self.tags.pop()
        if end != start:
            raise SyntaxError, "end tag %s does't match start tag %s" %\
                  (end, start)

try:
    c = WellFormednessChecker()
    c.load(open("samples/sample.htm"))
except SyntaxError:
    raise # report error
else:
    print "document is well-formed"

Traceback (innermost last):
...
SyntaxError: end tag head does't match start tag meta

Finally, Example 5-8 shows a class that allows you to filter HTML and SGML documents. To use this class, create your own base class, and implement the start and end methods.

Example 5.8. Using the sgmllib Module to Filter SGML Documents

File: sgmllib-example-4.py

import sgmllib
import cgi, string, sys

class SGMLFilter(sgmllib.SGMLParser):
    # sgml filter.  override start/end to manipulate
    # document elements

    def __init__(self, outfile=None, infile=None):
        sgmllib.SGMLParser.__init__(self)
        if not outfile:
            outfile = sys.stdout
        self.write = outfile.write
        if infile:
            self.load(infile)

    def load(self, file):
        while 1:
            s = file.read(8192)
            if not s:
                break
            self.feed(s)
        self.close()

    def handle_entityref(self, name):
        self.write("&%s;" % name)

    def handle_data(self, data):
        self.write(cgi.escape(data))

    def unknown_starttag(self, tag, attrs):
        tag, attrs = self.start(tag, attrs)
        if tag:
            if not attrs:
                self.write("<%s>" % tag)
            else:
                self.write("<%s" % tag)
                for k, v in attrs:
                    self.write(" %s=%s" % (k, repr(v)))
                self.write(">")

    def unknown_endtag(self, tag):
        tag = self.end(tag)
        if tag:
            self.write("</%s>" % tag)

    def start(self, tag, attrs):
        return tag, attrs # override

    def end(self, tag):
        return tag # override

class Filter(SGMLFilter):

    def fixtag(self, tag):
        if tag == "em":
            tag = "i"
        if tag == "string":
            tag = "b"
        return string.upper(tag)

    def start(self, tag, attrs):
        return self.fixtag(tag), attrs

    def end(self, tag):
        return self.fixtag(tag)

c = Filter()
c.load(open("samples/sample.htm"))

The htmllib Module

The htmlib module contains a tag-driven HTML parser, which sends data to a formatting object. Example 5-9 uses this module. For more examples on how to parse HTML files using this module, see the descriptions of the formatter module.

Example 5.9. Using the htmllib Module

File: htmllib-example-1.py

import htmllib
import formatter
import string

class Parser(htmllib.HTMLParser):
    # return a dictionary mapping anchor texts to lists
    # of associated hyperlinks

    def __init__(self, verbose=0):
        self.anchors = {}
        f = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self, f, verbose)

    def anchor_bgn(self, href, name, type):
        self.save_bgn()
        self.anchor = href

    def anchor_end(self):
        text = string.strip(self.save_end())
        if self.anchor and text:
            self.anchors[text] = self.anchors.get(text, []) + [self.anchor]

file = open("samples/sample.htm")
html = file.read()
file.close()

p = Parser()
p.feed(html)
p.close()

for k, v in p.anchors.items():
    print k, "=>", v

print

link => ['http://www.python.org']

If you're only out to parse an HTML file and not render it to an output device, it's usually easier to use the sgmllib module instead.

The htmlentitydefs Module

The htmlentitydefs module contains a dictionary with many ISO Latin 1 character entities used by HTML. Its use is demonstrated in Example 5-10.

Example 5.10. Using the htmlentitydefs Module

File: htmlentitydefs-example-1.py

import htmlentitydefs

entities = htmlentitydefs.entitydefs

for entity in "amp", "quot", "copy", "yen":
    print entity, "=", entities[entity]

amp = &
quot = "
copy = \302\251
yen = \302\245

Example 5-11 shows how to combine regular expressions with this dictionary to translate entities in a string (the opposite of cgi.escape).

Example 5.11. Using the htmlentitydefs Module to Translate Entities

File: htmlentitydefs-example-2.py

import htmlentitydefs
import re
import cgi

pattern = re.compile("&(\w+?);")

def descape_entity(m, defs=htmlentitydefs.entitydefs):
    # callback: translate one entity to its ISO Latin value
    try:
        return defs[m.group(1)]
    except KeyError:
        return m.group(0) # use as is

def descape(string):
    return pattern.sub(descape_entity, string)

print descape("&lt;spam&amp;eggs&gt;")
print descape(cgi.escape("<spam&eggs>"))

<spam&eggs>
<spam&eggs>

Finally, Example 5-12 shows how to use translate reserved XML characters and ISO Latin 1 characters to an XML string. This is similar to cgi.escape, but it also replaces non-ASCII characters.

Example 5.12. Escaping ISO Latin 1 Entities

File: htmlentitydefs-example-3.py

import htmlentitydefs
import re, string

# this pattern matches substrings of reserved and non-ASCII characters
pattern = re.compile(r"[&<>\"\x80-\xff]+")

# create character map
entity_map = {}

for i in range(256):
    entity_map[chr(i)] = "&%d;" % i

for entity, char in htmlentitydefs.entitydefs.items():
    if entity_map.has_key(char):
        entity_map[char] = "&%s;" % entity

def escape_entity(m, get=entity_map.get):
    return string.join(map(get, m.group()), "")

def escape(string):
    return pattern.sub(escape_entity, string)

print escape("<spam&eggs>")
print escape("\303\245 i \303\245a \303\244 e \303\266")

&lt;spam&amp;eggs&gt;
&aring; i &aring;a &auml; e &ouml;

The formatter Module

The formatter module provides formatter classes that can be used together with the htmllib module.

This module provides two class families, formatters and writers. Formatters convert a stream of tags and data strings from the HTML parser into an event stream suitable for an output device, and writers render that event stream on an output device. Example 5-13 demonstrates.

In most cases, you can use the AbstractFormatter class to do the formatting. It calls methods on the writer object, representing different kinds of formatting events. The AbstractWriter class simply prints a message for each method call.

Example 5.13. Using the formatter Module to Convert HTML to an Event Stream

File: formatter-example-1.py

import formatter
import htmllib

w = formatter.AbstractWriter()
f = formatter.AbstractFormatter(w)

file = open("samples/sample.htm")

p = htmllib.HTMLParser(f)
p.feed(file.read())
p.close()

file.close()

send_paragraph(1)
new_font(('h1', 0, 1, 0))
send_flowing_data('A Chapter.')
send_line_break()
send_paragraph(1)
new_font(None)
send_flowing_data('Some text. Some more text. Some')
send_flowing_data(' ')
new_font((None, 1, None, None))
send_flowing_data('emphasized')
new_font(None)
send_flowing_data(' text. A')
send_flowing_data(' link')
send_flowing_data('[1]')
send_flowing_data('.')

In addition to the AbstractWriter class, the formatter module provides a NullWriter class, which ignores all events passed to it, and a DumbWriter class that converts the event stream to a plain text document, as shown in Example 5-14.

Example 5.14. Using the formatter Module to Convert HTML to Plain Text

File: formatter-example-2.py

import formatter
import htmllib

w = formatter.DumbWriter() # plain text
f = formatter.AbstractFormatter(w)

file = open("samples/sample.htm")

# print html body as plain text
p = htmllib.HTMLParser(f)
p.feed(file.read())
p.close()

file.close()

# print links
print
print
i = 1
for link in p.anchorlist:
    print i, "=>", link
    i = i + 1

A Chapter.

Some text. Some more text. Some emphasized text. A link[1].

1 => http://www.python.org

Example 5-15 provides a custom Writer, which in this case is subclassed from the DumbWriter class. This version keeps track of the current font style and tweaks the output somewhat depending on the font.

Example 5.15. Using the formatter Module with a Custom Writer

File: formatter-example-3.py

import formatter
import htmllib, string

class Writer(formatter.DumbWriter):

    def __init__(self):
        formatter.DumbWriter.__init__(self)
        self.tag = ""
        self.bold = self.italic = 0
        self.fonts = []

    def new_font(self, font):
        if font is None:
            font = self.fonts.pop()
            self.tag, self.bold, self.italic = font
        else:
            self.fonts.append((self.tag, self.bold, self.italic))
            tag, bold, italic, typewriter = font
            if tag is not None:
                self.tag = tag
            if bold is not None:
                self.bold = bold
            if italic is not None:
                self.italic = italic

    def send_flowing_data(self, data):
        if not data:
            return
        atbreak = self.atbreak or data[0] in string.whitespace
        for word in string.split(data):
            if atbreak:
                self.file.write(" ")
            if self.tag in ("h1", "h2", "h3"):
                word = string.upper(word)
            if self.bold:
                word = "*" + word + "*"
            if self.italic:
                word = "_" + word + "_"
            self.file.write(word)
            atbreak = 1
        self.atbreak = data[-1] in string.whitespace

w = Writer()
f = formatter.AbstractFormatter(w)

file = open("samples/sample.htm")

# print html body as plain text
p = htmllib.HTMLParser(f)
p.feed(file.read())
p.close()

_A_ _CHAPTER._

Some text. Some more text. Some *emphasized* text. A link[1].

The ConfigParser Module

The ConfigParser module reads configuration files.

The files should be written in a format similar to Windows INI files. The file contains one or more sections, separated by section names written in brackets. Each section can contain one or more configuration items.

Here's the sample file used in Example 5-16:

[book]
title: The Python Standard Library
author: Fredrik Lundh
email: fredrik@pythonware.com
version: 2.0-001115

[ematter]
pages: 250

[hardcopy]
pages: 350

Example 5-16 uses the ConfigParser module to read the sample configuration file.

Example 5.16. Using the ConfigParser Module

File: configparser-example-1.py

import ConfigParser
import string

config = ConfigParser.ConfigParser()

config.read("samples/sample.ini")

# print summary
print
print string.upper(config.get("book", "title"))
print "by", config.get("book", "author"),
print  "(" + config.get("book", "email") + ")"
print
print config.get("ematter", "pages"), "pages"
print

# dump entire config file
for section in config.sections():
    print section
    for option in config.options(section):
        print " ", option, "=", config.get(section, option)

THE PYTHON STANDARD LIBRARY
by Fredrik Lundh (fredrik@pythonware.com)

250 pages

book
  title = The Python Standard Library
  email = fredrik@pythonware.com
  author = Fredrik Lundh
  version = 2.0-001115
  __name__ = book
ematter
  __name__ = ematter
  pages = 250
hardcopy
  __name__ = hardcopy
  pages = 350

In Python 2.0, the ConfigParser module also allows you to write configuration data to a file, as Example 5-17 shows.

Example 5.17. Using the ConfigParser Module to Write Configuration Data

File: configparser-example-2.py

import ConfigParser
import sys

config = ConfigParser.ConfigParser()

# set a number of parameters
config.add_section("book")
config.set("book", "title", "the python standard library")
config.set("book", "author", "fredrik lundh")

config.add_section("ematter")
config.set("ematter", "pages", 250)

# write to screen
config.write(sys.stdout)

[book]
title = the python standard library
author = fredrik lundh

[ematter]
pages = 250

The netrc Module

The netrc module parses .netrc configuration files, as shown in Example 5-18. Such files are used to store FTP usernames and passwords in a user's home directory (don't forget to configure things so that the file can only be read by the user: "chmod 0600 ~/.netrc," in other words).

Example 5.18. Using the netrc Module

File: netrc-example-1.py

import netrc

# default is $HOME/.netrc
info = netrc.netrc("samples/sample.netrc")

login, account, password = info.authenticators("secret.fbi")
print "login", "=>", repr(login)
print "account", "=>", repr(account)
print "password", "=>", repr(password)

login => 'mulder'
account => None
password => 'trustno1'

The shlex Module

The shlex module provides a simple lexer (also known as tokenizer) for languages based on the Unix shell syntax. Its use is demonstrated in Example 5-19.

Example 5.19. Using the shlex Module

File: shlex-example-1.py

import shlex

lexer = shlex.shlex(open("samples/sample.netrc", "r"))
lexer.wordchars = lexer.wordchars + "._"

while 1:
    token = lexer.get_token()
    if not token:
        break
    print repr(token)

'machine'
'secret.fbi'
'login'
'mulder'
'password'
'trustno1'
'machine'
'non.secret.fbi'
'login'
'scully'
'password'
'noway'

The zipfile Module

(New in 2.0) The zipfile module allows you to read and write files in the popular ZIP archive format.

Listing the Contents

To list the contents of an existing archive, you can use the namelist and infolist methods used in Example 5-20. The former returns a list of filenames, and the latter returns a list of ZipInfo instances.

Example 5.20. Using the zipfile Module to List Files in a ZIP File

File: zipfile-example-1.py

import zipfile

file = zipfile.ZipFile("samples/sample.zip", "r")

# list filenames
for name in file.namelist():
    print name,
print

# list file information
for info in file.infolist():
    print info.filename, info.date_time, info.file_size

sample.txt sample.jpg
sample.txt (1999, 9, 11, 20, 11, 8) 302
sample.jpg (1999, 9, 18, 16, 9, 44) 4762

Reading Data from a ZIP File

To read data from an archive, simply use the read method used in Example 5-21. It takes a filename as an argument and returns the data as a string.

Example 5.21. Using the zipfile Module to Read Data from a ZIP File

File: zipfile-example-2.py

import zipfile

file = zipfile.ZipFile("samples/sample.zip", "r")

for name in file.namelist():
    data = file.read(name)
    print name, len(data), repr(data[:10])

sample.txt 302 'We will pe'
sample.jpg 4762 '\377\330\377\340\000\020JFIF'

Writing Data to a ZIP File

Adding files to an archive is easy. Just pass the filename, and the name you want that file to have in the archive, to the write method.

The script in Example 5-22 creates a ZIP file containing all files in the samples directory.

Example 5.22. Using the zipfile Module to Store Files in a ZIP File

File: zipfile-example-3.py

import zipfile
import glob, os

# open the zip file for writing, and write stuff to it

file = zipfile.ZipFile("test.zip", "w")

for name in glob.glob("samples/*"):
    file.write(name, os.path.basename(name), zipfile.ZIP_DEFLATED)

file.close()

# open the file again, to see what's in it

file = zipfile.ZipFile("test.zip", "r")
for info in file.infolist():
    print info.filename, info.date_time, info.file_size, info.compress_size

sample.wav (1999, 8, 15, 21, 26, 46) 13260 10985
sample.jpg (1999, 9, 18, 16, 9, 44) 4762 4626
sample.au (1999, 7, 18, 20, 57, 34) 1676 1103
...

The third, optional argument to the write method controls what compression method to use or, rather, it controls whether data should be compressed at all. The default is zipfile.ZIP_STORED, which stores the data in the archive without any compression at all. If the zlib module is installed, you can also use zipfile.ZIP_DEFLATED, which gives you "deflate" compression.

The zipfile module also allows you to add strings to the archive. However, adding data from a string is a bit tricky; instead of just passing in the archive name and the data, you have to create a ZipInfo instance and configure it correctly. Example 5-23 offers a simple solution.

Example 5.23. Using the zipfile Module to Store Strings in a ZIP File

File: zipfile-example-4.py

import zipfile
import glob, os, time

file = zipfile.ZipFile("test.zip", "w")

now = time.localtime(time.time())[:6]

for name in ("life", "of", "brian"):
    info = zipfile.ZipInfo(name)
    info.date_time = now
    info.compress_type = zipfile.ZIP_DEFLATED
    file.writestr(info, name*1000)

file.close()

# open the file again, to see what's in it

file = zipfile.ZipFile("test.zip", "r")

for info in file.infolist():
    print info.filename, info.date_time, info.file_size, info.compress_size

life (2000, 12, 1, 0, 12, 1) 4000 26
of (2000, 12, 1, 0, 12, 1) 2000 18
brian (2000, 12, 1, 0, 12, 1) 5000 31

The gzip Module

The gzip module allows you to read and write gzip-compressed files as if they were ordinary files, as shown in Example 5-24.

Example 5.24. Using the gzip Module to Read a Compressed File

File: gzip-example-1.py

import gzip

file = gzip.GzipFile("samples/sample.gz")

print file.read()

Well it certainly looks as though we're in for
a splendid afternoon's sport in this the 127th
Upperclass Twit of the Year Show.

The standard implementation doesn't support the seek and tell methods. Example 5-25 shows how to add forward seeking.

Example 5.25. Extending the gzip Module to Support seek/tell

File: gzip-example-2.py

import gzip

class gzipFile(gzip.GzipFile):
    # adds seek/tell support to GzipFile

    offset = 0

    def read(self, size=None):
        data = gzip.GzipFile.read(self, size)
        self.offset = self.offset + len(data)
        return data

    def seek(self, offset, whence=0):
        # figure out new position (we can only seek forwards)
        if whence == 0:
            position = offset
        elif whence == 1:
            position = self.offset + offset
        else:
            raise IOError, "Illegal argument"
        if position < self.offset:
            raise IOError, "Cannot seek backwards"

        # skip forward, in 16k blocks
        while position > self.offset:
            if not self.read(min(position - self.offset, 16384)):
                break

    def tell(self):
        return self.offset

#
# try it

file = gzipFile("samples/sample.gz")
file.seek(80)

print file.read()

this the 127th
Upperclass Twit of the Year Show.

Back to: Python Standard Library


O'Reilly Home | O'Reilly Bookstores | How to Order | O'Reilly Contacts
International | About O'Reilly | Affiliated Companies

© 2001, O'Reilly & Associates, Inc.
webmaster@oreilly.com