Serializing Data Using the marshal Module

Credit: Luther Blissett

Problem

You have a Python data structure composed of only fundamental Python objects (e.g., lists, tuples, numbers, and strings, but no classes, instances, etc.), and you want to serialize it and reconstruct it later as fast as possible.

Solution

If you know that your data is composed entirely of fundamental Python objects, the lowest-level, fastest approach to serializing it (i.e., turning it into a string of bytes and later reconstructing it from such a string) is via the marshal module. Suppose that data is composed of only elementary Python data types. For example:

data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}

You can serialize data to a byte string at top speed as follows:

import marshal
bytes = marshal.dumps(data)

You can now sling bytes around as you wish (e.g., send it across a network, put it as a BLOB in a database, etc.), as long as you keep its arbitrary binary bytes intact. Then you can reconstruct the data any time you’d like:

redata = marshal.loads(bytes)

This reconstructs a data structure that compares equal (==) to data. In other words, the order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful, and thus it is preserved. Note that loads works independently of machine architecture, but you must guarantee that it is used by the same release of Python under which bytes was originally generated via dumps.

When you specifically want to write the data to a disk file, as long as the latter is open for binary (not the default text mode) input/output, you can also use the dump function of the marshal module, which lets you dump several data structures one after the other:

ouf = open('datafile.dat', 'wb')
marshal.dump(data, ouf)
marshal.dump('some string', ouf)
marshal.dump(range(19), ouf)
ouf.close(  )

When you have done this, you can recover from datafile.dat the same data structures you dumped into it, in the same sequence:

inf = open('datafile.dat', 'rb')
a = marshal.load(inf)
b = marshal.load(inf)
c = marshal.load(inf)
inf.close(  )

Discussion

Python offers several ways to serialize data (i.e., make the data into a string of bytes that you can save on disk, in a database, send across the network, and so on) and corresponding ways to reconstruct the data from such serialized forms. The lowest-level approach is to use the marshal module, which Python uses to write its bytecode files. marshal supports only elementary data types (e.g., dictionaries, lists, tuples, numbers, and strings) and combinations thereof. marshal does not guarantee compatibility from one Python release to another, so data serialized with marshal may not be readable if you upgrade your Python release. However, it does guarantee independence from a specific machine’s architecture, so it is guaranteed to work if you’re sending serialized data between different machines, as long as they are all running the same version of Python—similar to how you can share compiled Python bytecode files in such a distributed setting.

marshal’s dumps function accepts any Python data structure and returns a byte string representing it. You can pass that byte string to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In between the dumps and loads calls, you can subject the byte string to any procedure you wish, such as sending it over the network, storing it into a database and retrieving it, or encrypting it and decrypting it. As long as the string’s binary structure is correctly restored, loads will work fine on it (again, as long as it is under the same Python release with which you originally executed dumps).

When you specifically need to save the data to a file, you can also use marshal’s dump function, which takes two arguments: the data structure you’re dumping and the open file object. Note that the file must be opened for binary I/O (not the default, which is text I/O) and can’t be a file-like object, as marshal is quite picky about it being a true file. The advantage of dump is that you can perform several calls to dump with various data structures and the same open file object: each data structure is then dumped together with information about how long the dumped byte string is. As a consequence, when you later open the file for binary reading and then call marshal.load, passing the file as the argument, each previously dumped data structure is reloaded sequentially. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped.

Python Cookbook by

Serializing Data Using the marshal Module

Problem

Solution

Discussion

See Also

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly