Chapter 4. Data Structures and Data Types
Now that you’ve been properly introduced, it’s time to focus on how Polars works.
Data comes in many shapes and sizes, and it all needs to be stored in the right structures before you can work with it. To accommodate all the data you’ll be working with, Polars implements the Arrow memory specification, which provides a vast array of data types.
In this chapter you’ll learn about:
- The structures Polars uses to store data
- The different data types that are available
- Some of the data types that aren’t so straightforward
Let’s start the beautiful journey of learning about Polars.
Series, DataFrames, and LazyFrames
Polars stores all of its data in a Series or a DataFrame.
A Series is a one-dimensional data structure that holds a sequence of values. All values in a Series have the same data type, like integers, floats, or Strings. Series can exist on their own, but they’re most commonly used as columns in a DataFrame.
An example of a Series is the following:
```python
import polars as pl

sales_series = pl.Series("sales", [150.00, 300.00, 250.00])
sales_series
```
```
shape: (3,)
Series: 'sales' [f64]
[
    150.0
    300.0
    250.0
]
```
A DataFrame is a two-dimensional data structure that organizes data in a table format, with rows and columns. Internally, it’s represented as a collection of Series, each with the same length. To dive deeper into the inner workings of Series and DataFrames, refer to Chapter 18.
Here’s an example of a DataFrame that incorporates the Series you just made:
```python
sales_df = pl.DataFrame(
    {
        "sales": sales_series,
        "customer_id": [24, 25, 26],
    }
)
sales_df
```
```
shape: (3, 2)
┌───────┬─────────────┐
│ sales │ customer_id │
│ ---   │ ---         │
│ f64   │ i64         │
╞═══════╪═════════════╡
│ 150.0 │ 24          │
│ 300.0 │ 25          │
│ 250.0 │ 26          │
└───────┴─────────────┘
```
A LazyFrame resembles a DataFrame but holds no data1. While a DataFrame stores data directly in memory, a LazyFrame contains only instructions for reading and processing data. None of the read operations or transformations applied to a LazyFrame are executed immediately; instead, they are deferred until needed, hence the term “lazy” evaluation. Until evaluation, a LazyFrame remains a blueprint for generating a DataFrame—a query graph representing the computational steps. This query graph enables the optimizer to refine and optimize the planned computations, ensuring efficient execution when finally evaluated.
Here’s an example of a LazyFrame:
```python
lazy_df = pl.scan_csv("data/fruit.csv").with_columns(
    is_heavy=pl.col("weight") > 200
)
lazy_df.show_graph()
```
This blueprint turns into a DataFrame once you execute it with a method like lf.collect().
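For example, collecting the LazyFrame defined above would look like the following minimal sketch; it assumes that data/fruit.csv exists and contains a weight column, as in the scan above:

```python
fruit_df = lazy_df.collect()  # runs the optimized query plan
fruit_df                      # the result is a regular DataFrame held in memory
```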
We will dive deeper into the usage of the eager and lazy APIs in Chapter 5.
Data Types
To store data efficiently, Polars implements the Apache Arrow memory specification. In Chapter 18 you can read more about what Arrow is and how it works. In short, Arrow is a columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. This means Polars stores your data in a way that allows for optimal performance when processing it.
Polars has implemented the data types shown in Table 4-1. Most of these are based on the data types defined by the Arrow specification.2 Some data types come in multiple bit sizes, which lets you store data with a smaller memory footprint as long as the values fit within the type’s range.
Table 4-1. The data types that Polars implements

| Group | Type | Details | Range |
|---|---|---|---|
| | DataType | Base class for all Polars data types. | |
| Numeric | Decimal | Decimal 128-bit type with an optional precision and non-negative scale. | Can exactly represent 38 significant digits |
| | Float32 | 32-bit floating point type. | -3.4e+38 to 3.4e+38 |
| | Float64 | 64-bit floating point type. | -1.7e+308 to 1.7e+308 |
| | Int8 | 8-bit signed integer type. | -128 to 127 |
| | Int16 | 16-bit signed integer type. | -32,768 to 32,767 |
| | Int32 | 32-bit signed integer type. | -2,147,483,648 to 2,147,483,647 |
| | Int64 | 64-bit signed integer type. | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| | UInt8 | 8-bit unsigned integer type. | 0 to 255 |
| | UInt16 | 16-bit unsigned integer type. | 0 to 65,535 |
| | UInt32 | 32-bit unsigned integer type. | 0 to 4,294,967,295 |
| | UInt64 | 64-bit unsigned integer type. | 0 to 18,446,744,073,709,551,615 |
| Temporal | Date | Calendar date type. Uses the Arrow date32 data type, which represents the number of days since UNIX epoch 1970-01-01 as int32. | -5877641-06-24 to 5879610-09-09 |
| | Datetime | Calendar date and time type. Exact timestamp encoded as int64 since UNIX epoch. Default unit: microseconds. | |
| | Duration | Time duration/delta type. | |
| | Time | Time of day type. | |
| Nested | Array | Fixed-length list type. | |
| | List | Variable-length list type. | |
| | Struct | Struct type. | |
| String | String | UTF-8 encoded string type of variable length. | |
| | Categorical | A categorical encoding of a set of Strings. Allows for more efficient memory usage if a Series contains few unique Strings. | |
| | Enum | A categorical encoding of a set of Strings that is fixed. The categories must be known and defined beforehand. | |
| Other | Boolean | Boolean type taking 1 bit of space. | True or False |
| | Binary | Binary type with variable-length bytes. | |
| | Null | Type representing Null / None values. | |
| | Object | Type for wrapping arbitrary Python objects. | |
| | Unknownᵃ | Type representing DataType values that could not be determined statically. | |

ᵃ The documentation lists the Unknown data type. This data type is only used internally as a placeholder and should not be used in your code.
Object Stowaways
Sometimes you need to add arbitrary Python objects to a DataFrame. For example, you may want to store multiple machine-learning models in a column. In this case you can use the Object data type.
The downside is that this data cannot be processed using the normal functions. Moreover, none of the optimizations are used, because Polars would have to go through Python to inspect what the data represents. As a result, an Object column can be seen as a passenger in the DataFrame: it is passed along in, say, join operations, but it does not take part in optimized calculations.
Using Objects is generally discouraged when the data can be represented by another data type, but there can be use cases for it.
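As a sketch of what this might look like, you could stow arbitrary Python objects in an Object column like this. The TinyModel class and the column names are made up for illustration:

```python
import polars as pl


class TinyModel:
    """A stand-in for an arbitrary Python object, such as a fitted model."""

    def __init__(self, name: str):
        self.name = name


models_df = pl.DataFrame(
    {
        "model_name": ["baseline", "tuned"],
        "model": pl.Series(
            [TinyModel("baseline"), TinyModel("tuned")], dtype=pl.Object
        ),
    }
)
models_df.schema  # the "model" column has the Object data type
```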
Nested Data Types
Polars has three nested data types: Array, List, and Struct. These data types enable Polars to manage complex data structures efficiently within a DataFrame. The Array type represents fixed-size collections where each element holds the same data type, commonly used for compact storage and predictable indexing. The List type is more flexible, allowing variable-length collections within each row. Lastly, the Struct type lets users store and access related fields as a single entity, encapsulating multiple named fields in a column.
An Array is a collection of elements that are of the same data type. Within a Series, each Array must have the same shape. The shape can be of any dimension. For example, to store the pixels of RGB images with a size of 640 by 480, you would use three dimensions. You can specify the inner data type and the shape of an Array column as follows:
```python
coordinates = pl.DataFrame(
    [
        pl.Series("point_2d", [[1, 3], [2, 5]]),
        pl.Series("point_3d", [[1, 7, 3], [8, 1, 0]]),
    ],
    schema={
        "point_2d": pl.Array(shape=2, inner=pl.Int64),
        "point_3d": pl.Array(shape=3, inner=pl.Int64),
    },
)
coordinates
```
```
shape: (2, 2)
┌───────────────┬───────────────┐
│ point_2d      │ point_3d      │
│ ---           │ ---           │
│ array[i64, 2] │ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        │ [1, 7, 3]     │
│ [2, 5]        │ [8, 1, 0]     │
└───────────────┴───────────────┘
```
A List is comparable to an Array in that it is a collection of elements of the same data type. In contrast to the Array, however, a List does not have to have the same length in every row. Note that it is different from the Python list, which can contain values of multiple data types. (It is possible to store Python lists in a Series by making the data type Object.) The only argument List takes is the data type it contains.
Here’s how you can create a DataFrame with two List columns. Because we’re not specifying a schema, like we did in the previous example, the inner data types are inferred from the data:
```python
weather_readings = pl.DataFrame(
    {
        "temperature": [[72.5, 75.0, 77.3], [68.0, 70.2]],
        "wind_speed": [[15, 20], [10, 12, 14, 16]],
    }
)
weather_readings
```
```
shape: (2, 2)
┌────────────────────┬────────────────┐
│ temperature        │ wind_speed     │
│ ---                │ ---            │
│ list[f64]          │ list[i64]      │
╞════════════════════╪════════════════╡
│ [72.5, 75.0, 77.3] │ [15, 20]       │
│ [68.0, 70.2]       │ [10, 12, … 16] │
└────────────────────┴────────────────┘
```
Lastly, there’s the Struct data type. A Struct is often used to work with multiple Series at once. Here’s an example that shows how Structs can be created using Python dictionaries:
```python
rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
rating_series
```
```
shape: (2,)
Series: 'ratings' [struct[3]]
[
    {"Cars","NE",4.5}
    {"Toy Story","ME",4.9}
]
```
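As a quick taste, here’s a small sketch of pulling data back out of that Struct Series using the struct namespace: struct.field() extracts a single field, and struct.unnest() spreads all fields into the columns of a DataFrame.

```python
rating_series.struct.field("Movie")  # Series containing the movie names
rating_series.struct.unnest()        # DataFrame with Movie, Theatre, and Avg_Rating columns
```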
We discuss working with List, Array, and Struct data types in more detail in Chapter 12.
Missing Values
In Polars, missing data is always represented as null. This holds for all data types, including the numerical ones.3 Information about missing values is stored in the metadata of the Series. Additionally, whether each value is missing is tracked in the Series’ validity bitmap, in which a bit is set to 1 if the value is present and 0 if it is missing. This lets you cheaply check how many values are missing in a Series, using methods like df.null_count() and Expr.is_null().
To demonstrate this, we’ll create a DataFrame with some missing values:
```python
missing_df = pl.DataFrame(
    {
        "value": [None, 2, 3, 4, None, None, 7, 8, 9, None],
    },
)
missing_df
```
```
shape: (10, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ null  │
│ 2     │
│ 3     │
│ 4     │
│ null  │
│ null  │
│ 7     │
│ 8     │
│ 9     │
│ null  │
└───────┘
```
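Before filling anything in, you can inspect the missing values with the methods mentioned earlier. A quick sketch:

```python
missing_df.null_count()                                      # number of nulls per column: 4
missing_df.with_columns(missing=pl.col("value").is_null())   # flags the rows that are missing
```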
You can fill in missing data using the Expr.fill_null() method, which you can call in four ways:
- Using a single value
- Using a fill strategy
- Using an expression
- Using an interpolation
Not A Number But Not Missing Either
NaN (meaning “not a number”) values are not considered missing data in Polars. These values are used by the Float data types to represent the result of an operation that is not a number. Consequently, NaN values are not counted as null values in methods like df.null_count() or Expr.fill_null(). As an alternative, use Expr.is_nan() and Expr.fill_nan() to work with these values.
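The following sketch contrasts the two; the nan_df name and values are made up for illustration. is_null() only flags the None entry, while is_nan() and fill_nan() only act on the NaN:

```python
nan_df = pl.DataFrame({"x": [1.0, float("nan"), None]})
nan_df.select(
    pl.col("x").is_null().alias("is_null"),         # True only for the None entry
    pl.col("x").is_nan().alias("is_nan"),           # True for NaN, null for the None entry
    pl.col("x").fill_nan(0.0).alias("nan_filled"),  # NaN becomes 0.0; the null stays null
)
```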
The following example shows how you can fill with a single value:
```python
missing_df.with_columns(filled_with_single=pl.col("value").fill_null(-1))
```
```
shape: (10, 2)
┌───────┬────────────────────┐
│ value │ filled_with_single │
│ ---   │ ---                │
│ i64   │ i64                │
╞═══════╪════════════════════╡
│ null  │ -1                 │
│ 2     │ 2                  │
│ 3     │ 3                  │
│ 4     │ 4                  │
│ null  │ -1                 │
│ null  │ -1                 │
│ 7     │ 7                  │
│ 8     │ 8                  │
│ 9     │ 9                  │
│ null  │ -1                 │
└───────┴────────────────────┘
```
The second way is to use a fill strategy, which lets you pick an imputation strategy from the following list:
- forward: Fill with the previous non-null value.
- backward: Fill with the next non-null value.
- min: Fill with the minimum value of the Series.
- max: Fill with the maximum value of the Series.
- mean: Fill with the mean of the Series. Note that this mean is cast to the data type of the Series, which in the case of an integer means the part after the decimal mark is cut off.
- zero: Fill with 0.
- one: Fill with 1.
In the example below you’ll see all of these strategies next to each other:
```python
missing_df.with_columns(
    forward=pl.col("value").fill_null(strategy="forward"),
    backward=pl.col("value").fill_null(strategy="backward"),
    min=pl.col("value").fill_null(strategy="min"),
    max=pl.col("value").fill_null(strategy="max"),
    mean=pl.col("value").fill_null(strategy="mean"),
    zero=pl.col("value").fill_null(strategy="zero"),
    one=pl.col("value").fill_null(strategy="one"),
)
```
```
shape: (10, 8)
┌───────┬─────────┬──────────┬─────┬─────┬──────┬──────┬─────┐
│ value │ forward │ backward │ min │ max │ mean │ zero │ one │
│ ---   │ ---     │ ---      │ --- │ --- │ ---  │ ---  │ --- │
│ i64   │ i64     │ i64      │ i64 │ i64 │ i64  │ i64  │ i64 │
╞═══════╪═════════╪══════════╪═════╪═════╪══════╪══════╪═════╡
│ null  │ null    │ 2        │ 2   │ 9   │ 5    │ 0    │ 1   │
│ 2     │ 2       │ 2        │ 2   │ 2   │ 2    │ 2    │ 2   │
│ 3     │ 3       │ 3        │ 3   │ 3   │ 3    │ 3    │ 3   │
│ 4     │ 4       │ 4        │ 4   │ 4   │ 4    │ 4    │ 4   │
│ null  │ 4       │ 7        │ 2   │ 9   │ 5    │ 0    │ 1   │
│ null  │ 4       │ 7        │ 2   │ 9   │ 5    │ 0    │ 1   │
│ 7     │ 7       │ 7        │ 7   │ 7   │ 7    │ 7    │ 7   │
│ 8     │ 8       │ 8        │ 8   │ 8   │ 8    │ 8    │ 8   │
│ 9     │ 9       │ 9        │ 9   │ 9   │ 9    │ 9    │ 9   │
│ null  │ 9       │ null     │ 2   │ 9   │ 5    │ 0    │ 1   │
└───────┴─────────┴──────────┴─────┴─────┴──────┴──────┴─────┘
```
The third way of filling null values is with an expression like pl.col("value").mean(). Expressions won’t be fully explained until Chapter 7, but we wanted to at least show an example of how this would work:
```python
missing_df.with_columns(
    expression_mean=pl.col("value").fill_null(pl.col("value").mean())
)
```
```
shape: (10, 2)
┌───────┬─────────────────┐
│ value │ expression_mean │
│ ---   │ ---             │
│ i64   │ f64             │
╞═══════╪═════════════════╡
│ null  │ 5.5             │
│ 2     │ 2.0             │
│ 3     │ 3.0             │
│ 4     │ 4.0             │
│ null  │ 5.5             │
│ null  │ 5.5             │
│ 7     │ 7.0             │
│ 8     │ 8.0             │
│ 9     │ 9.0             │
│ null  │ 5.5             │
└───────┴─────────────────┘
```
We showcase more ways of filling null values using expressions in Chapter 8.
The fourth and final way of filling nulls is with an interpolation method like df.interpolate():
```python
missing_df.interpolate()
```
```
shape: (10, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ null  │
│ 2.0   │
│ 3.0   │
│ 4.0   │
│ 5.0   │
│ 6.0   │
│ 7.0   │
│ 8.0   │
│ 9.0   │
│ null  │
└───────┘
```
Data Type Conversion
There are situations where you need to change the data type of a column or Series. For example, you’ve just read a CSV file, and there’s a column that is incorrectly inferred as a String and should be numeric. For this, you can use either the Expr.cast() or the df.cast() method. The Expr.cast() method changes the data type of one column (technically, an expression) to the one provided as an argument. Here’s an example that demonstrates why having the right data type matters:
```python
string_df = pl.DataFrame({"id": ["10000", "20000", "30000"]})
print(string_df)
print(f"Estimated size: {string_df.estimated_size('b')} bytes")
```
```
shape: (3, 1)
┌───────┐
│ id    │
│ ---   │
│ str   │
╞═══════╡
│ 10000 │
│ 20000 │
│ 30000 │
└───────┘
Estimated size: 15 bytes
```
However, you know that this column only contains numeric values, which can be stored more efficiently. Changing the data type would look like this:
```python
int_df = string_df.select(pl.col("id").cast(pl.UInt16))
print(int_df)
print(f"Estimated size: {int_df.estimated_size('b')} bytes")
```
```
shape: (3, 1)
┌───────┐
│ id    │
│ ---   │
│ u16   │
╞═══════╡
│ 10000 │
│ 20000 │
│ 30000 │
└───────┘
Estimated size: 6 bytes
```
We just reduced the memory used by 60%. Using the optimal data types can provide significant performance advantages.
Table 4-1 shows the ranges of each data type, where applicable. Memory usage can be optimized by casting to the smallest data type that still fits the data.
In the example above you used the Expr.cast() method for expressions. You can also use the df.cast() method on a DataFrame. In that case, you can cast multiple Series at once by specifying either a single data type or a dictionary that maps columns to data types. The keys of that dictionary can be column names, data types, or column selectors. Here are the ways to use the df.cast() method, starting with casting everything to one data type:
```python
data_types_df = pl.DataFrame(
    {
        "id": [10000, 20000, 30000],
        "value": [1.0, 2.0, 3.0],
        "value2": ["1", "2", "3"],
    }
)
data_types_df.cast(pl.UInt16)
```
```
shape: (3, 3)
┌───────┬───────┬────────┐
│ id    │ value │ value2 │
│ ---   │ ---   │ ---    │
│ u16   │ u16   │ u16    │
╞═══════╪═══════╪════════╡
│ 10000 │ 1     │ 1      │
│ 20000 │ 2     │ 2      │
│ 30000 │ 3     │ 3      │
└───────┴───────┴────────┘
```
Or with a dictionary, to cast certain Series differently:
```python
data_types_df.cast({"id": pl.UInt16, "value": pl.Float32, "value2": pl.UInt8})
```
```
shape: (3, 3)
┌───────┬───────┬────────┐
│ id    │ value │ value2 │
│ ---   │ ---   │ ---    │
│ u16   │ f32   │ u8     │
╞═══════╪═══════╪════════╡
│ 10000 │ 1.0   │ 1      │
│ 20000 │ 2.0   │ 2      │
│ 30000 │ 3.0   │ 3      │
└───────┴───────┴────────┘
```
You can also cast specific data types to others, as follows. Let’s cast all Float64 values to Float32, and all String values to UInt8:
```python
data_types_df.cast({pl.Float64: pl.Float32, pl.String: pl.UInt8})
```
```
shape: (3, 3)
┌───────┬───────┬────────┐
│ id    │ value │ value2 │
│ ---   │ ---   │ ---    │
│ i64   │ f32   │ u8     │
╞═══════╪═══════╪════════╡
│ 10000 │ 1.0   │ 1      │
│ 20000 │ 2.0   │ 2      │
│ 30000 │ 3.0   │ 3      │
└───────┴───────┴────────┘
```
Lastly, you can use column selectors:
```python
import polars.selectors as cs

data_types_df.cast({cs.numeric(): pl.UInt16})
```
```
shape: (3, 3)
┌───────┬───────┬────────┐
│ id    │ value │ value2 │
│ ---   │ ---   │ ---    │
│ u16   │ u16   │ str    │
╞═══════╪═══════╪════════╡
│ 10000 │ 1     │ 1      │
│ 20000 │ 2     │ 2      │
│ 30000 │ 3     │ 3      │
└───────┴───────┴────────┘
```
We’ll explore the column selectors in more detail in Chapter 10.
Basic casting doesn’t always magically work. In some cases, special methods need to be used because the data cannot be parsed without extra knowledge. One example is parsing a Datetime from a String. In Chapter 12 you’ll read about methods that allow for this more advanced casting, along the lines of the sketch below.
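As a small preview, such a conversion typically goes through the str namespace rather than a plain cast. The column name and format string here are made up for illustration:

```python
dates_df = pl.DataFrame({"ordered_at": ["2024-01-15 09:30", "2024-02-03 14:45"]})
dates_df.with_columns(
    pl.col("ordered_at").str.to_datetime("%Y-%m-%d %H:%M")  # String -> Datetime
)
```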
Takeaways
In this chapter you learned about:
- The structures Polars provides for working with data: Series, DataFrames, and LazyFrames.
- The different data types Polars offers for data storage.
- Some data types that offer their own special operations, such as textual, nested, and temporal data types. We’ll dive deeper into these specifics in Chapter 12.
- The way missing data is handled in Polars.
- Changing data types using the Expr.cast() and df.cast() methods.
With this knowledge you can start filling your DataFrames. In the next chapter you’ll dive into the different APIs Polars offers to work with this data.
1 A LazyFrame can hold data when you turn a DataFrame lazy using df.lazy(). In this case the source DataFrame is stored in the LazyFrame itself. Besides that, a LazyFrame can contain metadata about the source data to allow the optimizer to perform its magic.
2 Polars sometimes deviates from the Arrow specification. For instance, Polars has implemented its own String data type for additional performance gains. Arrow also doesn’t have the Object and Unknown data types. See https://arrow.apache.org/docs/python/api/datatypes.html.
3 Except the Null data type itself, which cannot be missing.