Chapter 4. Object-Oriented Programming and Functional Programming
In this chapter, I want to introduce you to two styles of programming that you’ll likely encounter in your data science career: object-oriented programming (OOP) and functional programming (FP). It’s extremely helpful to have an awareness of both. Even if you don’t ever write code in either of these styles, you’ll encounter packages that use one or other of them extensively. These include standard Python data science packages such as pandas and Matplotlib. I’d like to equip you with an understanding of OOP and FP so that you can use the code you encounter more effectively.
OOP and FP are programming paradigms based on underlying computer science principles. Some programming languages support only one of them or strongly favor one over the other. For example, Java is an object-oriented language. Python supports both. OOP is more popular as an overall style in Python, but you’ll also see the occasional use of FP.
These styles also give you a framework for ways to break down your code. When you’re writing code, you could just write everything you want to do as one single long script. This would still run just fine, but it’s hard to maintain and debug. As discussed in Chapter 1, it’s important to break code down into smaller chunks, and both OOP and FP can suggest good ways to do this.
In my code, I don’t stick strictly to the principles of either functional or object-oriented programming. I sometimes define my own classes following OOP principles, and occasionally I write functions that conform to FP principles. Most modern Python programs occupy a middle ground combining both paradigms. In this chapter, I’ll give you an overview of both styles so that you gain an understanding of the basics of both.
Object-Oriented Programming
Object-oriented programming is very common in Python. But what is an “object” in this context? You can think of an object as a “thing” that can be described by a noun. In data science code some common objects could be a pandas DataFrame, a NumPy array, a Matplotlib figure, or a scikit-learn estimator.
An object can hold data, it has some actions associated with it, and it can interact with other objects. For example, a pandas DataFrame object contains a list of column names. One action associated with a DataFrame object is renaming the columns. The DataFrame can interact with a pandas Series object by adding that series as a new column.
You can also think about an object as a custom data structure. You design it to hold the data you want so that you can do something with it later. Taking a pandas DataFrame as an example again, the designers of pandas came up with a structure that could hold data in a tabular format. You can then access the data in rows and columns and operate on the data in those forms.
In the next section, I’ll introduce the main terminology in OOP and show some examples of how you may already be using it.
Classes, Methods, and Attributes
Classes, methods, and attributes are important terms that you’ll encounter in OOP. Here’s an overview of each:
- A class defines an object, and you can think of it as a blueprint for making more objects of that variety. An individual object is an instance of that class, and each object is an individual “thing.”
- Methods are actions that you can perform on objects of that class. They define the behavior of that object and may modify its attributes.
- Attributes are variables that hold some property of that class, and each object can have different data stored in those attributes.
That’s all very abstract, so I’ll give you a more concrete example. Here’s a way that object-oriented terminology could be adapted to the real world. The book you’re currently reading, Software Engineering for Data Scientists, is an object of the class “Book.” One of the attributes of this object is its number of pages and another is the name of the author. A method you could call on this object is to “read” it. There are many instances of the “Book” class, but they all have a certain number of pages, and they can all be read.
In Python, a class is usually named using CamelCase, so you would name a class MyClass rather than my_class. This convention helps you identify classes more easily. You can look up an attribute using the format class_instance.attribute. You can call a method using class_instance.method() (note that this includes parentheses). Methods may take arguments, but attributes cannot.
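To make the naming and access conventions concrete, here is a minimal sketch based on the book analogy above. The Book class, its attribute names, and the example values are all hypothetical and exist only for illustration.

```python
class Book:
    """A hypothetical class mirroring the book analogy (illustration only)."""

    def __init__(self, title, author, n_pages):
        self.title = title
        self.author = author
        self.n_pages = n_pages

    def read(self, start_page=1):
        """Return a short message describing the action."""
        return f"Reading '{self.title}' from page {start_page}"


# Made-up values; any title, author, and page count would work.
my_book = Book(title="An Example Title", author="A. Writer", n_pages=250)

print(my_book.n_pages)              # attribute lookup: no parentheses
print(my_book.read(start_page=10))  # method call: parentheses, optional arguments
```

The class name uses CamelCase, while the instance my_book uses the lowercase style that Python code normally uses for variables.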
For example, let’s consider a pandas DataFrame. You’re likely to be familiar with the syntax for creating a new DataFrame:
    import pandas as pd

    my_dict = {"column_1": [1, 2], "column_2": ["a", "b"]}
    df = pd.DataFrame(data=my_dict)
Looking at this from an object-oriented perspective, when you run the line df = pd.DataFrame(data=my_dict) you’ve initialized a new object of type DataFrame, and you’ve passed in some data that will be used to set up the attributes of that DataFrame.
You can look up some of the attributes of that DataFrame, like so:
    df.columns
    df.shape
.columns and .shape are attributes of the df object.
And you can call many methods on that DataFrame object, for example:
    df.to_csv("file_path", index=False)
.to_csv() is the method in this example.
Another familiar example of creating a new object and calling a method comes from scikit-learn. If you’re training a machine learning model on two arrays, with X_train containing the training features and y_train containing the training labels, you’d write some code like this:
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()
    clf.fit(X_train, y_train)
In this example, you’re initializing a new LogisticRegression classifier object and calling the .fit() method on it.
Here’s another example. This is the code that creates Figure 2-3 in Chapter 2. Two objects are created here, a Matplotlib figure object and a Matplotlib axes object. Several methods are then called to do various operations to those objects, as I’ll explain in the code annotations:
    import matplotlib.pyplot as plt
    import numpy as np

    n = np.linspace(1, 10, 1000)
    line_names = [
        "Constant",
        "Linear",
        "Quadratic",
        "Exponential",
        "Logarithmic",
        "n log n",
    ]
    big_o = [np.ones(n.shape), n, n**2, 2**n, np.log(n), n * (np.log(n))]

    # Initialize figure and axes objects.
    fig, ax = plt.subplots()
    # Call the set_facecolor method on the fig object with an argument "white".
    fig.set_facecolor("white")

    # All the methods in the next few lines operate on the ax object.
    ax.set_ylim(0, 50)
    for i in range(len(big_o)):
        ax.plot(n, big_o[i], label=line_names[i])
    ax.set_ylabel("Relative Runtime")
    ax.set_xlabel("Input Size")
    ax.legend()

    # Saving the figure is a method called on the fig object.
    # (save_path is assumed to be defined elsewhere.)
    fig.savefig(save_path, bbox_inches="tight")
The figure and axes objects have many methods that you can call to update these objects.
Note
Matplotlib sometimes feels confusing because it has two types of interface. One of these is object oriented, and the other is designed to imitate plotting in MATLAB. Matplotlib was first released in 2003, and its developers wanted to make it familiar to people who were accustomed to using MATLAB. These days, it’s much more common to use the object-oriented interface as I’ve shown in the previous code example. But because people’s code depends on both types of interface, they both still need to exist. The article “Why You Hate Matplotlib” has more details on this topic.
Even if the terminology surrounding OOP is unfamiliar, you’ll already be using it frequently in a lot of common data science packages. The next step is to define your own classes so that you can use an object-oriented approach in your own code.
Defining Your Own Classes
If you want to write your own code in an object-oriented style, you’ll need to define your own classes. I’ll show you a couple of simple examples for how to do this. The first one repeats some text a set number of times. The second one uses the UN Sustainable Development Goals data that I’ve used in other examples throughout this book. You can find more details about this data in “Data in This Book”.
In Python, you define a new class with the class statement:
    class RepeatText():
It’s very common to store some attributes every time a new instance of an object is initialized. To do this, Python uses a special method called __init__, which is defined like this:
        def __init__(self, n_repeats):
            self.n_repeats = n_repeats
The first argument in the __init__ method refers to the new instance of the object that gets created. By convention, this is usually named self. In this example, the __init__ method takes one other argument: n_repeats. The line self.n_repeats = n_repeats means that each new instance of a RepeatText object has an n_repeats attribute, which must be provided each time a new object is initialized.
You can create a new RepeatText object like this:
    repeat_twice = RepeatText(n_repeats=2)
Then you can access the n_repeats attribute with the following syntax:
    >>> print(repeat_twice.n_repeats)
    2
Defining another method looks similar to defining the __init__ method, but you can give it any name you like, as if it were a normal function. As you’ll see below, you still need the self argument if you want each instance of your object to have this behavior:
        def multiply_text(self, some_text):
            print((some_text + " ") * self.n_repeats)
This method will look up the n_repeats attribute of the instance of the class that it acts on. This means you need to create an instance of a RepeatText object before you can use the method.
Note
Python also has methods that don’t take self as their first argument: class methods and static methods. A class method receives the class itself (conventionally named cls) and applies to the whole class, not just one instance of it. A static method receives neither self nor cls and can be called without creating an instance of the class. You can learn more about these in Introducing Python by Bill Lubanovic (O’Reilly, 2019).
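As a rough illustration of that note, the sketch below uses Python’s built-in @classmethod and @staticmethod decorators. The TextRepeater class and its twice and clean methods are hypothetical names invented for this example, and multiply_text returns its result here instead of printing it so that the value is easier to inspect.

```python
class TextRepeater:
    """Hypothetical class used only to illustrate the two decorators."""

    def __init__(self, n_repeats):
        self.n_repeats = n_repeats

    @classmethod
    def twice(cls):
        # Receives the class (cls) rather than an instance, so it can act
        # as an alternative way of constructing an object.
        return cls(n_repeats=2)

    @staticmethod
    def clean(some_text):
        # Receives neither the class nor an instance.
        return some_text.strip()

    def multiply_text(self, some_text):
        # An ordinary method: needs an instance, so it takes self.
        return " ".join([some_text] * self.n_repeats)


repeater = TextRepeater.twice()            # called on the class itself
cleaned = TextRepeater.clean("  hello  ")  # no instance required
print(repeater.multiply_text(cleaned))
```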
You can call your newly created method like this:
    >>> repeat_twice.multiply_text("hello")
    hello hello
Here’s the complete definition of the new class:
    class RepeatText():
        def __init__(self, n_repeats):
            self.n_repeats = n_repeats

        def multiply_text(self, some_text):
            print((some_text + " ") * self.n_repeats)
Let’s look at another example. This time let’s use the UN Sustainable Development Goal data introduced in Chapter 1. In the example below, I’m creating a Goal5Data object to hold some data relevant to Goal 5, “Achieve gender equality and empower all women and girls.” This particular object will hold data for one of the targets associated with this goal, Target 5.5: “Ensure women’s full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life.”
I want to be able to create an object to store the data for each country so that I can easily manipulate it in the same way. Here’s the code to create the new class and hold the data:
    class Goal5Data():
        def __init__(self, name, population, women_in_parliament):
            self.name = name
            self.population = population
            # This attribute holds a list of the percentage of seats in the
            # country's governing body held by women, by year.
            self.women_in_parliament = women_in_parliament
Here’s a method that prints a summary of this data:
        def print_summary(self):
            # np is NumPy (import numpy as np). Entries equal to zero are
            # counted as nulls here.
            null_women_in_parliament = len(self.women_in_parliament) - np.count_nonzero(
                self.women_in_parliament
            )
            print(
                f"There are {len(self.women_in_parliament)} data points for "
                "Indicator 5.5.1, 'Proportion of seats held by women in national "
                "parliaments'."
            )
            print(f"{null_women_in_parliament} are nulls.")
In the same way as the previous example, you can create a new instance of this class, like so:
    usa = Goal5Data(
        name="USA",
        population=336262544,
        women_in_parliament=[13.33, 14.02, 14.02, ...],
    )
Calling the print_summary method gives the following result:
    >>> usa.print_summary()
    There are 24 data points for Indicator 5.5.1, 'Proportion of seats held by women in national parliaments'.
    0 are nulls.
Writing this as a method ensures the code is modular, well organized, and easy to reuse. It’s also very clear what it is doing, which will help anyone who wants to use your code.
I’ll use this class in the next section to show you another principle of classes: inheritance.
OOP Principles
You’ll often encounter these terms in OOP: encapsulation, abstraction, inheritance, and polymorphism. I’ll define all of these in this section and show some examples of how inheritance can be useful to you.
Inheritance
Inheritance means that you can extend a class by creating another class that builds on it. This helps reduce repetition, because if you need a new class that’s closely related to one you have already written, you don’t need to duplicate that class to make a minor change.
You may not need to use inheritance when defining your own classes, but you might need to use it with classes from an external library. You’ll see a couple of examples of inheritance for data validation later in the book, in “Data Validation with Pydantic” and in “Adding Functionality to Your API”. In this section, I want to help you spot and understand inheritance when you encounter it.
You can spot a class that uses inheritance because it will have the following syntax:
    class NewClass(OriginalClass):
        ...
The NewClass class can use all the attributes and methods of the OriginalClass, but you can override any of these that you want to change. The term “parent” is often used to refer to the original class, and the new class that inherits from it is often called the “child” class.
Here’s an example of a new class, Goal5TimeSeries, that inherits from the Goal5Data class in the previous section, turning it into a class that can work with time series data:
    class Goal5TimeSeries(Goal5Data):
        def __init__(self, name, population, women_in_parliament, timestamps):
            super().__init__(name, population, women_in_parliament)
            self.timestamps = timestamps
The __init__ method looks a little different this time. Using super() means that the parent class’s __init__ method gets called, and this initializes the name, population, and women_in_parliament attributes.
You can create a new Goal5TimeSeries object, like so:
    india = Goal5TimeSeries(
        name="India",
        population=1417242151,
        women_in_parliament=[9.02, 9.01, 8.84, ...],
        timestamps=[2000, 2001, 2002, ...],
    )
And you can still access the method from the Goal5Data class:
    >>> india.print_summary()
    There are 24 data points for Indicator 5.5.1, 'Proportion of seats held by women in national parliaments'.
    0 are nulls.
You can also add a new method that’s relevant to the child class. For example, this new fit_trendline() method fits a regression line to the data to find its trend:
    from scipy.stats import linregress

    class Goal5TimeSeries(Goal5Data):
        def __init__(self, name, population, women_in_parliament, timestamps):
            super().__init__(name, population, women_in_parliament)
            self.timestamps = timestamps

        def fit_trendline(self):
            # Use the linregress function from scipy to fit a straight line
            # through the data using linear regression.
            result = linregress(self.timestamps, self.women_in_parliament)
            slope = round(result.slope, 3)
            # Calculate the coefficient of determination (R-squared) to
            # determine the goodness of fit of the line.
            r_squared = round(result.rvalue**2, 3)
            return slope, r_squared
Calling the new method returns the slope of the trendline and the coefficient of determination (R-squared) of the fit of the line to the data:
    >>> india.fit_trendline()
    (0.292, 0.869)
If you’re using inheritance in your own classes, it lets you extend the capabilities of the classes you create. This means less duplication of code and it helps keep your code modular. It’s also very helpful to inherit from classes in an external library. Again, this means that you don’t duplicate their functionality but you can add extra features.
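The examples in this section add to a parent class, but a child class can also override a method it inherits. The sketch below shows one way this might look. It uses simplified stand-ins for Goal5Data and Goal5TimeSeries, a hypothetical summary() method that returns a string instead of printing, and made-up data values.

```python
class Goal5Data:
    """Simplified stand-in for the class defined earlier in this chapter."""

    def __init__(self, name, population, women_in_parliament):
        self.name = name
        self.population = population
        self.women_in_parliament = women_in_parliament

    def summary(self):  # hypothetical method, for illustration only
        return f"{self.name}: {len(self.women_in_parliament)} data points"


class Goal5TimeSeries(Goal5Data):
    def __init__(self, name, population, women_in_parliament, timestamps):
        super().__init__(name, population, women_in_parliament)
        self.timestamps = timestamps

    def summary(self):
        # Override: reuse the parent's text, then add child-specific detail.
        base = super().summary()
        return f"{base}, covering {self.timestamps[0]}-{self.timestamps[-1]}"


# Made-up values, not taken from the real data set.
example = Goal5TimeSeries(
    name="Example Country",
    population=1_000_000,
    women_in_parliament=[10.0, 11.5, 12.0],
    timestamps=[2000, 2001, 2002],
)
print(example.summary())
print(isinstance(example, Goal5Data))  # a child instance is also a parent instance
```

Calling super().summary() inside the override keeps the parent’s behavior in one place, so a later change to the parent’s wording flows through to the child.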
Encapsulation
Encapsulation means that your class hides its details from the outside. You can see only the interface to the class, not the internal details of what’s going on. The interface is made up of the methods and attributes that you design. It’s not so common in Python, but in other programming languages classes are often designed with hidden or private methods or attributes that can’t be changed from the outside.
However, the concept of encapsulation is still applied in Python, and many libraries and applications take advantage of it. pandas is a great example of this. pandas uses encapsulation by providing methods and attributes that let you interact with data while keeping the underlying implementation details hidden. A DataFrame object encapsulates data and provides various methods for accessing, filtering, and transforming it. As I mentioned in Chapter 3, pandas DataFrames use NumPy under the hood, but you don’t need to know this to use them. You can use the pandas DataFrame interface to achieve your tasks, but if you need to dive deeper you can still use NumPy methods as well.
Note
Interfaces are extremely important because other code or classes will often depend on the existence of some attribute or method, so if you change that interface some other code may break. It’s fine to change the internal workings of your class, for example, to change the calculations within some method to make it more efficient. But you should make the interface easy to use from the start and try not to change it. I’ll discuss interfaces in more detail in Chapter 8.
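Python has no strictly private attributes, but a leading underscore is a widely used convention for marking something as an internal detail. The following is a small sketch of that convention. The PercentageSeries class and its members are hypothetical and are not taken from any library.

```python
class PercentageSeries:
    """Hypothetical class whose public interface is `values` and `mean()`."""

    def __init__(self, values):
        # Leading underscore: an internal detail by convention, not by enforcement.
        self._values = list(values)

    @property
    def values(self):
        # Return a copy so callers cannot modify the internal list by accident.
        return list(self._values)

    def mean(self):
        # The body of this method could be rewritten later (for example, to
        # use NumPy) without affecting any code that calls mean().
        return sum(self._values) / len(self._values)


series = PercentageSeries([10.0, 20.0, 30.0])
external_copy = series.values
external_copy.append(1000.0)  # only changes the caller's copy
print(series.mean())
```

Code that depends only on values and mean() keeps working if the internal storage changes, which is the point the note above makes about interfaces.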
Abstraction
Abstraction is closely linked to encapsulation. It means that you should deal with a class at the appropriate level of detail. So you might choose to keep the details of some calculation within a method, or you might allow it to be accessed through the interface. Again, this is more common in other programming languages.
Polymorphism
Polymorphism means that you can have the same interface for different classes, which simplifies your code and reduces repetition. That is, two classes can have a method with the same name that produces a similar result, but the internal workings are different. The two classes can be a parent and child class, or they can be unrelated.
scikit-learn contains a great example of polymorphism. Every classifier has the same fit method to train the classifier on some data, even though it’s defined as a different class. Here’s an example of training two different classifiers on some data:
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    lr_clf = LogisticRegression()
    lr_clf.fit(X_train, y_train)

    rf_clf = RandomForestClassifier()
    rf_clf.fit(X_train, y_train)
Even though LogisticRegression and RandomForestClassifier are different classes, both of them have a .fit() method that takes the training data and training labels as arguments. Sharing the name of the method makes it easy for you to change the classifier without changing much of your code.
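The same idea can be sketched without scikit-learn. The two toy classes below are hypothetical and unrelated to each other, and they are not scikit-learn estimators. They simply share fit and predict method names so that the calling code does not need to know which one it has been given.

```python
class MeanPredictor:
    """Toy model: always predicts the mean of the training labels."""

    def fit(self, X, y):
        self.prediction_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


class MostFrequentPredictor:
    """Toy model: always predicts the most common training label."""

    def fit(self, X, y):
        self.prediction_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


# Made-up training data, for illustration only.
X_train = [[0], [1], [2], [3]]
y_train = [1, 1, 1, 5]

results = {}
for model in (MeanPredictor(), MostFrequentPredictor()):
    # The same two calls work for both classes.
    model.fit(X_train, y_train)
    results[type(model).__name__] = model.predict([[4]])

print(results)
```

Swapping in a third class only requires that it provides the same two methods; the loop itself stays unchanged.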
This was a brief overview of the main features of object-oriented programming. It’s a huge topic, and I recommend Introducing Python by Bill Lubanovic (O’Reilly, 2019) if you would like to learn more.
Functional Programming
While Python supports the functional programming paradigm, it’s not common to write Python in a purely FP style. Many software engineers consider other languages, such as Scala, more suitable for FP. However, Python has some useful FP features that are well worth knowing about, and I’ll discuss them below.
Functional programming, as the name suggests, is all about functions, and specifically functions that don’t change anything outside themselves. These functions shouldn’t change any data that exists outside the function or change any global variables. To use the correct terminology, the data is treated as immutable, and the functions are “pure” and free of side effects. They don’t affect anything that isn’t reflected in what the function returns. For example, if you have a function that adds an item to a list, that function should return a new copy of the list rather than modifying the existing list. In strict FP, a program consists only of evaluating functions. These may be nested (where one function is defined within another), or functions may be passed as arguments to other functions.
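The list example in the paragraph above can be sketched as follows. Both function names are hypothetical; the first shows the side effect that FP avoids, and the second shows the pure alternative.

```python
def append_in_place(items, new_item):
    """Impure: modifies the list that was passed in (a side effect)."""
    items.append(new_item)
    return items


def append_pure(items, new_item):
    """Pure: builds and returns a new list, leaving the input untouched."""
    return items + [new_item]


original = [1, 2, 3]

extended = append_pure(original, 4)
unchanged_after_pure = original == [1, 2, 3]  # the input list is untouched

mutated = append_in_place(original, 4)
same_object = mutated is original  # the caller's own list was modified

print(extended, unchanged_after_pure, same_object)
```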
Some advantages of FP include:
- It’s easy to test because a function always returns the same output for a given input. Nothing outside the function is modified.
- It’s easy to parallelize because data is not modified.
- It enforces writing modular code.
- It can be more concise and efficient.
Common Python concepts in a functional style include lambda functions and the map and filter built-in functions. In addition, generators are often written in this style, and list comprehensions can also be thought of as a form of FP. Other libraries worth knowing about for FP include itertools and more-itertools. I’ll take a closer look at lambda functions and map() in the next section.
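As a quick, hedged tour of the other tools mentioned in that paragraph, the sketch below uses filter, a list comprehension, a generator expression, and itertools.accumulate from the standard library. The four values are the first few percentages shown elsewhere in this chapter, and the variable names are arbitrary.

```python
from itertools import accumulate

percentages = [13.33, 14.02, 14.02, 14.25]  # a short sample, not the full series

# filter() keeps the elements for which the function returns True.
above_14 = list(filter(lambda x: x > 14, percentages))

# A list comprehension expresses the same filter without a lambda.
above_14_comprehension = [x for x in percentages if x > 14]

# A generator expression produces values lazily, one at a time.
rounded = (round(x) for x in percentages)
first_rounded = next(rounded)

# itertools.accumulate yields running results; here, the running maximum.
running_max = list(accumulate(percentages, max))

print(above_14, first_rounded, running_max)
```

None of these calls modify the percentages list; each one produces a new value or a new iterable.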
Lambda Functions and map()
Lambda functions are small, anonymous Python functions that you can use for quick one-off tasks. They are termed “anonymous” because they aren’t defined like a normal Python function with a name.
A lambda function has the syntax:
    lambda arguments: expression
A lambda function can take as many arguments as you like, but it can have only one expression. Lambda functions are frequently used with built-in functions like map and filter. These take functions as arguments and then can apply the function to every element in an iterable (such as a list).
Here’s a simple example. Using the Goal 5 data from “Defining Your Own Classes”, you can convert a list of the percentages of women in government positions to a list of proportions from 0 to 1 using the following function:
    usa_govt_percentages = [13.33, 14.02, 14.02, 14.25, ...]
    usa_govt_proportions = list(map(lambda x: x / 100, usa_govt_percentages))
There’s a lot going on in one line here. The lambda function in this case is lambda x: x/100. In this function, x is a temporary variable that isn’t used outside the function. map() applies the lambda function to every element in the list. And finally, list() creates a new list based on the map.
This gives the following result:
    >>> print(usa_govt_proportions)
    [0.1333, 0.1402, 0.1402, 0.1425, ...]
Note that the original data was not changed by applying this function. A new list was created with the altered data.
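For comparison, here is a sketch of the same conversion written as a list comprehension, followed by a lambda that takes two arguments. The list is truncated to four values, and the years list is hypothetical and included only to give map() a second iterable.

```python
usa_govt_percentages = [13.33, 14.02, 14.02, 14.25]  # truncated sample

# map() with a lambda, as in the example above.
proportions_map = list(map(lambda x: x / 100, usa_govt_percentages))

# The equivalent list comprehension.
proportions_comprehension = [x / 100 for x in usa_govt_percentages]

# A lambda can take more than one argument; map() then reads from
# several iterables in parallel.
years = [2000, 2001, 2002, 2003]  # hypothetical years, for illustration only
labelled = list(
    map(lambda year, pct: f"{year}: {pct}%", years, usa_govt_percentages)
)

print(proportions_map == proportions_comprehension)
print(labelled[0])
```

Both versions leave usa_govt_percentages untouched and build a new list.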
Applying Functions to DataFrames
In a similar way to the map() built-in function above, you can also apply functions to DataFrames. This can be particularly useful if you want to create a new column based on an existing column. Again, you can use a function that takes another function as an input. In pandas, this is apply().
Here’s an example of applying a lambda function to a column in a DataFrame:
    df["USA_processed"] = df["United States of America"].apply(
        lambda x: "Mostly male" if x < 50 else "Mostly female"
    )
In this example, the column United States of America is the data on women in government positions that I’ve been using throughout the chapter. The lambda function takes the percentage of women in government positions and returns "Mostly male" if that figure is under 50%, or "Mostly female" if it is 50% or greater.
You can also use apply() with a named function defined elsewhere. Here’s the same function as before but as a named function:
    def binary_labels(women_in_govt):
        if women_in_govt < 50:
            return "Mostly male"
        else:
            return "Mostly female"
You can call this function on every row in a column by passing the function name as an argument to the apply function:
    df["USA_processed"] = df["United States of America"].apply(binary_labels)
This is a better solution than writing a lambda function because you may want to reuse the function in the future, and you can also test and debug it separately. You can also include more complex functionality than in a lambda function.
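To illustrate the point about testing, here is a minimal sketch of checking the named function on its own with plain assert statements. The test function name and the input values are arbitrary choices; a test framework such as pytest could run the same checks.

```python
def binary_labels(women_in_govt):
    if women_in_govt < 50:
        return "Mostly male"
    else:
        return "Mostly female"


def test_binary_labels():
    assert binary_labels(13.33) == "Mostly male"
    assert binary_labels(49.99) == "Mostly male"
    # The boundary value belongs to the "50% or greater" branch.
    assert binary_labels(50) == "Mostly female"
    assert binary_labels(75.0) == "Mostly female"


test_binary_labels()

# The same named function can also be reused elsewhere, for example with map().
labels = list(map(binary_labels, [13.33, 50, 61.2]))
print(labels)
```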
Warning
The apply function in pandas is slower than built-in vectorized functions because it iterates through every row in a DataFrame. So best practice is to use apply only for something that’s not already implemented. Simple numeric operations like getting the maximum of a column or simple string operations such as replacing one string with another are already available as faster vectorized functions, so you should use these built-in functions where possible.
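As one possible illustration of that warning, the labeling example from this section can be written without apply. The sketch below uses NumPy’s np.where, which is only one of several vectorized options, and a small DataFrame of made-up values standing in for the real column.

```python
import numpy as np
import pandas as pd

# Made-up sample values; the real column holds the full series.
df = pd.DataFrame({"United States of America": [13.33, 14.02, 50.0, 61.2]})

# Element-by-element approach with apply().
with_apply = df["United States of America"].apply(
    lambda x: "Mostly male" if x < 50 else "Mostly female"
)

# Vectorized approach: the comparison runs on the whole column at once,
# and np.where picks a label for each element.
vectorized = np.where(
    df["United States of America"] < 50, "Mostly male", "Mostly female"
)

same = list(with_apply) == vectorized.tolist()
print(same)
```

On four rows the speed difference is negligible; it tends to matter as the column grows.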
Which Paradigm Should I Use?
To be honest, if you’re just writing a small script or working on a short project on your own, you don’t need to fully buy in to either of these paradigms. Just stick to modular scripts that work.
For larger projects, however, it’s a great idea to think about the type of problem you’re dealing with and whether one of these paradigms is a good fit. You might reach for OOP if you find yourself thinking about a set of things that need something done to them. You can turn your problem space into instances that need to have similar behavior but different attributes or data. An important point here is that you should have many instances of some class. It isn’t worth writing a new class if you have only one instance of it; that just adds extra complexity you don’t need.
If you find yourself wanting to do new things to some data that remains fixed, FP might be a good choice for you. It’s also worth looking at FP if you have a large amount of data and you want to parallelize the operations you do to it.
There’s no right or wrong here, though. You can go with your personal preference if you’re working alone, or go with what’s predominantly used in your team to keep things standardized. It’s good to recognize when these paradigms are being used, use them in other people’s code, and make decisions about what would work best for your specific problem.
Key Takeaways
OOP and FP are programming paradigms that you’ll encounter in the code you read. OOP is concerned with objects, which are custom data structures, and FP is concerned with functions that don’t change the underlying data.
In OOP, a class defines new objects, which can have attributes and methods. You can define your own classes to keep associated methods and data together, and this is a great approach to use when you have many instances of similar objects. You can use inheritance to avoid repeating code, and you can use polymorphism to keep your interfaces standardized.
In FP, ideally everything is within the function. This is useful when you have data that doesn’t change and you want to do lots of things to it, or you want to parallelize what you are doing to the data. Lambda functions are the most commonly used example of FP in Python.
Your choice of paradigm depends on the problem you are working on, but you’ll find it useful to have an awareness of both.