Chapter 1. What Is Good Code?

This book aims to help you write better code. But first, what makes code “good”? There are a number of ways to think about this: the best code could be the code that runs fastest. Or it could be easiest to read. Another possible definition is that good code is easy to maintain. That is, if the project changes, it should be easy to go back to the code and change it to reflect the new requirements. The requirements for your code will change frequently because of updates to the business problem you’re solving, new research directions, or updates elsewhere in the codebase.

In addition, your code shouldn’t be complex, and it shouldn’t break if it gets an unexpected input. It should be easy to add a simple new feature to your code; if this is hard it suggests your code is not well written. In this chapter, I’ll introduce aspects of good code and show examples for each. I’ll divide these into five categories: simplicity, modularity, readability, performance, and robustness.

Why Good Code Matters

Good code is especially important when your data science code integrates with a larger system. This could be putting a machine learning model into production, writing packages for wider distribution, or building tools for other data scientists. It’s most useful for larger codebases that will be run repeatedly. As your project grows in size and complexity, the value of good code will increase.

Sometimes, the code you write will be a one-off, a prototype that needs to be hacked together today for a demo tomorrow. And if you truly will run the code only once, then don’t spend the time making it beautiful: just write code to do the job it’s needed for. But in my experience, even the code you write for a one-off demo is almost always run again or reused for another purpose. I encourage you to take the time to go back to your code after the urgency has passed and tidy it up for future use.

Good code is also easier to maintain. There’s a phenomenon known as “bit-rot”: the need to update code that hasn’t been used in some time. This happens because things your code depends on also change (for example, third-party libraries or even the operating system you’re using). If you come back to code you haven’t used for a while, you’ll probably need to do some work to modernize it. This is much easier if your code is well structured and well documented.

Note

Technical debt (often abbreviated as tech debt) is a commonly used term for the deferred work that accumulates when code is written quickly rather than correctly. Tech debt can take the form of missing documentation, poorly structured code, poorly named variables, or any other cut corners. These make the code harder to maintain or refactor, and it’s likely that you will spend more time in the future fixing bugs than you would have spent writing the code well in the first place. That said, tech debt is often necessary because of business deadlines and budgets. You don’t always have time to polish your code.

Adapting to Changing Requirements

Writing code is not like building a bridge, where the design is thoroughly worked out, the plans are fixed, and then construction happens. The one constant in writing code, for a data science project or anything else, is that you should expect things to change as you work on a project. These changes may be the result of your discoveries through your research process, changing business requirements, or innovations that you want to include in the project. Good code can be easily adapted to work well with these changes.

This adaptability becomes more important as your codebase grows. With a single small script, making changes is simple. But as the project grows and gets broken out into multiple scripts or notebooks that depend on each other, it can become more complex and harder to make changes. Good practices from the start will make it easier to modify the code in a larger project.

Data science is still a relatively new field, but data science teams are starting to encounter situations where they have been working on the same codebase for multiple years, and the code has been worked on by many people, some of whom may have left the company. In this situation, where a project is handed over from one person to another, code quality becomes even more important. It is much easier to pick up on someone else’s work if it is well documented and easy to read.

Software engineering as a discipline has been dealing with changing requirements and increasing complexity for decades. It has developed a number of useful strategies that you, as a data scientist, can borrow and take advantage of. If you start to look into this, you may see references to “clean code,” from the book of the same name by Robert C. Martin, or acronyms such as SOLID.

As I mentioned, in this chapter, I’ve chosen to divide these principles into five features of good code: simplicity, modularity, readability, performance, and robustness. I’ll describe each of these in detail in the rest of this chapter.

Simplicity

Simple is better than complex.

Tim Peters, The Zen of Python

If you are working on a small project, maybe a data visualization notebook or a short data wrangling script, you can keep all the details in your mind at one time. But as your project grows and gets more complex, this stops being feasible. You can keep the training steps of your machine learning model in your head but not the input data pipeline, or the model deployment process.

Complexity makes it hard to modify code when your requirements change, and it can be defined as follows:

Complexity is anything related to the structure of a system that makes it hard to understand and modify a system.

John K. Ousterhout, A Philosophy of Software Design

This isn’t a precise definition, but with experience you’ll get a sense of when a system becomes more complex. One sign is that a change in one place breaks something apparently unrelated in an unexpected way. For example, you might train a machine learning model on customer review data to extract the item that the customer bought using natural language processing (NLP) techniques. You have a separate preprocessing step that truncates the review to 512 characters. But when you deploy the model, you forget to add the preprocessing step to the inference code. Suddenly, your model throws errors because the input data is longer than 512 characters. This system is starting to get hard to reason about and is becoming complex.
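One way to lower the risk of this kind of mismatch is to keep the truncation step in a single function that both the training code and the inference code call. The sketch below assumes that setup. `MAX_REVIEW_LENGTH`, `preprocess_review`, `prepare_training_data`, and `predict_item` are hypothetical names, and the model is assumed to be any callable that accepts a string.

```python
MAX_REVIEW_LENGTH = 512  # assumed limit, taken from the example in the text


def preprocess_review(review):
    """Truncate a review to the maximum length the model accepts."""
    return review[:MAX_REVIEW_LENGTH]


def prepare_training_data(reviews):
    """Apply the shared preprocessing step to every training review."""
    return [preprocess_review(review) for review in reviews]


def predict_item(model, review):
    """Run inference using the same preprocessing as training.

    `model` is a hypothetical stand-in: any callable that takes a string.
    """
    return model(preprocess_review(review))
```

Because inference goes through `preprocess_review`, the truncation step cannot be left out of deployment by accident.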

When I’m talking about complexity in code, this is generally accidental and different from the essential complexity of a project. Your machine learning project may be complex because you want to try many different types of models and many different combinations of features to see which one works best. Your analysis may be complex because the data you’re using has many different interdependent parameters. Neither of these can be made simpler; the complexity is just part of the project. In contrast, accidental complexity is when you’re not sure which function within your code you need to change to achieve some action.

However, there are tools you can use to help decrease complexity in your code. Making everything a little bit simpler as you go along has huge benefits when the project becomes large. In the next section, I’ll describe how to keep your code simple by avoiding repetition. Then I’ll discuss how to keep your code concise. You can also keep your code from becoming complex by dividing it into reusable pieces, as I’ll describe in “Modularity”.

Don’t Repeat Yourself (DRY)

One of the most important principles in writing good code is that information should not be repeated. All knowledge should have one single representation in code. If information is repeated in multiple places, and that information needs updating because of changing requirements, then one change means many updates. You would need to remember all the places where this information needs to be updated. This is hard to do and increases code complexity. Additionally, duplication increases opportunities for bugs, and longer code requires more time to read and understand.

There’s also an increase in mental effort when you see two pieces of code that are very similar but not exact duplicates. It’s hard to tell if the two pieces of code are doing the same thing or something different.

Here’s a simple example: you want to open three CSV files, read them into a pandas DataFrame, do some processing, and return each DataFrame. The data in this example is from the UN Sustainable Development Goals (SDGs), and I’ll be using data from this site throughout the book. You can find more details on this data in “Data in This Book”. The code and the CSV files are available in the GitHub repository for this book.

For a first pass, you could do something like this:

import pandas as pd

df = pd.read_csv("sdg_literacy_rate.csv")
df = df.drop(["Series Name", "Series Code", "Country Code"], axis=1)
df = df.set_index("Country Name").transpose()

df2 = pd.read_csv("sdg_electricity_data.csv")
df2 = df2.drop(["Series Name", "Series Code", "Country Code"], axis=1)
df2 = df2.set_index("Country Name").transpose()

df3 = pd.read_csv("sdg_urban_population.csv")
df3 = df3.drop(["Series Name", "Series Code", "Country Code"], axis=1)
df3 = df3.set_index("Country Name").transpose()

But this is unnecessarily long-winded and repetitive. A better way to achieve the same result would be to put the repeated code inside a for loop. If this is something you’ll be using repeatedly, you can put the code inside a function, like this:

def process_sdg_data(csv_file, columns_to_drop):
    df = pd.read_csv(csv_file)
    df = df.drop(columns_to_drop, axis=1)
    df = df.set_index("Country Name").transpose()
    return df
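As a sketch of how the for loop and the function might fit together, the following calls `process_sdg_data` once per file and collects the results in a dictionary. The `load_sdg_files` helper and the dictionary layout are assumptions made for illustration, not part of the original example. The file names in the commented call are the ones used in the first version above.

```python
import pandas as pd


def process_sdg_data(csv_file, columns_to_drop):
    df = pd.read_csv(csv_file)
    df = df.drop(columns_to_drop, axis=1)
    df = df.set_index("Country Name").transpose()
    return df


def load_sdg_files(csv_files, columns_to_drop):
    """Process each CSV file and return the DataFrames keyed by file name.

    Hypothetical helper: the name and return type are illustrative choices.
    """
    return {
        csv_file: process_sdg_data(csv_file, columns_to_drop)
        for csv_file in csv_files
    }


# Example call (assumes the three CSV files are in the working directory):
# sdg_data = load_sdg_files(
#     ["sdg_literacy_rate.csv", "sdg_electricity_data.csv", "sdg_urban_population.csv"],
#     ["Series Name", "Series Code", "Country Code"],
# )
```

With this layout, the list of columns to drop appears in one place, so a change to it needs only one update.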

Other, more subtle cases can give rise to code duplication. Here are a few examples:

You might find yourself using very similar code in multiple projects without realizing it, for example, data processing code. Breaking out the processing code so that it can take slightly varying data, rather than rigidly accepting only one exact type of data, could help you avoid this duplication.
Multiple people working on similar projects might write similar code, particularly if they don’t communicate about what they’re working on. Making code easy for other people to use and providing good documentation will help to reduce this type of duplication.
Comments and documentation can also be a form of duplication. The same knowledge is represented in the code and the documentation that describes it. Don’t write comments that describe exactly what the code is doing; instead, use them to add knowledge. I’ll describe this in more detail in Chapter 9.

The DRY principle is extremely important to consider when writing good code. It sounds trivial, but avoiding repetition means that your code needs to be modular and readable. I’ll discuss these concepts later in this chapter.

Avoid Verbose Code

Sometimes, you can make your code simpler by having fewer lines of code. This means fewer opportunities for bugs and less code for someone else to read and understand. However, there’s often a trade-off between making your code shorter and making it less readable. I’ll talk about how to ensure your code is readable in “Readability”.

I recommend that you aim to make your code concise but still readable. To do this, avoid doing things that make your code unnecessarily long-winded, such as writing your own functions instead of using built-in functions or using unnecessary temporary variables. You should also avoid repetition, as described in the previous section.

Here’s an example of an unnecessary temporary variable:

i = float(i)
image_vector.append(i/255.0)

This can be simplified to the following:

image_vector.append(float(i)/255)
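The other source of verbosity mentioned above, writing your own function where a built-in would do, can be sketched in the same way. In this illustration, `largest_pixel_value` is a hypothetical hand-written helper and the pixel values are made up. Python's built-in `max()` gives the same result in one call.

```python
def largest_pixel_value(pixels):
    """Hand-written search for the largest value (unnecessarily verbose)."""
    largest = pixels[0]
    for pixel in pixels[1:]:
        if pixel > largest:
            largest = pixel
    return largest


pixels = [12, 255, 0, 87]  # made-up pixel values for illustration

# The built-in max() does the same job with less code to read and maintain.
assert largest_pixel_value(pixels) == max(pixels)
```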

Of course, there are downsides to squeezing your code into fewer lines. If a lot is happening on one line, it can be extremely hard for anyone else to understand what is going on. This means it is harder for someone else to work on your code, and this could lead to more bugs. If in doubt, I recommend that you keep your code readable even if it means you use a few extra lines.
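As a rough illustration of this trade-off, the two functions below compute the same thing: the mean of the normalized pixel values that exceed a threshold. The function names, the threshold, and the task itself are made up for this sketch.

```python
def bright_mean_dense(pixels, threshold=0.5):
    # Everything on one line: filtering, scaling, and averaging are tangled together.
    return sum(p / 255 for p in pixels if p / 255 > threshold) / max(1, len([p for p in pixels if p / 255 > threshold]))


def bright_mean_readable(pixels, threshold=0.5):
    # A few extra lines, but each step has a name and can be checked on its own.
    normalized = [p / 255 for p in pixels]
    bright = [value for value in normalized if value > threshold]
    if not bright:
        return 0.0
    return sum(bright) / len(bright)
```

The second version is longer, but the handling of the empty case is visible at a glance instead of being hidden inside `max(1, ...)`.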

Modularity

Writing modular code is the art of breaking a big system into smaller components. Modular code has several important advantages: it makes the code easier to read, it’s easier to locate where a problem comes from, and it’s easier to reuse code in your next project. It’s also easier to test code that is broken into smaller components, which I discuss in Chapter 7.

But how do you tackle a large task? You could just write one big script to do the whole thing, and this might be fine at the start of a small project. But larger projects need to be broken into smaller pieces. To do this, you’ll need to think as far ahead into the future of the project as possible and try to anticipate what the overall system will do and what might be sensible places to divide it up. I’ll discuss this planning process in much more detail in Chapter 8.

Writing modular code is an ongoing process, and it’s not something you’ll get completely correct from the beginning, even if you have the best intentions. You should expect to change your code as your project evolves. I’ll cover techniques that will help you improve your code in “Refactoring”.

You might break a large data science project into a series of steps by thinking about it as a flowchart, as shown in Figure 1-1. First you extract some data, then explore it, then clean it, and then visualize it.

At first, this could be a series of Jupyter notebooks. At the end of each one, you could save the data to a file, then load it again into the next notebook. As your project matures, you might find that you want to run a similar analysis repeatedly. Then, you can decide what the skeleton of the system should be: maybe there’s one function that extracts the data, then passes it to the function that cleans the data. The example below uses the pass statement as a placeholder for each function body. This lets you define and call each function without causing an error before its real body is written.

For example, this could be the skeleton of a system that loads some data, cleans it by cropping it to some maximum length, and plots it with some plotting parameters:

def load_data(csv_file):
    pass

def clean_data(input_data, max_length):
    pass

def plot_data(clean_data, x_axis_limit, line_width):
    pass

By creating this framework, you have broken down the system into individual components, and you know what each of those components should accept as an input. You can do the same thing at the level of a Python file. Using a programming paradigm such as object-oriented programming or functional programming can help you figure out how to break your code down into functions and classes (more on this in Chapter 4). However you divide up your system, each of the components should be as independent and self-contained as possible so that changing one component doesn’t change another. I’ll discuss modular code in more detail in Chapter 8.
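As a sketch of how the pieces could connect, a top-level function can call the three components in order. `run_pipeline` is a hypothetical name added for illustration, and the file name and parameter values are made up. Because the components are still placeholders, each call returns `None`, but the whole skeleton already runs without errors.

```python
def load_data(csv_file):
    pass


def clean_data(input_data, max_length):
    pass


def plot_data(clean_data, x_axis_limit, line_width):
    pass


def run_pipeline(csv_file, max_length, x_axis_limit, line_width):
    """Hypothetical entry point that wires the three components together."""
    raw_data = load_data(csv_file)
    cleaned = clean_data(raw_data, max_length)
    return plot_data(cleaned, x_axis_limit, line_width)


# The placeholders return None, so this call runs without raising an error.
result = run_pipeline("example.csv", max_length=100, x_axis_limit=10, line_width=2)
```

Each placeholder can then be filled in one at a time without changing how the components are connected.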

Readability

…code is read much more often than it is written…

PEP8

When you write code, it’s important that other people are also able to use it. You might move on to a different project, or even a different job. If you leave a project for a while and come back to it in a month, six months, or even six years, can you still understand what you were doing at the time you wrote it? You wrote that code for a reason, for a task that was important, and making your code readable gives it longevity.

Methods to make your code more readable include adhering to standards and conventions for your programming language, choosing good names, removing unused code, and writing documentation for your code. It’s tempting to treat these as an afterthought and concentrate more on the functionality of the code, but if you pay attention to making your code readable at the time of writing it, you will write code that is less complex and easier to maintain. I’ll introduce these methods in this section, and I’ll cover them in much more detail in Chapters 6 and 9.

Standards and Conventions

Coding standards and formatting may seem like the least exciting topics I’ll cover in this book, but they are surprisingly important. There are many ways to express the same code, even down to small details such as the spacing around the + sign when adding two integers. Coding standards have been developed to encourage consistency across everyone writing Python code, and the aim is to make code feel familiar even when someone else has written it. This helps reduce the amount of effort it takes to read and edit code that you haven’t written yourself. I’ll cover this topic in more detail in Chapter 6.

Python is inherently very readable compared to many programming languages; sticking to a coding standard will make it even easier to read. The main coding standard for Python is PEP8 (Python Enhancement Proposal 8), established in 2001. The example below shows an extract from PEP8, and you can see that there are conventions for even the smallest details of your code. Style guides such as Google’s Python Style Guide complement PEP8 with additional guidance and information.

Here’s an example of the details that PEP8 specifies, showing the correct and incorrect ways of formatting spaces within brackets:

# Correct:
spam(ham[1], {eggs: 2})

# Wrong:
spam( ham[ 1 ], { eggs: 2 } )

Fortunately, there are many automated ways to check that your code conforms with coding standards, which saves you from the boring work of going through and checking that every + sign has one single space around it. Linters such as Flake8 and Pylint highlight places where your code doesn’t conform with PEP8. Automatic formatters such as Black will update your code automatically to conform with coding standards. I’ll cover how to use these in Chapter 6.

Names

When writing code for data science, you’ll need to choose names at many points: names of functions, variables, projects and even whole tools. Your choice of names affects how easy it is to work on your code. If you choose names that are not descriptive or precise, you’ll need to keep their true meaning in your head, which will increase your code’s cognitive load. For example, you could import the pandas library as p and name your variables x and f, as shown here:

import pandas as p

x = p.read_csv(f, index_col=0)

This code runs correctly, with no errors. But here’s a version that’s easier to read because the variable names are more informative and follow standard conventions:

import pandas as pd

df = pd.read_csv(input_file, index_col=0)

There’s more detail about writing good names in Chapter 9.

Cleaning Up

Another way to make your code more readable is to clean it up after you have finished creating a function. Once you’ve tested it and you are confident it is working, you should remove code that has been commented out and remove unnecessary calls to the print() function that you may have used as a simple form of debugging. It’s very confusing to see commented out sections in someone else’s code.

When you see untidy sections of code, it sends a message that poor code quality is acceptable in a project. This means there’s less incentive for other contributors to write good code. The untidy code may also be copied and adapted in other parts of the project. This is known as the broken windows theory. Setting high standards in a project encourages everyone working on it to write good code.

If you want to improve your code quality, you may decide to refactor it. Refactoring means changing the code without changing its overall behavior. You may have thought of ways that your code could be more efficient, or you have thought of a better way to structure it that would let your teammate use pieces of the code in another project. Tests are essential in this process, because they will check that your new code still has the same overall behavior. I’ll cover refactoring in “Refactoring”.
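A minimal sketch of this idea: the two functions below are a made-up "before" and "after" of a refactoring, and the test checks that the behavior is unchanged. All names here are hypothetical.

```python
def scale_pixels_before(pixels):
    """Original version: builds the result with an explicit loop."""
    scaled = []
    for pixel in pixels:
        scaled.append(float(pixel) / 255)
    return scaled


def scale_pixels_after(pixels):
    """Refactored version: same behavior, expressed as a list comprehension."""
    return [float(pixel) / 255 for pixel in pixels]


def test_refactoring_preserves_behavior():
    """Check that both versions agree on a few example inputs."""
    for example in ([], [0], [0, 128, 255]):
        assert scale_pixels_before(example) == scale_pixels_after(example)


test_refactoring_preserves_behavior()
```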

Documentation

Documentation also helps other people read your code. Code can be documented at multiple levels of detail, starting with simple inline comments, moving up to docstrings that explain a whole function, going on to the README page displayed in a GitHub repository and even tutorials to teach users how to use a package. All of these aspects of documentation help explain to other people how to use your code. They might even explain your code to you in the future (a very important audience!). If you want other people to use your code, you should make it easy for them by writing good documentation.

Writing great documentation is one thing, but you also need to maintain and keep it updated. Documentation that refers to an outdated version of the code is worse than no documentation at all. It will cause confusion that will take extra time to resolve. I’ll discuss all forms of documentation in much more detail in Chapter 9.
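To make the inline comments and docstrings mentioned above more concrete, here is a small sketch. The function `crop_series` is hypothetical, and the Args/Returns layout of the docstring is one common convention chosen for illustration rather than a requirement.

```python
def crop_series(values, max_length):
    """Return the first `max_length` items of `values`.

    Args:
        values: A list of measurements, oldest first.
        max_length: The maximum number of items to keep.

    Returns:
        A new list containing at most `max_length` items.
    """
    # Slicing copies the list, so the caller's data is never modified.
    return values[:max_length]
```

The inline comment adds knowledge that is not obvious from the line itself, rather than restating what the code does.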

Performance

Good code needs to be performant. This is measured both in the running time of the code and in its memory usage. When you’re making decisions about how to write your code, it’s useful to know which data structures and algorithms are more efficient. It’s especially useful to know when you are doing things that will slow your code significantly, particularly when there’s a readily available alternative. You should also be aware of which parts of your code are taking a long time.

Performance is particularly important when you are writing production code that is going to be called every time a user takes a particular action. If your user base grows, or your project is successful, your code could be called millions of times every day. In this case, even small improvements in your code can save many hours for your users. You don’t want your code to be the slow point in a large application. I’ll explain how to measure the performance of your code in Chapter 2, and I’ll show you how to choose the best data structures to optimize your code’s performance in Chapter 3.
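As one illustration of a data-structure choice that affects running time, the sketch below uses the standard library's `timeit` module to compare membership checks on a list and on a set. The collection size and variable names are arbitrary. Exact timings depend on your machine, so none are shown here.

```python
import timeit

N_ITEMS = 100_000  # arbitrary size chosen for illustration
items_list = list(range(N_ITEMS))
items_set = set(items_list)
target = N_ITEMS - 1  # worst case for the list: the last element

# A list is searched element by element; a set uses a hash lookup.
list_seconds = timeit.timeit(lambda: target in items_list, number=100)
set_seconds = timeit.timeit(lambda: target in items_set, number=100)

print(f"list membership: {list_seconds:.6f} s for 100 checks")
print(f"set membership:  {set_seconds:.6f} s for 100 checks")
```

Both checks give the same answer, so swapping the list for a set here changes only the cost of the lookup, not the result.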

Robustness

Good code also should be robust. By this, I mean it should be reproducible: you should be able to run your code from start to end without it failing. Your code also should be able to respond gracefully if system inputs change unexpectedly. Instead of throwing an unexpected error that could cause a larger system to fail, your code should be designed to respond to changes. Your code can be made more robust by properly handling errors, logging what has happened, and writing good tests.

Errors and Logging

Robust code shouldn’t behave unexpectedly when it gets an incorrect input. You should choose whether you want your code to crash on an unexpected input, or handle that error and do something about it. For example, if your CSV file is missing half the rows of data you expect, do you want your code to return an error or continue to evaluate only half the data? You should make an explicit choice to give an alert that something is not as it should be, handle the error, or fail silently. I’ll cover errors in more detail in Chapter 5.

If you do handle the error, it can still be important to record that it happened so that the failure isn’t silent, unless silence is what you intend. This is one use case for logging; I’ll explore other uses for logging in Chapter 5.
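The sketch below shows one way to make that choice explicit for the CSV example: the caller decides whether a short file raises an error or is logged as a warning and processed anyway. `check_row_count`, its parameters, and the `strict` flag are hypothetical. Only the standard library's `logging` module is used.

```python
import logging

logger = logging.getLogger(__name__)


def check_row_count(rows, expected_rows, strict=True):
    """Return `rows` if complete; otherwise raise or log, depending on `strict`."""
    if len(rows) >= expected_rows:
        return rows
    message = f"Expected {expected_rows} rows but found {len(rows)}"
    if strict:
        # Fail loudly so a larger system cannot continue with partial data.
        raise ValueError(message)
    # Continue with partial data, but leave a record so it is not silent.
    logger.warning(message)
    return rows
```

Making `strict` a parameter keeps the decision visible at the call site instead of burying it inside the function.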

Testing

Testing is key to writing robust code. Software engineering uses two main types: user testing, where a person uses a piece of software to confirm that it works correctly, and automated testing. A common method for automated testing is sending an example input into a piece of code and confirming that the output is what you expected. I’ll cover only automated tests in this book.

Tests are necessary because even if your code runs perfectly on your machine, this doesn’t mean it will work on anyone else’s machine, or even on your own machine in the future. Data changes, libraries are updated, and different machines run different versions of Python. If someone else wants to use your code on their machine, they can run your tests to confirm it works.

There are several different types of tests. Unit tests test a single function, end-to-end tests test a whole project, and integration tests test a chunk of code that contains many functions but is still smaller than a whole project. I’ll describe testing strategies and libraries in detail in Chapter 7. But a good strategy for getting started, if you have a large codebase with no tests, is to write a test when something breaks to ensure that the same thing doesn’t happen again.
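As a sketch of the simplest kind of automated test, a unit test, the code below sends example inputs into a function and checks the output. `truncate_review` and the test name are hypothetical. The test uses plain `assert` statements, so it can be called directly or collected by a test runner such as pytest.

```python
def truncate_review(review, max_length=512):
    """Return the review cut down to at most `max_length` characters."""
    return review[:max_length]


def test_truncate_review():
    # A short review is returned unchanged.
    assert truncate_review("great product") == "great product"
    # A long review is cut to exactly the maximum length.
    assert len(truncate_review("x" * 1000)) == 512
    # An empty review does not cause an error.
    assert truncate_review("") == ""


test_truncate_review()
```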

Data in This Book

Throughout this book, I’ll use data from the United Nations Sustainable Development Goals (SDGs). The SDGs are a set of 17 goals that are part of the 2030 Agenda for Sustainable Development adopted by UN member nations in 2015. The goals include ending poverty, ending hunger, access to education, gender equality, and many more. The SDGs are divided into subsidiary targets and tracked using a set of more than 200 statistical indicators. The indicators measure progress toward these goals quantitatively.

For example, Goal 1 is “End poverty in all its forms everywhere.” Target 1.1 is “By 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.25 a day.” Indicator 1.1.1 is “Proportion of the population living below the international poverty line by sex, age, employment status and geographic location (urban/rural).”

The data for these indicators is available in an online database and via an API. I’ll use the data from the indicators in the sample code in this book.

Key Takeaways

Writing good code will help you in many ways: it will be easier for other people to use your code; it will help you understand what you were doing when you come back to your work six months after you last touched it; and it will help your code scale up and interface with a larger system. Good code also will make your life much easier if you need to add features to your code that weren’t in the original project plan.

If you’d like to read more about the principles for writing good code, I recommend these books:

The Pragmatic Programmer, 20th Anniversary Edition, by David Thomas and Andrew Hunt (Addison-Wesley Professional)
A Philosophy of Software Design by John Ousterhout (Yaknyam Press)

In summary, here are some ways to think about how to write good code:

Simplicity: Your code should avoid repetition, unnecessary complexity, and unneeded lines of code.
Modularity: Your code should be broken down into logical functions, with well-defined inputs and outputs.
Readability: Your code should follow the PEP8 standard for formatting, contain well-chosen names, and be well documented.
Performance: Your code should not take an unnecessarily long time to run or use up more resources than are available.
Robustness: Your code should be reproducible, raise useful error messages, and handle unexpected inputs without failing.

In the next chapter, I’ll look in more detail at one aspect of good code: performance.

Software Engineering for Data Scientists by Catherine Nelson