Preface

Programming is a force multiplier. We can write computer programs to free ourselves from tedious manual tasks and to accelerate research. Programming in any language will likely improve your productivity, but each language has different learning curves and tools that improve or impede the process of coding.

There is an adage in business that says you have three choices:

  1. Fast

  2. Good

  3. Cheap

Pick any two.

When it comes to programming languages, Python hits a sweet spot in that it’s fast because it’s fairly easy to learn and to write a working prototype of an idea—it’s pretty much always the first language I’ll use to write any program. I find Python to be cheap because my programs will usually run well enough on commodity hardware like my laptop or a tiny AWS instance. However, I would contend that it’s not necessarily easy to make good programs using Python because the language itself is fairly lax. For instance, it allows one to mix characters and numbers in operations that will crash the program.

This book has been written for the aspiring bioinformatics programmer who wants to learn about Python’s best practices and tools such as the following:

  • Since Python 3.6, you can add type hints to indicate, for instance, that a variable should be a type like a number or a list, and you can use the mypy tool to ensure the types are used correctly.

  • Testing frameworks like pytest can exercise your code with both good and bad data to ensure that it reacts in some predictable way.

  • Tools like pylint and flake8 can find potential errors and stylistic problems that would make your programs more difficult to understand.

  • The argparse module can document and validate the arguments to your programs.

  • The Python ecosystem allows you to leverage hundreds of existing modules like Biopython to shorten programs and make them more reliable.

Using these tools practices individually will improve your programs, but combining them all will improve your code in compounding ways. This book is not a textbook on bioinformatics per se. The focus is on what Python offers that makes it suitable for writing scientific programs that are reproducible. That is, I’ll show you how to design and test programs that will always produce the same outputs given the same inputs. Bioinformatics is saturated with poorly written, undocumented programs, and my goal is to reverse this trend, one program at a time.

The criteria for program reproducibility include:

Parameters

All program parameters can be set as runtime arguments. This means no hard-coded values which would require changing the source code to change the program’s behavior.

Documentation

A program should respond to a --help argument by printing the parameters and usage.

Testing

You should be able to run a test suite that proves the code meets some specifications

You might expect that this would logically lead to programs that are perhaps correct, but alas, as Edsger Dijkstra famously said, “Program testing can be used to show the presence of bugs, but never to show their absence!”

Most bioinformaticians are either scientists who’ve learned programming or programmers who’ve learned biology (or people like me who had to learn both). No matter how you’ve come to the field of bioinformatics, I want to show you practical programming techniques that will help you write correct programs quickly. I’ll start with how to write programs that document and validate their arguments. Then I’ll show how to write and run tests to ensure the programs do what they purport.

For instance, the first chapter shows you how to report the tetranucleotide frequency from a string of DNA. Sounds pretty simple, right? It’s a trivial idea, but I’ll take about 40 pages to show how to structure, document, and test this program. I’ll spend a lot of time on how to write and test several different versions of the program so that I can explore many aspects of Python data structures, syntax, modules, and tools.

Who Should Read This?

You should read this book if you care about the craft of programming, and if you want to learn how to write programs that produce documentation, validate their parameters, fail gracefully, and work reliably. Testing is a key skill both for understanding your code and for verifying its correctness. I’ll show you how to use the tests I’ve written as well as how to write tests for your programs.

To get the most out of this book, you should already have a solid understanding of Python. I will build on the skills I taught in Tiny Python Projects (Manning, 2020), where I show how to use Python data structures like strings, lists, tuples, dictionaries, sets, and named tuples. You need not be an expert in Python, but I definitely will push you to understand some advanced concepts I introduce in that book, such as types, regular expressions, and ideas about higher-order functions, along with testing and how to use tools like pylint, flake8, yapf, and pytest to check style, syntax, and correctness. One notable difference is that I will consistently use type annotations for all code in this book and will use the mypy tool to ensure the correct use of types.

Programming Style: Why I Avoid OOP and Exceptions

I tend to avoid object-oriented programming (OOP). If you don’t know what OOP means, that’s OK. Python itself is an OO language, and almost every element from a string to a set is technically an object with internal state and methods. You will encounter enough objects to get a feel for what OOP means, but the programs I present will mostly avoid using objects to represent ideas.

That said, Chapter 1 shows how to use a class to represent a complex data structure. The class allows me to define a data structure with type annotations so that I can verify that I’m using the data types correctly. It does help to understand a bit about OOP. For instance, classes define the attributes of an object, and classes can inherit attributes from parent classes, but this essentially describes the limits of how and why I use OOP in Python. If you don’t entirely follow that right now, don’t worry. You’ll understand it once you see it.

Instead of object-oriented code, I demonstrate programs composed almost entirely of functions. These functions are also pure in that they will only act on the values given to them. That is, pure functions never rely on some hidden, mutable state like global variables, and they will always return the same values given the same arguments. Additionally, every function will have an associated test that I can run to verify it behaves predictably. It’s my opinion that this leads to shorter programs that are more transparent and easier to test than solutions written using OOP. You may disagree and are of course welcome to write your solutions using whatever style of programming you prefer, so long as they pass the tests. The Python Functional Programming HOWTO documentation makes a good case for why Python is suited for functional programming (FP).

Finally, the programs in this book also avoid the use of exceptions, which I think is appropriate for short programs you write for personal use. Managing exceptions so that they don’t interrupt the flow of a program adds another level of complexity that I feel detracts from one’s ability to understand a program. I’m generally unhappy with how to write functions in Python that return errors. Many people would raise an exception and let a try/catch block handle the mistakes. If I feel an exception is warranted, I will often choose to not catch it, instead letting the program crash. In this respect, I’m following an idea from Joe Armstrong, the creator of the Erlang language, who said, “The Erlang way is to write the happy path, and not write twisty little passages full of error correcting code.”

If you choose to write programs and modules for public release, you will need to learn much more about exceptions and error handling, but that’s beyond the scope of this book.

Structure

The book is divided into two main parts. The first part tackles 14 of the programming challenges found at the Rosalind.info website.1 The second part shows more complicated programs that demonstrate other patterns or concepts I feel are important in bioinformatics. Every chapter of the book describes a coding challenge for you to write and provides a test suite for you to determine when you’ve written a working program.

Although the “Zen of Python” says “There should be one—and preferably only one—obvious way to do it,” I believe you can learn quite a bit by attempting many different approaches to a problem. Perl was my gateway into bioinformatics, and the Perl community’s spirit of “There’s More Than One Way To Do It” (TMTOWTDI) still resonates with me. I generally follow a theme-and-variations approach to each chapter, showing many solutions to explore different aspects of Python syntax and data structures.

Test-Driven Development

More than the act of testing, the act of designing tests is one of the best bug preventers known. The thinking that must be done to create a useful test can discover and eliminate bugs before they are coded—indeed, test-design thinking can discover and eliminate bugs at every stage in the creation of software, from conception to specification, to design, coding, and the rest.

Boris Beizer, Software Testing Techniques (Thompson Computer Press)

Underlying all my experimentation will be test suites that I’ll constantly run to ensure the programs continue to work correctly. Whenever I have the opportunity, I try to teach test-driven development (TDD), an idea explained in a book by that title written by Kent Beck (Addison-Wesley, 2002). TDD advocates writing tests for code before writing the code. The typical cycle involves the following:

  1. Add a test.

  2. Run all tests and see if the new test fails.

  3. Write the code.

  4. Run tests.

  5. Refactor code.

  6. Repeat.

In the book’s GitHub repository, you’ll find the tests for each program you’ll write. I’ll explain how to run and write tests, and I hope by the end of the material you’ll believe in the common sense and basic decency of using TDD. I hope that thinking about tests first will start to change the way you understand and explore coding.

Using the Command Line and Installing Python

My experience in bioinformatics has always been centered around the Unix command line. Much of my day-to-day work has been on some flavor of Linux server, stitching together existing command-line programs using shell scripts, Perl, and Python. While I might write and debug a program or a pipeline on my laptop, I will often deploy my tools to a high-performance compute (HPC) cluster where a scheduler will run my programs asynchronously, often in the middle of the night or over a weekend and without any supervision or intervention by me. Additionally, all my work building databases and websites and administering servers is done entirely from the command line, so I feel strongly that you need to master this environment to be successful in bioinformatics.

I used a Macintosh to write and test all the material for this book, and macOS has the Terminal app you can use for a command line. I have also tested all the programs using various Linux distributions, and the GitHub repository includes instructions on how to use a Linux virtual machine with Docker. Additionally, I tested all the programs on Windows 10 using the Ubuntu distribution Windows Subsystem for Linux (WSL) version 1. I highly recommend WSL for Windows users to have a true Unix command line, but Windows shells like cmd.exe, PowerShell, and Git Bash can sometimes work sufficiently well for some programs.

I would encourage you to explore integrated development environments (IDEs) like VS Code, PyCharm, or Spyder to help you write, run, and test your programs. These tools integrate text editors, help documentation, and terminals. Although I wrote all the programs, tests, and even this book using the vim editor in a terminal, most people would probably prefer to use at least a more modern text editor like Sublime, TextMate, or Notepad++.

I wrote and tested all the examples using Python versions 3.8.6 and 3.9.1. Some examples use Python syntax that was not present in version 3.6, so I would recommend you not use that version. Python version 2.x is no longer supported and should not be used. I tend to get the latest version of Python 3 from the Python download page, but I’ve also had success using the Anaconda Python distribution. You may have a package manager like apt on Ubuntu or brew on Mac that can install a recent version, or you may choose to build from source. Whatever your platform and installation method, I would recommend you try to use the most recent version as the language continues to change, mostly for the better.

Note that I’ve chosen to present the programs as command-line programs and not as Jupyter Notebooks for several reasons. I like Notebooks for data exploration, but the source code for Notebooks is stored in JavaScript Object Notation (JSON) and not as line-oriented text. This makes it very difficult to use tools like diff to find the differences between two Notebooks. Also, Notebooks cannot be parameterized, meaning I cannot pass in arguments from outside the program to change the behavior but instead have to change the source code itself. This makes the programs inflexible and automated testing impossible. While I encourage you to explore Notebooks, especially as an interactive way to run Python, I will focus on how to write command-line programs.

Getting the Code and Tests

All the code and tests are available from the book’s GitHub repository. You can use the program Git (which you may need to install) to copy the code to your computer with the following command. This will create a new directory called biofx_python on your computer with the contents of the repository:

$ git clone https://github.com/kyclark/biofx_python

If you enjoy using an IDE, it may be possible to clone the repository through that interface, as shown in Figure P-1. Many IDEs can help you manage projects and write code, but they all work differently. To keep things simple, I will show how to use the command line to accomplish most tasks.

mpfb 0001
Figure P-1. The PyCharm tool can directly clone the GitHub repository for you

Some tools, like PyCharm, may automatically try to create a virtual environment inside the project directory. This is a way to insulate the version of Python and modules from other projects on your computer. Whether or not you use virtual environments is a personal preference. It is not a requirement to use them.

You may prefer to make a copy of the code in your own account so that you can track your changes and share your solutions with others. This is called forking because you’re breaking off from my code and adding your programs to the repository.

To fork my GitHub repository, do the following:

  1. Create an account on GitHub.com.

  2. Go to https://github.com/kyclark/biofx_python.

  3. Click the Fork button in the upper-right corner (see Figure P-2) to make a copy of the repository in your account.

mpfb 0002
Figure P-2. The Fork button on my GitHub repository will make a copy of the code in your account

Now that you have a copy of all my code in your repository, you can use Git to copy that code to your computer. Be sure to replace YOUR_GITHUB_ID with your actual GitHub ID:

$ git clone https://github.com/YOUR_GITHUB_ID/biofx_python

I may update the repo after you make your copy. If you would like to be able to get those updates, you will need to configure Git to set my repository as an upstream source. To do so, after you have cloned your repository to your computer, go into your biofx_python directory:

$ cd biofx_python

Then execute this command:

$ git remote add upstream https://github.com/kyclark/biofx_python.git

Whenever you would like to update your repository from mine, you can execute this command:

$ git pull upstream main

Installing Modules

You will need to install several Python modules and tools. I’ve included a requirements.txt file in the top level of the repository. This file lists all the modules needed to run the programs in the book. Some IDEs may detect this file and offer to install these for you, or you can use the following command:

$ python3 -m pip install -r requirements.txt

Or use the pip3 tool:

$ pip3 install -r requirements.txt

Sometimes pylint may complain about some of the variable names in the programs, and mypy will raise some issues when you import modules that do not have type annotations. To silence these errors, you can create initialization files in your home directory that these programs will use to customize their behavior. In the root of the source repository, there are files called pylintrc and mypy.ini that you should copy to your home directory like so:

$ cp pylintrc ~/.pylintrc
$ cp mypy.ini ~/.mypy.ini

Alternatively, you can generate a new pylintrc with the following command:

$ cd ~
$ pylint --generate-rcfile > .pylintrc

Feel free to customize these files to suit your tastes.

Installing the new.py Program

I wrote a Python program called new.py that creates Python programs. So meta, I know. I wrote this for myself and then gave it to my students because I think it’s quite difficult to start writing a program from an empty screen. The new.py program will create a new, well-structured Python program that uses the argparse module to interpret command-line arguments. It should have been installed in the preceding section with the module dependencies. If not, you can use the pip module to install it, like so:

$ python3 -m pip install new-py

You should now be able to execute new.py and see something like this:

$ new.py
usage: new.py [-h] [-n NAME] [-e EMAIL] [-p PURPOSE] [-t] [-f] [--version]
              program
new.py: error: the following arguments are required: program

Each exercise will suggest that you use new.py to start writing your new programs. For instance, in Chapter 1 you will create a program called dna.py in the 01_dna directory, like so:

$ cd 01_dna/
$ new.py dna.py
Done, see new script "dna.py".

If you then execute ./dna.py --help, you will see that it generates help documentation on how to use the program. You should open the dna.py program in your editor, modify the arguments, and add your code to satisfy the requirements of the program and the tests.

Note that it’s never a requirement that you use new.py. I only offer this as an aid to getting started. This is how I start every one of my own programs, but, while I find it useful, you may prefer to go a different route. As long as your programs pass the test suites, you are welcome to write them however you please.

Why Did I Write This Book?

Richard Hamming spent decades as a mathematician and researcher at Bell Labs. He was known for seeking out people he didn’t know and asking them about their research. Then he would ask them what they thought were the biggest, most pressing unanswered questions in their field. If their answers for both of these weren’t the same, he’d ask, “So why aren’t you working on that?”

I feel that one of the most pressing problems in bioinformatics is that much of the software is poorly written and lacks proper documentation and testing, if it has any at all. I want to show you that it’s less difficult to use types and tests and linters and formatters because it will prove easier over time to add new features and release more and better software. You will have the confidence to know for certain when your program is correct, for at least some measure of correctness.

To that end, I will demonstrate best practices in software development. Though I’m using Python as the medium, the principles apply to any language from C to R to JavaScript. The most important thing you can learn from this book is the craft of developing, testing, documenting, releasing, and supporting software, so that together we can all advance scientific research computing.

My career in bioinformatics was a product of wandering and happy accidents. I studied English literature and music in college, and then started playing with databases, HTML, and eventually learned programming on the job in the mid-1990s. By 2001, I’d become a decent Perl hacker, and I managed to get a job as a web developer for Dr. Lincoln Stein, an author of several Perl modules and books, at Cold Spring Harbor Laboratory (CSHL). He and my boss, Dr. Doreen Ware, patiently spoon-fed me enough biology to understand the programs they wanted to be written. I spent 13 years working on a comparative plant genomics database called Gramene.org, learning a decent amount of science while continuing to explore programming languages and computer science.

Lincoln was passionate about sharing everything from data and code to education. He started the Programming for Biology course at CSHL, a two-week intensive crash course to teach Unix command-line, Perl programming, and bioinformatics skills. The course is still being taught, although using Python nowadays, and I’ve had several opportunities to act as a teaching assistant. I’ve always found it rewarding to help someone learn a skill they will use to further their research.

It was during my tenure at CSHL that I met Bonnie Hurwitz, who eventually left to pursue her PhD at the University of Arizona (UA). When she started her new lab at UA, I was her first hire. Bonnie and I worked together for several years, and teaching became one of my favorite parts of the job. As with Lincoln’s course, we introduced basic programming skills to scientists who wanted to branch out into more computational approaches.

Some of the materials I wrote for these classes became the foundation for my first book, Tiny Python Projects, where I try to teach the essential elements of Python language syntax as well as how to use tests to ensure that programs are correct and reproducible—elements crucial to scientific programming. This book picks up from there and focuses on the elements of Python that will help you write programs for biology.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions, as well as codons and DNA bases.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/kyclark/biofx_python.

If you have a technical question or a problem using the code examples, please send email to .

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Mastering Python for Bioinformatics by Ken Youens-Clark (O’Reilly). Copyright 2021 Charles Kenneth Youens-Clark, 978-1-098-10088-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/mastering-bioinformatics-python.

Email to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

I want to thank the many people who have reviewed this book, including my editor, Corbin Collins; the entire production team but especially my production editor, Caitlin Ghegan; my technical reviewers, Al Scherer, Brad Fulton, Bill Lubanovic, Rangarajan Janani, and Joshua Orvis; and the many other people who provided much-appreciated feedback, including Mark Henderson, Marc Bañuls Tornero, and Dr. Scott Cain.

In my professional career, I’ve been extremely fortunate to have had many wonderful bosses, supervisors, and colleagues who’ve helped me grow and pushed me to be better. Eric Thorsen was the first person to see I had the potential to learn how to code, and he helped me learn various languages and databases as well as important lessons about sales and support. Steve Reppucci was my boss at boston.com, and he provided a much deeper understanding of Perl and Unix and how to be an honest and thoughtful team leader. Dr. Lincoln Stein at CSHL took a chance to hire someone who had no knowledge of biology to work in his lab, and he pushed me to create programs I didn’t imagine I could. Dr. Doreen Ware patiently taught me biology and pushed me to assume leadership roles and publish. Dr. Bonnie Hurwitz supported me through many years of learning about high-performance computing, more programming languages, mentoring, teaching, and writing. In every position, I also had many colleagues who taught me as much about programming as about being human, and I thank everyone who has helped me along the way.

In my personal life, I would be nowhere without my family, who have loved and supported me. My parents have shown great support throughout my life, and I surely wouldn’t be the person I am without them. Lori Kindler and I have been married 25 years, and I can’t imagine a life without her. Together we generated three offspring who have been an incredible source of delight and challenge.

1 Named for Rosalind Franklin, who should have received a Nobel Prize for her contributions to discovering the structure of DNA.

Get Mastering Python for Bioinformatics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.