Why Biologists Want to Program Computers

by James Tisdall

The students (from vice presidents to principal investigators to junior lab assistants) who attend my courses do so to learn about programming for biology research. I've often been asked to give my perspective on the benefits of learning programming, considering the expenditure of time and effort that is required to learn this new and important laboratory skill.

Over the last decade there has been an accelerating interest in acquiring programming skills on the part of biologists. My new book Beginning Perl for Bioinformatics from O'Reilly & Associates is designed to address the need for training in this area by teaching programming in the context of biologically relevant data and results.

This article will examine why a biologist would want to learn to program. There are two main reasons: scientific and economic. I hope that the discussion will also be of some use to programmers thinking of entering the bioinformatics field. But first, I'll take a short tour of some history, define some terms, and make some general comments about how programming fits into biology research.

Definitions and History

Bioinformatics is the somewhat new and rather unfortunate term that is commonly used to refer to the use of computers in biological research, especially in such fields as genomics, sequencing, and genetics. Oh well, we're stuck with it. Informatics comes from a European term for computer science.

Computational biology is an alternative term for the use of computers in biology research. While we could quibble over an exact definition of these and other terms, I tend to just take the word "bioinformatics" as all-inclusive, and use it to refer in general to the use of computers in biology research.

There is a long history of the use of computers in biology research, dating back to the early days of digital computers over 50 years ago. About ten years ago the large-scale international human genome project began to generate data of a volume and of a kind that required a carefully planned and significant computer programming effort. Since then, the field of bioinformatics has been growing at an increasing rate.

This growth can be measured in several ways. There are now many academic positions filled by bioinformaticians (also called bioinformaticists or computational biologists). Training programs have become fairly common, from workshops to Ph.D. programs to post-doctoral positions. Industrial concerns are staffing many bioinformatics positions, especially in the pharmaceutical, agricultural, and biotechnology industries. Programming staff positions in biology research organizations have increased. Coverage of bioinformatics in scientific journals; scientific conferences dedicated to or incorporating bioinformatics; bioinformatics Web sites; published books on the subject--all have seen significant and even accelerating growth.

Two Pursuits: Biology and Computer Science

There's a bit of a cultural divide between biologists and computer scientists. This is only natural; the two disciplines are really fairly different, and have their own literatures, techniques, and fundamental principles.

To combine both disciplines is necessary, but the fit is sometimes a little tight across the shoulders. It is relatively rare to find an individual who has solid academic training in both fields; most folks are trained in one, and find themselves wandering into the other field.

It is also important for researchers to understand enough of each other's discipline to successfully collaborate with one another. Biologists who know enough computer science to be able to explain a problem in terms a programmer can find useful--and programmers who know enough biology to understand what their biologist colleagues need in a program--are valuable.

Biologists learning programming can sometimes encounter significant snags. Some very talented biologists just don't "take to" programming, while others find they can absorb it without undue difficulty. For example, if you went into biology because you perceived that it was a science that didn't require too much math, then computer science may be an uncomfortable thing to learn, although I must point out that you can do a lot of programming with only a bare minimum of math. (Also, lest it be thought I am slurring biologists, it is certainly true that many are well trained in mathematics, especially statistics, and use it extensively in their work.) Finally, it's important for biologists to realize that there's a lot more to computer science than simply learning a programming language or two.

That's the bad news. The good news is that most biologists aren't interested in learning the computer science concepts used in compiler design, structural complexity theory, advanced algorithms, or such; they just want to learn enough of a programming language to do some practical, useful tasks that will advance their research. And that is possible to do without a graduate degree in computer science. It will come as no surprise that I recommend the Perl programming language as a good place to begin for such practical, results-oriented programming skills. And even those biologists who are interested in pursuing the deeper results of computer science have to start at the beginning with learning a programming language.

In my opinion, programmers going into biology often have the harder time of it. I should mention a common pitfall that programmers entering biology research encounter. Biology is subtle, and it can take lots of work to begin to get a handle on the variety of living organisms. Programmers new to the field sometimes write a perfectly good program for what turns out to be the wrong problem! I recommend to programmers that they at least study a book like Recombinant DNA, 2nd edition, by Watson, Gilman, Witkowski, and Zoller, and that they ask a lot of questions of the biologists for whom they're writing their programs.

Use Programs, Don't Write Them

You can do bioinformatics without learning how to program. It is not uncommon for bioinformatics specialists to become adept at using existing bioinformatics programs, without ever learning the programming skills necessary to actually build such tools. There are now many programs available on Web sites or elsewhere that give convenient access to a significant amount of biological data.

More and more biology researchers in all fields are finding that the use of bioinformatics computer tools has become a regular part of their research. Many, if not most, of these researchers do not have programming skills, and are getting along quite well without them. For this group, the answer to the question posed by this article is "They don't want to program computers!"

Write Programs When There's Nothing to Use

However, many biologists do want to learn how to program. They believe that programming skills can significantly help them to achieve their research goals. In research, questions often arise that could be answered or facilitated by a computer program--but the program doesn't yet exist. So a programmer is needed.

These research questions can range from straightforward ones easily programmed, all the way to complex questions requiring the invention of new algorithms. Advanced programming skills are not usually necessary, however. A basic skill set will do nicely to advance the work of most labs.

The question then becomes not "shall we program?" but rather "who's going to do the programming?" Perhaps the PI has the interest to learn and practice this new skill; this happens fairly often, despite the demands of research and grant writing, since the payoffs can be considerable. Or a staff scientist, postdoc, or student may leap into the breach. Many times a department, or institution, will maintain staff bioinformatics specialists who divide their time between the various labs that need their labor.

Programming Takes Time

There's an important fact that is sometimes underappreciated by biologists who are new to programming: programming is labor-intensive. Writing a substantial program still takes skilled people a nontrivial amount of time. Think of the time it can take to work the kinks out of a new experimental protocol; writing a significant program can often take a similar amount of work. Although powerful computer hardware is now quite inexpensive, writing the programs that make the hardware useful can still require considerable resources.

One of the reasons Perl has become a popular bioinformatics programming language lies in its suitability for rapid prototyping, that is, the ability to quickly write a working program, thus saving precious time. Despite all the attention that the computer world focuses on the clock speed in megahertz of various computer models, in today's research the most important speed to look at is, usually, the speed of programming. Choosing the right language for a programming job is an important part of this, and often (not always) Perl is the right language in a biology research setting.

Before I get pulled further into this digression on software engineering, or the study of the art of programming, let me now return to my main point, that many biologists want to learn programming.

What are the scientific reasons for this?

Scientific Reasons for Learning Bioinformatics Programming

I'll touch on these representative (not comprehensive) reasons for learning bioinformatics programming:

  • Quantity of existing data
  • Dealing with new data
  • Automating the automation
  • Evaluating many targets

Quantity of Existing Data

There is now a huge amount of basic biological data, and without computers to store and search this data, we'd be severely constrained. The best-known biological data is the map and sequence of the human genome, and there's a lot of other biological data as well, for humans as well as for other organisms. Just to use this data (and its use is an essential part of many research programs) requires computer storage, as well as programs to make it convenient to search, retrieve, and study the data. Imagine searching for a motif in the human genome (3 billion base pairs) without a computer. This point is obvious and well known, so I won't belabor it.
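
To make the motif search concrete, here is a minimal sketch, written in Python purely for illustration rather than in the Perl this article recommends. The function name find_motif and the short test sequence are made up for the example; GAATTC is the recognition site of the restriction enzyme EcoRI. The motif is treated as a regular expression, so a pattern such as GA[AT]TTC would also work.

```python
import re

def find_motif(sequence, motif):
    """Return the 1-based start positions of every match of motif in sequence.

    A zero-width lookahead is used so that overlapping matches are reported too.
    (find_motif is a hypothetical helper name, not part of any library.)
    """
    pattern = re.compile(f"(?={motif})", re.IGNORECASE)
    return [match.start() + 1 for match in pattern.finditer(sequence)]

# A made-up 28-base sequence; a real search would read a genome from a file.
dna = "ATGCGAATTCTTGAATTCGGAATTAATT"
positions = find_motif(dna, "GAATTC")
print(positions)  # [5, 13]
```

The same few lines work unchanged on a sequence millions of bases long, which is exactly the kind of task nobody would attempt by eye.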

But I will add that the specific data you want to retrieve for your research may exist in several different databases. Depending on how you want to select and compare that data, there may not be an already existing program that you can use. For all but the most common tasks, you may need to write your own programs to handle the selection and comparison of data from the database or databases of interest to you. I'll give an example of this shortly.

Dealing with New Data

Many biologists find that their laboratory notebooks still work just fine for recording their results.

But for an increasing number of researchers, new laboratory techniques (microarrays, dHPLC, gene chips, high-throughput sequencing, and so on) are generating a volume of data that requires computer technology. This can run the gamut from fairly simple tools such as spreadsheets, all the way to complex relational databases, and beyond. Designing and implementing complex databases is a specialized skill, and a career, in itself; but most programmers have at least a working knowledge of these skills.

Some biology research areas have long had a need for computing with large amounts of experimental data. For instance, the determination of a protein structure by X-ray crystallography generates large amounts of data, and requires sophisticated algorithms to determine the structure from that data.

The point is that handling the experimental data of a lab may require computer programming to make the data readily accessible, to perform statistical analyses of the data, and to share the data with colleagues. Many research grants now include provisions to make the data and results of a project available for public inspection. This is often accomplished by storing the data in a database, and providing a Web page and interactive programs to enable visitors to explore the results.

A very common approach to this task is to use the no-cost open source Perl language for the programming (using the ever popular CGI.pm module), perhaps combined with the no-cost Apache Web server and a no-cost database such as MySQL or PostgreSQL and a no-cost platform such as Linux or BSD. In other words, apart from the hardware, all the software is free (and high quality), which does wonders for the lab budget. For those with Macintosh or Windows computers, the same approach will also work, as Perl, Apache, and the databases are also available on those platforms.
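
As a rough illustration of the database half of such a setup, the sketch below stores a handful of measurements and computes a per-sample summary. It is written in Python with the built-in SQLite module standing in for MySQL or PostgreSQL, so it is not the Perl stack described above. The table name, column names, and readings are all invented for the example.

```python
import sqlite3
import statistics

# In-memory database for the sketch; a real lab would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurement ("
    "sample TEXT NOT NULL, replicate INTEGER NOT NULL, value REAL NOT NULL)"
)

# Made-up readings, purely illustrative.
rows = [
    ("control", 1, 1.02), ("control", 2, 0.98), ("control", 3, 1.00),
    ("treated", 1, 2.10), ("treated", 2, 1.90), ("treated", 3, 2.00),
]
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)", rows)
conn.commit()

def summarize(connection):
    """Return {sample: (mean, sample standard deviation)} for every sample."""
    summary = {}
    samples = [row[0] for row in connection.execute(
        "SELECT DISTINCT sample FROM measurement ORDER BY sample")]
    for sample in samples:
        values = [row[0] for row in connection.execute(
            "SELECT value FROM measurement WHERE sample = ? ORDER BY replicate",
            (sample,))]
        summary[sample] = (statistics.mean(values), statistics.stdev(values))
    return summary

for sample, (mean, sd) in summarize(conn).items():
    print(f"{sample}: mean={mean:.2f} sd={sd:.2f}")
```

A Web page that lets visitors explore the results would sit on top of queries like these; that part is left out to keep the sketch short.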

Automating the Automation

One of the most valuable, time-saving programming skills that a biologist can learn is how to automate the automation. This means writing programs that run, and collect the output from, other programs, thereby eliminating the need to run the other programs in person while sitting at the computer.
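
A minimal sketch of the idea, in Python for illustration: one program launches another, feeds it input, and collects its output. The "analysis program" here is only a stand-in (a one-line script that reports the length of its input), and the names run_tool and length_tool are hypothetical. In practice the command list would hold the command line of whatever tool the lab actually uses.

```python
import subprocess
import sys

def run_tool(command, input_text):
    """Run one external program, send input_text to its standard input,
    and return whatever it printed, with surrounding whitespace removed."""
    result = subprocess.run(
        command, input=input_text, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

# Stand-in for a real analysis program (an assumption for this sketch):
# a tiny script that prints the number of characters it was given.
length_tool = [sys.executable, "-c",
               "import sys; print(len(sys.stdin.read().strip()))"]

# Made-up input sequences.
sequences = {"seq1": "ATGCGT", "seq2": "GGGCCCAAATTT"}

# Run the tool once per sequence and collect the results by name.
results = {name: run_tool(length_tool, seq) for name, seq in sequences.items()}
print(results)  # {'seq1': '6', 'seq2': '12'}
```

Passing check=True makes a failing tool raise an error instead of silently returning empty output, which matters when nobody is watching the screen.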

Let's give a simple example of a case in which you have some large number of items that need to be examined closely by a range of programs.

Evaluating Many Targets

Say you have two hundred candidate targets for some biological experiment. The experiment is lengthy and costly, and so you need to evaluate the targets in order to select the most promising ones. You have several programs whose combined results allow you to make a reasonable selection. If running the needed programs, and examining and prioritizing the results, takes an hour per target, then you've got about a month's worth of work sitting in front of a computer screen ahead of you.

So instead, you write a program. This program takes each target one at a time, then runs the auxiliary programs and collects and collates the results. After it's done, it prioritizes the targets according to the criteria you have programmed, and then it presents you with a "top ten" list of targets. The program may take a couple of days, or a week, to write; it takes 30 minutes to run. You've saved yourself the rest of the month for doing "real" biology, actually performing the experiments on the targets; and you're on track to get to the finish line three weeks sooner.

Economic Reasons for Learning Bioinformatics Programming

Bioinformatics skills are commanding a premium in the marketplace, with a lot of the demand coming from the private sector.

For many biologists now getting their training in graduate school, or doing their postdocs, it is an unpleasant fact that an oversupply of trained people, compared to the demand, may result in a relatively low rate of pay, depending of course on their area of specialization. Reports in leading journals have decried the overproduction of biology Ph.D.s relative to the level of funding for biology in general. For some of these biologists, bioinformatics skills can significantly enhance their job prospects and their salaries, because there is a shortage of trained bioinformatics people relative to the demand.

And that fact of supply and demand in the labor market for biology researchers is the economic reason that biologists want to learn programming. I've even seen young Ph.D.s in my classes who, despairing of finding a decent position, are learning programming with the intention of leaving biology research altogether. Of course, we could all lobby for an increase in funding for biology research. (This is the golden age of biological research, when such funding can be expected to yield great results.) But the competition for jobs and grants is likely to remain quite heated for some time to come.

It is not hard to find (say, by searching Google for the words "bioinformatics salary") lots of reports on the premium being paid for trained bioinformaticians, and especially for experienced bioinformaticians. The field is simply growing faster than the number of qualified individuals who can fill the need for bioinformatics skills.

Trends and Predictions

The upward trend of the use of computers in biology research has been going on for several years now. In the time-honored tradition of futurists everywhere, I predict that the current trend will continue!

Actually, there are solid reasons to suppose that the prediction is true. If a simple appeal to authority is acceptable, then you should note that continuing growth in bioinformatics has been predicted by many scientific leaders, commissions, universities, businesses, and granting agencies. Nor, come to think of it, can I recall anyone making a prediction of no growth, or decline, in the demand for bioinformatics. I'm not an economist, so I'll defer on the details of such predictions to those who are.

If your work is in biology research, then you must decide for yourself whether learning programming is a practical and effective way to advance your research. That is the bottom line, after all. For many bench experimentalists, programming is this kind of useful and productive research skill.

James Tisdall has worked as a musician, as a programmer and Member of Technical Staff at Bell Labs (where he programmed for speech research and discovered a formal language for musical rhythm), as a programmer and systems manager at the Human Genome Project in the Computational Biology and Informatics Laboratory (where he began using Perl for bioinformatics in 1991 with his program DNA WorkBench), as a computational biologist at Mercator Genetics in Menlo Park, California (where his Perl programs helped discover the gene involved in the common hereditary disease hemochromatosis), as manager of bioinformatics at the Fox Chase Cancer Center in Philadelphia, and most recently as a consultant for Biocomputing Associates of Kimberton, Pennsylvania, and the Burke Medical Research Institute affiliated with Cornell University, working on neurodegenerative diseases such as Alzheimer's and Parkinson's.

O'Reilly & Associates will soon release (October 2001) Beginning Perl for Bioinformatics.