An Interview with Cynthia Gibas and Per Jambeck
by Bruce Stewart
Stewart: What exactly is bioinformatics?
Gibas: Bioinformatics has become a catchall for any application of computational tools to study biological problems. More correctly, the term "bioinformatics" refers to information management (development of biological databases, DB interfaces, and query tools) and data mining (tools for detection of patterns in biological data). The broader term "computational biology," which you'll often hear in connection with bioinformatics, refers to methods such as physics-based simulation of molecular motions and interactions, or mathematical simulation of biochemical and cellular processes.
Stewart: Are your backgrounds rooted in biology or computer science?
Jambeck: Actually, computational biologists come from many disciplines, and some of the greatest contributions have come from people who were originally physicists, mathematicians, engineers, and English scholars before they were drawn to biology. As long as you're willing to learn the biology, you can come from any discipline and make a real contribution to bioinformatics.
Gibas: That said, we both have a background in experimental biology. Per's background is almost evenly split between biophysics and computer science. My background is in chemistry and biophysics, a common background for computational biologists, especially those who are interested in molecular structure. Biophysicists are used to thinking about biological problems in a quantitative way, and that skill set translates well to computational biology and bioinformatics.
Stewart: Who should read your book, Developing Bioinformatics Computer Skills?
Jambeck: The book was intended for an audience who, while educated in their home discipline, may not be familiar with the ocean of methods used in bioinformatics.
Gibas: When we started writing this book, we were working in an academic research group with a focus on bioinformatics and computational biology. We often found ourselves in the position of both orienting new students and collaborating with biologists who did experimental work. These collaborators and students came from a wide variety of backgrounds: biology and biochemistry of course, but also computational biology, physical sciences, and engineering.
What we often wished for was a manual that would outline for students all of the things they needed to learn to work in the group. We sat down and made a list, which started with Unix, Web literacy, and the ability to intelligently organize large data sets on a computer, and we included topics as wide-ranging as fundamental database principles, data visualization methods, and elementary programming concepts. Many students needed to learn or be reminded of what DNA is, how it gets translated into protein, and why the order of the characters in a 1-dimensional sequence is important in the 3-dimensional world of biology. And that was before we even got to the main bioinformatics tools: sequence analysis, molecular structure analysis, and modeling based on sequence similarity. We thought we'd better point out that bioinformatics doesn't begin and end with sequence, and that the whole landscape of biology is changing to be more information-oriented.
So, this book is for anyone who wonders what bioinformatics is and where to start learning the skills they might need to work in a bioinformatics research environment. We hope it helps.
Stewart: What skills do both biologists and computer scientists bring to the bioinformatics table? What can each tell the other?
Gibas: First and foremost, bioinformatics is a biological science. It has to do with answering biological questions. Biologists have a huge knowledge base of information about how living things work: They need to be able to communicate that knowledge base to their colleagues from other disciplines. Further, they need to be able to describe the problems that are important.
Computer scientists have expertise in areas with which biologists are often unfamiliar: how to design efficient algorithms, how to standardize and store large datasets in databases, how to develop tools to query those databases, and how to present those tools to users in a usable interface. Computer scientists also have a world-view that lends itself to abstraction and generalization of problems. Biologists tend to look at every system (especially "their" system) as unique; computer scientists (and physical scientists, mathematicians, etc.) are trained to see patterns and develop abstractions that are useful for modeling.
Still, biology is a science of the specific, and the main purpose of bioinformatics is to make it easier for biologists to solve real biological problems. Biologists understand how living things work. They know the rules, rules derived from experimental observation. They know which information in a dataset is important to them and which is just decoration. Biologists provide validation for computer methods, whether of the usefulness of an interface design, or of a model's ability to explain or be tested by experimental observations.
Stewart: What are the primary tools of bioinformatics?
Jambeck: The core tools of the bioinformatics researcher are probably sequence comparison, databases, and visualization. More generally, we would say they are statistics, applied mathematics (particularly linear algebra and discrete math), a knowledge of algorithmics, and expertise in at least one area of biology.
The bioinformatics tool that most people will come in contact with first is sequence comparison. The way they'll use it is through a Web interface which allows them to search the biological sequence databases. Understanding how to set up a sequence search and interpret search results is now a vital skill for anyone doing molecular biology. In addition, many of the other computational tools that are commonly used--phylogenetic analysis, sequence profiles, homology modeling--are based on sequence comparison. It's the most basic method in bioinformatics.
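To make the idea of sequence comparison concrete, here is a minimal sketch in Python of local alignment scoring in the style of the Smith-Waterman algorithm. It is an illustration only, not the method behind any particular Web search tool: the function name and the match, mismatch, and gap values are assumptions chosen for readability, and real tools use substitution matrices and more refined gap penalties.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (Smith-Waterman style).

    The scoring values are illustrative assumptions, not a standard matrix.
    """
    # prev[j] holds the best score of a local alignment ending at a[i-2], b[j-1]
    # (the previous row of the dynamic-programming table).
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share the stretch "ACGT" and nothing else.
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A higher score means the two sequences share a longer or better-conserved region; interpreting whether a score is significant is a separate statistical question.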
Obviously, databases are a major part of bioinformatics. Designing databases that support new kinds of scientific inquiry is a major theme in bioinformatics, and there are a lot of "value-added" databases which build specialized analytical tools into the database query process in an attempt to provide expert analysis automatically. A good example of this integration of data and tools is the SDSC Biology Workbench, developed by Shankar Subramaniam and coworkers. It connects sequence and structure analysis tools with a number of databases, allowing even novice users to readily carry out useful research.
Visualization tools are also tremendously important; the average biologist doesn't have the expertise to immediately make sense of a mass of results from computational analysis (for example, the results of comparing two extremely long sequences). Development of software that can present such results in a way that is interpretable by humans is another major theme in bioinformatics.
Stewart: What role do computer-generated visualization techniques play in bioinformatics?
Jambeck: Visualization is a vital part of understanding biological data. You may be able to generate a huge list of results with a computer program, but unless you can interpret them sensibly, it will be difficult to use them to draw any conclusions. Visualization has been important to structural biology for decades, since molecular structures can be naturally represented as rotatable, three-dimensional objects. Higher-order analyses of genomic information, like gene expression data, comparisons between multiple genomes, and representation of interconnected metabolic pathways, have even greater complexity, and visualization of that kind of information is definitely a major theme in bioinformatics.
Stewart: What is BLAST?
Jambeck: BLAST is the collective name for a group of sequence comparison tools developed by Steven Altschul (National Center for Biotechnology Information), Warren Gish (Washington University), and coworkers, based on work done by Altschul, Samuel Karlin, and Amir Dembo at Stanford University.
Given a DNA or protein sequence, BLAST searches a database for similar sequences. It's kind of like a search engine for biological sequence data. Biology is very much concerned with finding things that are significantly similar to other things. If you have two genes and their sequences are almost identical, you can bet that they probably do the same thing. This knowledge is very important if you have a new gene sequence and you want to know what it does.
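The "search engine" analogy can be illustrated with a toy database search. The sketch below ranks sequences by how many short words (k-mers) they share with a query. This is a deliberate simplification and not BLAST itself, which goes on to extend word matches into alignments and to assess their statistical significance. The function names, the word length of 3, and the sequences in `toy_db` are all made up for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings (words) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query, database, k=3):
    """Rank database entries by how many k-letter words they share with query.

    database maps a (hypothetical) sequence name to its sequence string.
    Returns (name, shared_word_count) pairs, best match first; entries
    sharing no words with the query are dropped.
    """
    query_words = kmers(query, k)
    hits = []
    for name, seq in database.items():
        shared = len(query_words & kmers(seq, k))
        if shared > 0:
            hits.append((name, shared))
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Invented example sequences, not real genes.
toy_db = {
    "seq_a": "ATGGCGTACGTTAG",
    "seq_b": "ATGGCGTACCTTAG",
    "seq_c": "CCCCCCCCCCCCCC",
}
print(rank_by_shared_words("GCGTACGTT", toy_db))
```

Here `seq_a` contains the query exactly and ranks first, `seq_b` differs by one letter and ranks second, and `seq_c` shares no words and is not reported.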
Stewart: There's been a great deal of excitement about the sequencing of the human genome. Now that the sequence is known, what is the next problem? How will computational methods help us solve it?
Jambeck: Well, the big problem now is making sense of it. What you get from genome sequencing is a list of genes and their complete sequences, and the list of genes isn't even complete because we still can't identify unknown genes in a DNA sequence with perfect confidence. It's like having a very tersely written encyclopedia in an utterly alien language.
Nevertheless, having the genome sequence is an important step in building a comprehensive understanding of how a living thing works. From there, knowledge from experiments and predictions must be integrated to answer questions: When is each gene turned on and off? How many times is each gene translated when it's turned on? What does the product of the gene do biochemically? What other gene products does it interact with? And so on. This information continues to be collected in laboratories all over the world, and the challenge for bioinformatics is to help assemble it all into a coherent picture.
Stewart: Are there any privacy risks when dealing with gene and genome information?
Gibas: There will be. Right now, much of the concern centers not on personal privacy but on who "owns" particular gene sequences and who can make money from them. Privacy risks to individuals will become an issue as we figure out where markers for diseases are on the genome and how to detect them rapidly in the genome sequence. As with any other health-related information, we as a society will have to decide who "owns" a particular person's DNA sequence and how much control that person will have over how their DNA sequence information is distributed and used. It behooves people working with genetics to stay aware of policy and public perception of their area.
Stewart: Overall, what kinds of biological problems can computational methods solve, or give us insights into solving? What are the real problems? What breakthroughs or interesting news might we expect to hear about over the next five years?
Gibas: Computational methods will be very useful in making sense of biochemical pathways, and they will continue to lend insight into macromolecular structure. The main effect of bioinformatics on biological research, though, will be one of acceleration. When used properly, bioinformatics methods help researchers to organize knowledge and use it to make informed decisions in their research. Access to genomic information and its integration with other experimental results means that we'll have a more detailed understanding of why and how genes work. More importantly, for society in general, this knowledge will translate into a speedup in the discovery of disease treatments and safer innovations in biotechnology.
Stewart: Over the last five years there's been a lot of license wrangling in the open source world. Given that we're just at the beginning of the gene and genome patenting cycle, do you see anything that the OS licensing wars can teach biologists?
Jambeck: There's a lot of excitement in that area now. Sean Eddy of Washington University and Ewan Birney of EBI have argued persuasively for improved data availability in genomics. Their open letter is available online.
Phil Bourne (co-director of the Research Collaboratory for Structural Bioinformatics) has pointed out that there was a similar struggle in structural biology during the late 1980s. The three-dimensional structure of a macromolecule can be an incredibly difficult thing to solve, and after investing lots of effort into solving a structure, researchers were often less than thrilled about giving away their hard-won data. The matter was resolved when the major journals refused to publish articles describing data which was not made publicly available within a few months of publication. That decision has led to structural biology becoming one of the most fruitful and competitive areas of research. Hopefully, these fights over ownership of data will have a similarly happy ending.
Gibas: As far as software goes, some researchers have done an excellent job of distributing their programs and source code. We hope to see this trend increase. You can describe a program and leave out seemingly minor details that effectively cripple another researcher's ability to replicate your experiments. Especially in publicly funded scientific research, the standard is that researchers should report their experimental methods in a way that would allow their results to be independently verified. As we see it, bioinformatics software should be open source, at least within the academic research community. Otherwise it's potentially a "black box" that can't be independently validated.
Stewart: Peer-to-peer (P2P) is a hot topic in the computing world. How might biologists take advantage of distributed computing?
Jambeck: P2P is a great idea in biology. You often have experts in specific functions or gene families at different universities. Ideally, P2P allows those researchers to combine their expertise in collaborative projects like genome annotation. Lincoln Stein, a luminary in the Perl and bioinformatics database communities, has been an active proponent of P2P in such applications.
Stewart: Are distributed computing models that harness the power of many machines to provide greater computational strength important to bioinformatics? Have any of the current distributed programs that are focused on medical science, like Popular Power's influenza research or Parabon's Compute Against Cancer, actually made an important contribution to the scientific effort?
Jambeck: If they haven't yet, we think they will. Currently there are attempts to fold proteins, find drug candidates, annotate genes, and increase understanding of diseases via distributed computing. It's a matter of finding the right representation: If you can find a problem that can be broken down into a lot of independent subproblems with fairly minimal communication, then distributed models will do brilliantly on it.
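The pattern described here, many independent subproblems with minimal communication, can be sketched in a few lines of Python. In this hedged example, each sequence is one subproblem (counting its G and C bases), and the only data sent back is a single integer per sequence. A local thread pool stands in for the remote machines of a real distributed system; the function names and the choice of GC content as the task are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(sequence):
    """Independent subproblem: count G and C bases in one sequence."""
    return sum(1 for base in sequence if base in "GC")


def total_gc_fraction(sequences, workers=4):
    """Farm each sequence out to a worker, then combine the small results.

    Threads stand in here for the remote machines of a real distributed
    system; the only communication is one integer returned per sequence.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(gc_count, sequences))
    total_bases = sum(len(s) for s in sequences)
    # Guard against an empty input so we never divide by zero.
    return sum(counts) / total_bases if total_bases else 0.0


print(total_gc_fraction(["ATGC", "GGCC", "ATAT"]))
```

Because no subproblem needs to know anything about the others, the work can be spread over as many workers as are available. Problems whose pieces must exchange data constantly fit this model much less well.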
Stewart: A recent article by George Johnson from The New York Times makes the claim that all science is computer science and suggests that, with the genome project, biology has become today's most demanding computational scientific discipline. What do you think about that assessment?
Gibas: The view that "all science is computer science" might be a little extreme. However, it is quite true that all science benefits greatly from the application of a quantitative, rigorously analytical approach--the sort of approach that is best supported by computers. Biological systems are complex, and insight into their function will come not just from computer science, but from physics, mathematics, statistics, engineering--all quantitative disciplines.
Stewart: I gather the pharmaceutical industry is one of the primary backers of bioinformatics research programs. What do they expect to achieve from advances in the field?
Gibas: The pharmaceutical industry is hugely involved in bioinformatics, as is the agricultural biotechnology industry. Companies in these sectors are looking for faster and more specific identification of "targets" for chemical agents that can be sold at a profit, and a more effective design for new drugs. Pharmaceutical companies expend huge amounts of effort to identify even one new compound that will be the next hot thing on pharmacy shelves. They screen thousands or millions of chemicals to find one that is effective as a treatment and has no horrible side effects. Computational methods allow them to narrow the range of options that they have to search.
Other applications of genomic information in the pharmaceutical world are "nutraceuticals" and "molecular farming". There is talk about using grains or bananas to produce vaccines so people in developing countries can be vaccinated easily and cheaply. Understanding biochemical pathways well enough to modify them is what lets us do that. Also, it's a lot cheaper to produce a substance by getting a vat of genetically modified bacteria or a field of tobacco plants to produce massive quantities of it than it is to isolate the substance from a natural source.
Stewart: If you could read any book other than your own, what would it be?
Jambeck: Donald Knuth's The Art of Computer Programming, Volume 4.
Gibas: The new edition of Baxevanis & Ouellette's Bioinformatics.
Stewart: Are there any "holy grails" in bioinformatics?
Gibas: Predicting protein structure from sequence has been on the "holy grail" list for about the last 30 years. Now I'd say that the new "holy grail," the one we're all working on together, is to predict cellular function from genomic sequence. In other words, if we know what all the genes are, when they turn on, and what they do, can we put together a complete model of a working cell?
That's going to take a while, and it's not going to be accomplished by one genius working in isolation. It'll be the collaborative effort of the whole molecular biology research community. The glue that holds this collaborative effort together will be bioinformatics.
Cynthia Gibas is an assistant professor of biology at Virginia Tech, in Blacksburg, Virginia. Her research interest is in physicochemical properties of proteins and protein structure/function relationships. While at Virginia Tech, she has built a 32-node AMD Athlon-based Linux cluster from parts and helped her colleagues design curriculum options in bioinformatics. She teaches introductory courses in bioinformatics and biological sequence analysis. She has a Ph.D. in biophysics and computational biology from the University of Illinois.
Per Jambeck is a Ph.D. student in the bioengineering department at the University of California, San Diego. He has worked on computational biology since 1994, concentrating on machine-learning applications in understanding multidimensional biological data.
O'Reilly & Associates recently released Developing Bioinformatics Computer Skills (April 2001).