The glib answers to the question run something like this: "Bioinformatics is the intersection of information technology and biology"; "Bioinformatics is information management for biology"; "Bioinformatics means tools for data mining in biological databases."
But what does that really mean? Those answers leave open a lot of questions. "Information technology" and "data mining" don't really mean a whole lot to a biologist, and they don't convey a sense of the possibilities that computers create for researchers. And from the opposite perspective, "biology" doesn't mean a whole lot to a computer professional. What is it biologists do? What do they want to find out? How do they go about finding it out? And finally, what are the benefits of applying information technology to biological research, and why is bioinformatics such a hot area as a result?
What are biologists trying to find out?
One of the big questions in biology is: how does the genomic code translate into a real, live human (or animal or plant or bacterium)? For a long time, biologists didn't have access to a complete version of that code, and they had to study it one letter (or word or sentence) at a time.
The genomic code breaks down into thousands of individual genes. Genes tell cells to make proteins, individual molecules that each have a unique chemical mission. Proteins interact with each other to carry out thousands of functions, from digesting your dinner to synthesizing the small molecules that form a barrier between the inside of your cells and the outside world.
Biologists want to collect all of the information they can about every gene in every genome, and from that information construct models of how genes work together to build up and maintain a living body, whether it's a bacterium or a star quarterback.
What are the data types that are collected to answer these questions?
There are as many kinds of biological data as there are experiments. Bioinformaticians, however, can only work easily with data types that are collected systematically from the entire biological research community. The Web makes it possible to collect such data by electronic submission.
Currently, gene and genome sequences are the most abundantly collected data types, followed by protein atomic coordinates. DNA sequences are reported as strings of characters, and they are usually annotated with descriptions of features associated with particular regions of the string. Protein structures are reported as Cartesian coordinates of their atoms, with some (incompletely standardized) identifying information about the protein attached. New high-throughput experimental methods, such as DNA microarrays, produce large matrices of values that describe gene expression levels, protein-protein interactions, and other information about how genes and proteins interact in living cells.
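To make these data types concrete for a programmer, here is a minimal Python sketch of how they might be held in memory. The class names, field names, and sample values are all hypothetical and invented for illustration; they do not reflect the record format of any real database.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """An annotated region of a sequence (0-based start, end exclusive)."""
    name: str
    start: int
    end: int


@dataclass
class SequenceRecord:
    """A DNA sequence stored as a character string, plus its annotations."""
    identifier: str
    sequence: str
    features: list = field(default_factory=list)

    def feature_sequence(self, feature):
        """Return the part of the string that a feature refers to."""
        return self.sequence[feature.start:feature.end]


@dataclass
class Atom:
    """One atom of a protein structure, located by Cartesian coordinates."""
    name: str
    x: float
    y: float
    z: float


def distance(a, b):
    """Euclidean distance between two atoms."""
    return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2) ** 0.5


# Hypothetical sample data, made up for this example.
record = SequenceRecord(
    identifier="example_1",
    sequence="ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
    features=[Feature("hypothetical_gene", 0, 24)],
)
atoms = [Atom("N", 0.0, 0.0, 0.0), Atom("CA", 1.5, 0.0, 0.0)]

# An expression matrix: one row per gene, one column per experiment.
expression = {
    "gene_a": [1.0, 2.5, 0.5],
    "gene_b": [0.8, 0.9, 3.1],
}

print(record.feature_sequence(record.features[0]))
print(distance(atoms[0], atoms[1]))
```

The point of the sketch is only that sequences, structures, and expression data have quite different shapes (a string with annotated intervals, a list of points in space, and a table of numbers), which is one reason each tends to have its own databases and tools.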
How does computation support the whole enterprise?
Computers play many roles in modern biology:
Collecting and processing signals detected by laboratory equipment: DNA sequencers, CCD detectors, spectrophotometers, and just about any other device that can be connected to a computer via an analog-to-digital converter.
Tracking samples and managing experiments in industrial-style laboratories (e.g., in gene sequencing centers). Most smaller labs don't have the resources to invest in automated laboratory management, but using software to manually maintain lab-notebook-style electronic records is rapidly becoming more common.
Storing data in public databases and, more importantly, providing public access to those databases through sophisticated Web searches and deposition mechanisms. NCBI, home of GenBank, PubMed, and other public databases, is the premier example of the kind of information services that can be built on a public biological database.
Extracting patterns and rules from large data collections and using these observed patterns to characterize and predict features in new data. This is the core of bioinformatics: developing tools which can recognize pattern matches and feature signatures within an otherwise inscrutable data set.
Annotation: using automatic computational methods to assign functional meaning to uncharacterized data and to create informative links between different data collections. For example, many annotation systems use automated sequence comparison searches to identify potential genes in new genome data.
Simulation: using known information about a system, along with a mathematical or physicochemical model, to simulate properties of the system. This category is incredibly diverse, from simulating the motions of interacting protein molecules to modeling the flow of chemicals through biochemical pathways.
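As a small illustration of recognizing a feature signature in raw sequence data, the sketch below scans a DNA string for open reading frames: a start codon (ATG) followed by in-frame codons up to a stop codon. This is a deliberately simplified stand-in, not the method used by any particular annotation system; the function name and the `min_codons` parameter are assumptions made for the example, and it looks only at the forward strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of open reading frames.

    Each ORF begins at an ATG and runs through the first in-frame stop
    codon. Indices are 0-based and the end index is exclusive. Only the
    forward strand is scanned, in all three reading frames.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Close the ORF; keep it only if it is long enough.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Real gene finders combine many more signals than this (both strands, splice sites, codon usage statistics, and similarity to known genes), but the basic shape is the same: a rule learned from known data is applied to new data to propose features.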
Bioinformaticians are professional data analysts: they work with data generated by the experimental biology community and by a growing number of "data factory" projects (e.g., genome sequencing projects). The work of bioinformatics is mining this data to develop new hypotheses, new models of how biological systems function, and even rules and patterns that can be used to screen new data sets.
Bioinformatics is a subset of a larger general trend to apply systematic and quantitative methods to the analysis of biological systems, which is in turn a subset of computational science in general. Bioinformatics may be primarily about data storage and genome sequence analysis, but computational approaches are already in use across the whole spectrum of biological research. With the increasing automation of experimentation and data collection, this trend can only continue.
For Tim O'Reilly's thoughts on trends in computational science, see "Business Computing Isn't Where the Action Is Going to Be," in which Tim writes that a recent New York Times article, "All Science Is Computer Science," captures a trend he's been seeing for some time. "Every time we've had a radical lowering of the barriers of entry into a computing market, that market has exploded," says Tim. "Now, hackers and scientists are working together to break down the barriers to discovery."
As professional data analysts in a specialized field, bioinformaticians need to have a solid understanding of both computational analysis methods and the biological questions they're meant to answer. A lack of biological understanding can result in sophisticated computational methods being applied naively and in ways which aren't really helpful to biologists. A lack of analytical sophistication means that interesting features of biological data may go undiscovered.
In 1998, Dr. Russ Altman, now president of the International Society for Computational Biology, published an article called "A Curriculum for Bioinformatics: The Time is Ripe," which enumerated some of the many skills that are useful for aspiring bioinformaticians.
Critical knowledge and skills he identified for bioinformaticians include:
Established methods for sequence analysis, such as pairwise and multiple sequence alignment and construction of phylogenetic trees, sequence fragment and map assembly, and prediction or extraction of features from sequences.
Established methods for molecular structure analysis and simulation, such as geometry analysis, structure modeling, and molecular dynamics.
Computational support of laboratory biology, a broad category that could be conceived of as including everything from signal detection and processing to statistical analysis.
Design, implementation, and integration of biological databases.
Key algorithms and methods of bioinformatics, such as dynamic programming, optimization, classification and cluster analysis, and neural networks.
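To give a flavor of how dynamic programming applies to pairwise sequence alignment, here is a minimal sketch of a global alignment score computed with the classic Needleman-Wunsch style recurrence. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen for the example, and the function returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of strings a and b.

    score[i][j] holds the best score for aligning the first i characters
    of a with the first j characters of b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))
```

Each cell is filled in from three neighbors that have already been computed, so the table is built in time proportional to the product of the two sequence lengths, rather than by enumerating every possible alignment.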
Even more basic, however, are the key skills pointed out to us by some of our colleagues in the bioinformatics field:
Understanding of the scientific method: how experiments are designed and carried out to test hypotheses, and standards for reporting scientific research.
Understanding the foundations of molecular biology: how genomic information is transmitted and used in living cells.
Facility with computers: everything from the ability to learn to use new software quickly to the ability to work comfortably in a command-line (Unix) environment.
Knowledge of a programming language, such as C or C++, and a scripting language, such as Perl or Python.
The first two areas are the province of biologists; the latter two, of computer scientists. Each set of knowledge is considered basic to its field and is usually the focus of a good deal of training, generally an entire undergraduate degree. It's rare to find a combination of these skills in one individual. It sometimes seems that in order to retrain for bioinformatics, you'd need an entire new degree. But that's not very practical.
The answer to the retraining question depends on how far you want to go on the continuum from programming to scientific research.
If you're going to be a programmer on a bioinformatics project, what you need to learn is enough biology so that you can talk to biological scientists, because they will be asking you to put their ideas into action on the computer. That means knowing on a general level what the important molecules of life are (DNA, RNA, proteins, metabolites), what they're made of, and what kinds of things they do. It's also helpful to understand how the information in the genome is used in living systems by translation into molecules that subsequently interact with each other to carry out life processes.
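For programmers, the flow of information from DNA to RNA to protein can be sketched in a few lines of Python. The codon table below is a small, deliberately partial subset of the standard genetic code, just enough for the example; the function names are assumptions made for illustration, and unknown codons are marked with "X".

```python
# A partial subset of the standard genetic code ("*" marks a stop codon).
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",  # phenylalanine
    "AAA": "K",  # lysine
    "GGC": "G",  # glycine
    "GCC": "A",  # alanine
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(dna):
    """Turn coding-strand DNA into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read mRNA three letters at a time and build a protein string.

    Translation stops at the first stop codon. Codons missing from the
    partial table above are reported as "X".
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGAAATTTTAG")
print(mrna)
print(translate(mrna))
```

The cell's machinery is far more involved than a dictionary lookup, but this is the level of understanding that lets a programmer follow what a biologist means by a gene "coding for" a protein.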
Once you know these basics, then you may want to learn about some existing bioinformatics and computational biology methods and how they work. Some universities offer bioinformatics certification programs for computer professionals.
If you see yourself making the transition from programmer to scientist and actually developing new bioinformatics methods, you'll need more than a thin gloss of biology over your computer competence. This is where bioinformatics graduate programs come in. Scientists go through the arduous and life-sucking process of graduate school to do more than just take a few more classes. They're there to learn the rules and process of scientific research from hypothesis to experiment to publication. If this is the road that you choose to take, consider applying to one of the many new graduate programs in bioinformatics and computational biology. And keep an eye on the O'Reilly books catalog.
Copyright © 2009 O'Reilly Media, Inc.