Chapter 4. Sequences and Strings
In this chapter you will begin to write Perl programs that manipulate biological sequence data, that is, DNA and proteins. Once you have the sequences in the computer, you’ll start writing programs that do the following with the sequence data:
Transcribe DNA to RNA
Concatenate sequences
Make the reverse complement of sequences
Read sequence data from files
You’ll also write programs that give information about your sequences. How GC-rich is your DNA? How hydrophobic is your protein? You’ll see programming techniques you can use to answer these and similar questions.
The Perl skills you will learn in this chapter involve the basics of the language. Here are some of those basics:
Scalar variables
Array variables
String operations such as substitution and translation
Reading data from files
Representing Sequence Data
The majority of this book deals with manipulating symbols that represent the biological sequences of DNA and proteins. The symbols used in bioinformatics to represent these sequences are the same symbols biologists have been using in the literature for this same purpose.
As stated earlier, DNA is composed of four building blocks: the nucleic acids, also called nucleotides or bases. Proteins are composed of 20 building blocks, the amino acids, also called residues. Fragments of proteins are called peptides. Both DNA and proteins are essentially polymers, made from their building blocks attached end to end. So it’s possible to summarize the primary structure of a DNA molecule or protein by simply giving the sequence of bases or amino acids.
These are brief definitions; I’m assuming you are either already familiar with them or are willing to consult an introductory textbook on molecular biology for more specific details. Table 4-1 shows bases; add a sugar and you get the nucleosides adenosine, guanosine, cytidine, thymidine, and uridine. You can further add a phosphate and get the nucleotides adenylic acid, guanylic acid, cytidylic acid, thymidylic acid, and uridylic acid. A nucleic acid is a chemically linked sequence of nucleotides. A peptide is a small number of joined amino acids; a longer chain is a polypeptide. A protein is a biologically functional unit made of one or more polypeptides. A residue is an amino acid in a polypeptide chain.
For expediency, the names of the nucleic acids and the amino acids are often represented as one- or three-letter codes, as shown in Table 4-1 and Table 4-2. (This book mostly uses the one-letter codes for amino acids.)
One-letter code |
Amino acid |
Three-letter code |
A |
Alanine |
Ala |
B |
Aspartic acid or Asparagine |
Asx |
C |
Cysteine |
Cys |
D |
Aspartic acid |
Asp |
E |
Glutamic acid |
Glu |
F |
Phenylalanine |
Phe |
G |
Glycine |
Gly |
H |
Histidine |
His |
I |
Isoleucine |
Ile |
K |
Lysine |
Lys |
L |
Leucine |
Leu |
M |
Methionine |
Met |
N |
Asparagine |
Asn |
P |
Proline |
Pro |
Q |
Glutamine |
Gln |
R |
Arginine |
Arg |
S |
Serine |
Ser |
T |
Threonine |
Thr |
V |
Valine |
Val |
W |
Tryptophan |
Trp |
X |
Unknown |
Xxx |
Y |
Tyrosine |
Tyr |
Z |
Glutamic acid or Glutamine |
Glx |
The nucleic acid codes in Table 4-1 include letters for the four basic nucleic acids; they also define single letters for all possible groups of two, three, or four nucleic acids. In most cases in this book, I use only A, C, G, T, U, and N. The letters A, C, G, and T represent the nucleic acids for DNA. U replaces T when DNA is transcribed into ribonucleic acid (RNA). N is the common representation for “unknown,” as when a sequencer can’t determine a base with certainty. Later on, in Chapter 9, we’ll need the other codes, for groups of nucleic acids, when programming restriction maps. Note that the lowercase versions of these single-letter codes is also used on occasion, frequently for DNA, rarely for protein.
The computer-science terminology is a little different from the biology terminology for the codes in Table 4-1 and Table 4-2. In computer-science parlance, these tables define two alphabets , finite sets of symbols that can make strings. A sequence of symbols is called a string. For instance, this sentence is a string. A language is a (finite or infinite) set of strings. In this book, the languages are mainly DNA and protein sequence data. You often hear bioinformaticians referring to an actual sequence of DNA or protein as a “string,” as opposed to its representation as sequence data. This is an example of the terminologies of the two disciplines crossing over into one another.
As you’ve seen in the tables, we’ll be representing data as simple letters, just as written on a page. But computers actually use additional codes to represent simple letters. You won’t have to worry much about this; just remember when using your text editor to save as ASCII, or plain text.
ASCII is a way for computers to store textual (and control) data in their memory. Then when a program such as a text editor reads the data, and it knows it’s reading ASCII, it can actually draw the letters on the screen in a recognizable fashion because it’s programmed to know that particular code. So the bottom line is: ASCII is a code to represent text on a computer.[1]
A Program to Store a DNA Sequence
Let’s write a small program that stores some
DNA in a variable and prints it to the screen. The DNA is written in the
usual fashion, as a string made of the letters A, C, G, and T, and we’ll call the
variable $DNA
. In other words,
$DNA
is the name of the DNA sequence data
used in the program. Note that in Perl, a variable is really the name for some data
you wish to use. The name gives you full access to the data. Example 4-1 shows the entire
program.
#!/usr/bin/perl -w # Storing DNA in a variable, and printing it out # First we store the DNA in a variable called $DNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; # Next, we print the DNA onto the screen print $DNA; # Finally, we'll specifically tell the program to exit. exit;
Using what you’ve already learned about text editors and running Perl programs in Chapter 2, enter the code (or copy it from the book’s web site) and save it to a file. Remember to save the program as ASCII or text-only format, or Perl may have trouble reading the resulting file.
The second step is to run the program. The details of how to run a program depend on the type of computer you have (see Chapter 2). Let’s say the program is on your computer in a file called example4-1. As you recall from Chapter 2, if you are running this program on Unix or Linux, you type the following in a shell window:
perl example4-1
On a Mac, open the file with the MacPerl application and save it as a droplet, then just double-click on the droplet. On Windows, type the following in an MS-DOS command window:
perl example4-1
If you’ve successfully run the program, you’ll see the output printed on your computer screen.
Control Flow
Example 4-1 illustrates many of the ideas all our Perl programs will rely on. One of these ideas is control flow , or the order in which the statements in the program are executed by the computer.
Every program starts at the first line and executes the statements one after the other until it reaches the end, unless it is explicitly told to do otherwise. Example 4-1 simply proceeds from top to bottom, with no detours.
In later chapters, you’ll learn how programs can control the flow of execution.
Comments Revisited
Now let’s take a look at the parts of Example 4-1. You’ll notice lots of blank lines. They’re there to make the program easy for a human to read. Next, notice the comments that begin with the # sign. Remember from Chapter 3 that when Perl runs, it throws these away along with the blank lines. In fact, to Perl, the following is exactly the same program as Example 4-1:
#!/usr/bin/perl -w $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; print $DNA; exit;
In Example 4-1, I’ve made liberal use of comments. Comments at the beginning of code can make it clear what the program is for, who wrote it, and present other information that can be helpful when someone needs to understand the code. Comments also explain what each section of the code is for and sometimes give explanations on how the code achieves its goals.
It’s tempting to belabor the point about the importance of comments. Suffice it to say that in most university-level, computer-science class assignments, the program without comments typically gets a low or failing grade; also, the programmer on the job who doesn’t comment code is liable to have a short and unsuccessful career.
Command Interpretation
Because it starts with a # sign, the first line of the program looks like a comment, but it doesn’t seem like a very informative comment:
#!/usr/bin/perl -w
This is a special line called command interpretation that tells the computer
running Unix and Linux that this is a Perl program. It may look slightly
different on different computers. On some machines, it’s also unnecessary
because the computer recognizes Perl from other information. A Windows machine
is usually configured to assume that any program ending in .pl is a Perl program. In Unix or Linux, a
Windows command window, or a Mac OS X shell, you can type perl my_program
, and your Perl program my_program
won’t need the special line. However,
it’s commonly used, so we’ll have it at start all our programs.
Notice that the first line of code uses a flag -w
. The “w” stands for warnings, and it causes Perl to print messages in case of an error.
Very often the error message suggests the line number where it thinks the error
began. Sometimes the line number is wrong, but the error is usually on or just
before the line the message suggests. Later in the book, you’ll also see the
statement use
warnings
as an alternative to -w
.
Statements
The next line of Example 4-1 stores the DNA in a variable:
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
This is a very common, very important thing to do in a computer language, so let’s take a leisurely look at it. You’ll see some basic features about Perl and about programming languages in general, so this is a good place to stop skimming and actually read.
This line of code is called a statement. In Perl, statements end in a semicolon (;). The use of the semicolon is similar to the use of the period in the English language.
To be more accurate, this line of code is an assignment
statement. Its purpose in this program is to store some DNA into a
variable called $DNA
. There are several
fundamental things happening here as you will see in the next sections.
Variables
First, let’s look at the
variable $DNA
. Its name is
somewhat arbitrary. You can pick another name for it, and the program
behaves the same way. For instance, if you replace the two lines:
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; print $DNA;
with these:
$A_poem_by_Seamus_Heaney = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; print $A_poem_by_Seamus_Heaney;
the program behaves in exactly the same way, printing out the DNA to the computer screen. The point is that the names of variables in a computer program are your choice. (Within certain restrictions: in Perl, a variable name must be composed from upper- or lowercase letters, digits, and the underscore _ character. Also the first character must not be a digit.)
This is another important point along the same lines as the remarks I’ve
already made about using blank lines and comments to make your code more
easily readable by humans. The computer attaches no meaning to the use of
the variable name $DNA
instead of
$A_poem_by_Seamus_Heaney
, but whoever
reads the program certainly will. One name makes perfect sense, clearly
indicates what the variable is for in the program, and eases the chore of
understanding the program. The other name makes it unclear what the program
is doing or what the variable is for. Using well-chosen variable names is
part of what’s called
self-documenting code. You’ll still need comments, but perhaps
not as many, if you pick your variable names well.
You’ve noticed that the variable name $DNA
starts with dollar sign. In Perl this kind of variable
is called a scalar
variable, which is a variable that holds a single item of data.
Scalar variables are used for such data as strings or various kinds of
numbers (e.g., the string hello
or
numbers such as 25, 6.234, 3.5E10, -0.8373). A scalar variable holds just
one item of data at a time.
Strings
In Example 4-1, the scalar
variable $DNA
is holding
some DNA, represented in the usual way by the letters A, C, G,
and T. As stated earlier, in computer science a sequence of letters is
called a string. In Perl you designate a string by putting it in quotes. You
can use single quotes, as in Example
4-1, or double quotes. (You’ll learn the difference later.) The
DNA is thus represented by:
'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
Assignment
In Perl, to set a variable to a certain value, you use the =
sign. The =
sign is called
the assignment operator
. In Example 4-1,
the value:
'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
is assigned to the variable $DNA
. After
the assignment, you can use the name of the variable to get the value, as in
the print
statement in Example 4-1.
The order of the parts is important in an assignment statement. The value assigned to something appears on the right of the assignment operator. The variable that is assigned a value is always to the left of the assignment operator. In programming manuals, you sometimes come across the terms lvalue and rvalue to refer to the left and right sides of the assignment operator.
This use of the
=
sign has a long history in
programming languages. However, it can be a source of confusion: for
instance, in most mathematics, using =
means that the two things on either side of the sign are equal. So it’s
important to note that in Perl, the =
sign doesn’t mean equality. It assigns a value to a variable. (Later, we’ll
see how to represent equality.)
So, to summarize what we’ve learned so far about this statement:
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
It’s an assignment statement that sets the value of the
scalar variable $DNA
to a
string representing some DNA.
The statement:
print $DNA;
prints ACGGGAGGACGGGAAAATTACTACGGCATTAGC
out to the computer screen.
Notice that the print
statement deals with scalar variables by printing out their values—in this case, the
string that the variable $DNA
contains.
You’ll see more about printing later.
Exit
Finally, the statement exit;
tells the computer to exit the program. Perl doesn’t require an exit
statement at the end of a program; once you get to the
end, the program exits automatically. But it doesn’t hurt to put one in, and
it clearly indicates the program is over. You’ll see other programs that
exit if something goes wrong before the program normally finishes, so the
exit
statement is definitely
useful.
Concatenating DNA Fragments
Now we’ll make a simple modification of Example 4-1 to show how to concatenate two DNA fragments. Concatenation is attaching something to the end of something else. A biologist is well aware that joining DNA sequences is a common task in the biology lab, for instance when a clone is inserted into a cell vector or when splicing exons together during the expression of a gene. Many bioinformatics software packages have to deal with such operations; hence its choice as an example.
Example 4-2 demonstrates a few more things to do with strings, variables, and print statements.
#!/usr/bin/perl -w # Concatenating DNA # Store two DNA fragments into two variables called $DNA1 and $DNA2 $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA'; # Print the DNA onto the screen print "Here are the original two DNA fragments:\n\n"; print $DNA1, "\n"; print $DNA2, "\n\n"; # Concatenate the DNA fragments into a third variable and print them # Using "string interpolation" $DNA3 = "$DNA1$DNA2"; print "Here is the concatenation of the first two fragments (version 1):\n\n"; print "$DNA3\n\n"; # An alternative way using the "dot operator": # Concatenate the DNA fragments into a third variable and print them $DNA3 = $DNA1 . $DNA2; print "Here is the concatenation of the first two fragments (version 2):\n\n"; print "$DNA3\n\n"; # Print the same thing without using the variable $DNA3 print "Here is the concatenation of the first two fragments (version 3):\n\n"; print $DNA1, $DNA2, "\n"; exit;
As you can see, there are three variables here, $DNA1
, $DNA2
, and $DNA3
. I’ve added print
statements for a running commentary, so that the output of the
program that appears on the computer screen makes more sense and isn’t simply some
DNA fragments one after the other.
Here’s what the output of Example 4-2 looks like:
Here are the original two DNA fragments: ACGGGAGGACGGGAAAATTACTACGGCATTAGC ATAGTGCCGTGAGAGTGATGTAGTA Here is the concatenation of the first two fragments (version 1): ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA Here is the concatenation of the first two fragments (version 2): ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA Here is the concatenation of the first two fragments (version 3): ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Example 4-2 has many similarities to
Example 4-1. Let’s look at the
differences. To start with, the print
statements have some extra, unintuitive parts:
print $DNA1, "\n"; print $DNA2, "\n\n";
The print
statements have variables containing
the DNA, as before, but now they also have a
comma and then
"\n"
or "\n\n"
. These are instructions to print newlines. A
newline is invisible on the page or screen, but it tells
the computer to go on to the beginning of the next line for subsequent printing. One
newline, "\n"
, simply positions you at the
beginning of the next line. Two new lines, "\n\n"
, moves to the next line and then positions you at the beginning of
the line after that, leaving a blank line in between.
Look at the code for Example 4-2 and
its output to make sure that you see what these newline directives do to the output.
A blank line is a line with nothing printed on it. Depending on your operating
system, it may be just a newline character or a combination formfeed and carriage
return (in which cases, it may also be called an empty line), or it may include
nonprinting whitespace characters such as spaces and tabs. Notice that the newlines
are enclosed in double quotes, which means they are parts of strings. (Here’s one
difference between single and double quotes, as mentioned earlier: "\n"
prints a newline; '\n'
prints \n
as written.)
Notice the comma in the print
statement. A
comma separates items in a list. The print
statement prints all the items that are listed. Simple as that.
Now let’s look at the statement that concatenates the two DNA fragments $DNA1
and $DNA2
into the variable $DNA3
:
$DNA3 = "$DNA1$DNA2";
The assignment to $DNA3
is just a typical
assignment as you saw in Example 4-1, a
variable name followed by the =
sign, followed
by a value to be assigned.
The value to the right of the assignment statement is a string enclosed in double
quotes. The double quotes allow the variables in the string to be replaced with
their values. This is called string interpolation
.[2] So, in effect, the string here is just the DNA of variable $DNA1
, followed directly by the DNA of variable
$DNA2
. That concatenation of the two DNA
fragments is then assigned to variable $DNA3
.
After assigning the concatenated DNA to variable $DNA3
, you print it out, followed by a blank line:
print "$DNA3\n\n";
One of the Perl catch phrases is, “There’s more than one way to do it.” So, the next part of the program shows another way to concatenate two strings, using the dot operator. The dot operator, when placed between two strings, creates a single string that concatenates the two original strings. So the line:
$DNA3 = $DNA1 . $DNA2;
illustrates the use of this operator.
Tip
An operator in a computer language takes some arguments—in this case,
the strings $DNA1
and $DNA2
—and does something to them, returning a
value—in this case, the concatenated string placed in the variable $DNA3
. The most familiar operators from
arithmetic—plus, minus, multiply, and divide—are all operators that take two
numbers as arguments and return a number as a value.
Finally, just to exercise the different parts of the language, let’s accomplish
the same concatenation using only the print
statement:
print $DNA1, $DNA2, "\n";
Here the print
statement has three parts,
separated by commas: the two DNA fragments in the two variables and a newline. You
can achieve the same result with the following print
statement:
print "$DNA1$DNA2\n";
Maybe the Perl slogan should be, “There are more than two ways to do it.”
Before leaving this section, let’s look ahead to other uses of Perl variables.
You’ve seen the use of variables to hold strings of DNA sequence data. There are
other types of data, and programming languages need variables for them, too. In
Perl, a
scalar variable such as $DNA
can hold
a string, an integer, a floating-point number (with a decimal point), a boolean
(true
or false
) value, and more. When it’s required, Perl figures out what
kind of data is in the variable. For now, try adding the following lines to Example 4-1 or Example 4-2, storing a number in a scalar
variable and printing it
out:
$number = 17; print $number,"\n";
Transcription: DNA to RNA
A large part of what you, the Perl bioinformatics programmer, will spend your time doing amounts to variations on the same theme as Examples 4-1 and 4-2. You’ll get some data, be it DNA, proteins, GenBank entries, or what have you; you’ll manipulate the data; and you’ll print out some results.
Example 4-3 is another program that
manipulates DNA; it transcribes DNA to RNA. In the cell, this transcription of DNA
to RNA is the outcome of the workings of a delicate, complex, and error-correcting
molecular machinery.[3] Here it’s a simple substitution. When DNA is transcribed to RNA,
all the T
’s are changed to U
’s, and that’s all that our program needs to
know.[4]
#!/usr/bin/perl -w # Transcribing DNA into RNA # The DNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; # Print the DNA onto the screen print "Here is the starting DNA:\n\n"; print "$DNA\n\n"; # Transcribe the DNA to RNA by substituting all T's with U's. $RNA = $DNA; $RNA =~ s/T/U/g; # Print the RNA onto the screen print "Here is the result of transcribing the DNA to RNA:\n\n"; print "$RNA\n"; # Exit the program. exit;
Here’s the output of Example 4-3:
Here is the starting DNA: ACGGGAGGACGGGAAAATTACTACGGCATTAGC Here is the result of transcribing the DNA to RNA: ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC
This short program introduces an important part of Perl: the ability to easily manipulate text data such as a string of DNA. The manipulations can be of many different sorts: translation, reversal, substitution, deletions, reordering, and so on. This facility of Perl is one of the main reasons for its success in bioinformatics and among programmers in general.
First, the program makes a copy of the DNA, placing it in a variable called $RNA
:
$RNA = $DNA;
Note that after this statement is executed, there’s a variable called $RNA
that actually contains DNA.[5] Remember this is perfectly legal—you can call variables anything you
like—but it is potentially confusing to have inaccurate variable names. Now in this
case, the copy is preceded with informative comments and followed immediately with a
statement that indeed causes the variable $RNA
to
contain RNA, so it’s all right. Here’s a way to prevent $RNA
from containing anything except RNA:
($RNA = $DNA) =~ s/T/U/g;
In Example 4-3, the transcription happens in this statement:
$RNA =~ s/T/U/g;
There are two new items in this statement: the binding operator (=~)
and the substitute command s/T/U/g
.
The binding operator
=~
is used, obviously enough, on variables containing strings; here the variable $RNA
contains DNA sequence data. The binding operator means “apply
the operation on the right to the string in the variable on the left.”
The substitution operator
, shown in
Figure 4-1, requires a little more
explanation. The different parts of the command are separated (or delimited) by the
forward slash. First, the s
indicates this is a
substitution. After the first /
comes a T
, which represents the element in the string that
will be substituted. After the second /
comes a
U
, which represents the element that’s going
to replace the T
. Finally, after the third
/
comes g
.
This g
stands for "global” and is one of several possible modifiers that can appear in this
part of the statement. Global means “make this substitution throughout the entire
string,” that is to say, everywhere possible in the string.
Thus, the meaning of the statement is: “substitute all T
’s for U
’s in the string data
stored in the variable $RNA
.”
The substitution operator is an example of the use of regular expressions. Regular expressions are the key to text manipulation, one of the most powerful features of Perl as you’ll see in later chapters.
Using the Perl Documentation
A Perl programmer’s most important resource is the Perl documentation. It should be installed on your computer, and it may also be found on the Internet at the Perl site. The Perl documentation may come in slightly different forms on your computer system, but the web version is the same for everybody. That’s the version I refer to in this book. See the references in Appendix A for more discussion about different sources of Perl documentation.
Just to try it out, let’s look up the print
operator. First, open your web browser, and go to http://www.perl.com. Then click on the Documentation link. Select “Perl’s Builtin Functions” and then
“Alphabetical Listing of Perl’s Functions”. You’ll see a rather lengthy alphabetical
listing of Perl’s functions. Once you’ve found this page, you may want to bookmark
it in your browser, as you may find yourself turning to it frequently. Now click on
Print to view the print
operator.
Check out the examples they give to see how the language feature is actually used. This is usually the quickest way to extract what you need to know.
Once you’ve got the documentation on your screen, you may find that reading it
answers some questions but raises others. The documentation tends to give the entire
story in a concise form, and this can be daunting for beginners. For instance, the
documentation for the print
function starts
out: “Prints a string or a comma-separated list of strings. Returns TRUE if
successful.” But then comes a bunch of gibberish (or so it seems at this point in
your learning curve!) Filehandles? Output streams? List context?
All this information is necessary in documentation; after all, you need to get the whole story somewhere! Usually you can ignore what doesn’t make sense.
The Perl documentation also includes several tutorials that can be a great help in learning Perl. They occasionally assume more than a beginner’s knowledge about programming languages, but you may find them very useful. Exploring the documentation is a great way to get up to speed on the Perl language.
Calculating the Reverse Complement in Perl
As you recall from Chapter 1, a DNA polymer is composed of nucleotides. Given the close relationship between the two strands of DNA in a double helix, it turns out that it’s pretty straightforward to write a program that, given one strand, prints out the other. Such a calculation is an important part of many bioinformatics applications. For instance, when searching a database with some query DNA, it is common to automatically search for the reverse complement of the query as well, since you may have in hand the opposite strand of some known gene.
Without further ado, here’s Example 4-4, which uses a few new Perl features. As you’ll see, it first tries one method, which fails, and then tries another method, which succeeds.
#!/usr/bin/perl -w # Calculating the reverse complement of a strand of DNA # The DNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; # Print the DNA onto the screen print "Here is the starting DNA:\n\n"; print "$DNA\n\n"; # Calculate the reverse complement # Warning: this attempt will fail! # # First, copy the DNA into new variable $revcom # (short for REVerse COMplement) # Notice that variable names can use lowercase letters like # "revcom" as well as uppercase like "DNA". In fact, # lowercase is more common. # # It doesn't matter if we first reverse the string and then # do the complementation; or if we first do the complementation # and then reverse the string. Same result each time. # So when we make the copy we'll do the reverse in the same statement. # $revcom = reverse $DNA; # # Next substitute all bases by their complements, # A->T, T->A, G->C, C->G # $revcom =~ s/A/T/g; $revcom =~ s/T/A/g; $revcom =~ s/G/C/g; $revcom =~ s/C/G/g; # Print the reverse complement DNA onto the screen print "Here is the reverse complement DNA:\n\n"; print "$revcom\n"; # # Oh-oh, that didn't work right! # Our reverse complement should have all the bases in it, since the # original DNA had all the bases--but ours only has A and G! # # Do you see why? # # The problem is that the first two substitute commands above change # all the A's to T's (so there are no A's) and then all the # T's to A's (so all the original A's and T's are all now A's). # Same thing happens to the G's and C's all turning into G's. # print "\nThat was a bad algorithm, and the reverse complement was wrong!\n"; print "Try again ... \n\n"; # Make a new copy of the DNA (see why we saved the original?) $revcom = reverse $DNA; # See the text for a discussion of tr/// $revcom =~ tr/ACGTacgt/TGCAtgca/; # Print the reverse complement DNA onto the screen print "Here is the reverse complement DNA:\n\n"; print "$revcom\n"; print "\nThis time it worked!\n\n"; exit;
Here’s what the output of Example 4-4 should look like on your screen:
Here is the starting DNA: ACGGGAGGACGGGAAAATTACTACGGCATTAGC Here is the reverse complement DNA: GGAAAAGGGGAAGAAAAAAAGGGGAGGAGGGGA That was a bad algorithm, and the reverse complement was wrong! Try again ... Here is the reverse complement DNA: GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT This time it worked!
You can check if two strands of DNA are reverse complements of each other by reading one left to right, and the other right to left, that is, by starting at different ends. Then compare each pair of bases as you read the two strands: they should always be paired C to G and A to T.
Just by reading in a few characters from the starting DNA and the reverse complement DNA from the first attempt, you’ll see the that first attempt at calculating the reverse complement failed. It was a bad algorithm.
This is a taste of what you’ll sometimes experience as you program. You’ll write a program to accomplish a job and then find it didn’t work as you expected. In this case, we used parts of the language we already knew and tried to stretch them to handle a new problem. Only they weren’t quite up to the job. What went wrong?
You’ll find that this kind of experience becomes familiar: you write some code, and it doesn’t work! So you either fix the syntax (that’s usually the easy part and can be done from the clues the error messages provide), or you think about the problem some more, find why the program failed, and then try to devise a new and successful approach. Often this requires browsing the language documentation, looking for the details of how the language works and hoping to find a feature that fixes the problem. If it can be solved on a computer, you can solve it using Perl. The trick is, how exactly?
In Example 4-4, the first attempt to calculate the reverse complement failed. Each base in the string was translated as a whole, using four substitutions in a global fashion. Another way is needed. You could march though the DNA left to right, look at each base one at a time, make the change to the complement, and then look at the next base in the DNA, marching on to the end of the string. Then just reverse the string, and you’re done. In fact, this is a perfectly good method, and it’s not hard to do in Perl, although it requires some parts of the language not found until Chapter 5.
However, in this case, the tr
operator—which
stands for transliterate or translation—is exactly suited for this task. It looks
like the substitute command, with the three forward slashes separating the different
parts.
tr
does exactly what’s needed; it translates
a set of characters into new characters, all at once. Figure 4-2 shows how it works: the set
of characters to be translated are between the first two forward slashes. The set of
characters that replaces the originals are between the second and third forward
slashes. Each character in the first set is translated into the character at the
same position in the second set. For instance, in Example 4-4, C
is the second character in the first set, so it’s translated into
the second character of the second set, namely, G
. Finally, since DNA sequence data can use upper- or lowercase letters
(even though in this program the DNA is in uppercase only), both cases are included
in the tr
statement in Example 4-4.
The reverse
function also does exactly what’s
needed, with a minimum of fuss. It’s designed to reverse the
order of elements, including strings as seen
in Example 4-4.
Proteins, Files, and Arrays
So far we’ve been writing programs with DNA sequence data. Now we’ll also include the equally important protein sequence data. Here’s an overview of what is covered in the following sections:
How to use protein sequence data in a Perl program
How to read protein sequence data in from a file
Arrays in the Perl language
For the rest of the chapter, both protein and DNA sequence data are used.
Reading Proteins in Files
Programs interact with files on a computer disk. These files can be on hard disk, CD, floppy disk, Zip drive, magnetic tape—any kind of permanent storage.
Let’s take a look at how to read protein sequence data from a file. First, create a file on your computer (use your text editor) and put some protein sequence data into it. Call the file NM_021964fragment.pep (you can download it from this book’s web site). You will be using the following data (part of the human zinc finger protein NM_021964):
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
You can use any name for the file, except one that’s already in use in the same folder.
Just as well-chosen variable names can be critical to understanding a program, well-chosen file and folder names can also be critical. If you have a project that generates lots of computer files, you need to carefully consider how to name and organize the files and folders. This is as true for individual researchers as for large, multi-national teams. It’s important to put some effort into assigning informative names to files.
The filename NM_021964fragment.pep is taken from the GenBank ID of the record where this protein is found. It also indicates the fragmentary nature of the data and contains the filename extension .pep to remind you that the file contains peptide or protein sequence data. Of course, some other scheme might work better for you; the point is to get some idea of what’s in the file without having to look into it.
Now that you’ve created or downloaded a file with protein sequence data in it, let’s develop a program that reads the protein sequence data from the file and stores it into a variable. Example 4-5 shows a first attempt, which will be added to as we progress.
#!/usr/bin/perl -w # Reading protein sequence data from a file # The filename of the file containing the protein sequence data $proteinfilename = 'NM_021964fragment.pep'; # First we have to "open" the file, and associate # a "filehandle" with it. We choose the filehandle # PROTEINFILE for readability. open(PROTEINFILE, $proteinfilename); # Now we do the actual reading of the protein sequence data from the file, # by using the angle brackets < and > to get the input from the # filehandle. We store the data into our variable $protein. $protein = <PROTEINFILE>; # Now that we've got our data, we can close the file. close PROTEINFILE; # Print the protein onto the screen print "Here is the protein:\n\n"; print $protein; exit;
Here’s the output of Example 4-5:
Here is the protein: MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
Notice that only the first line of the file prints out. I’ll show why in a moment.
Let’s look at Example 4-5 in more
detail. After putting a filename into the variable $proteinfilename
, the file is opened with the following statement:
open(PROTEINFILE, $proteinfilename);
After opening the file, you can do various things with it, such as reading,
writing, searching, going to a specific location in the file, erasing everything in
the file, and so on. Notice that the program assumes the file named in the variable
$proteinfilename
exists and can be opened.
You’ll see in a little bit how to check for that, but here’s something to try:
change the name of the filename in $proteinfilename
so that there’s no file of that name on your
computer, and then run the program. You’ll get some error messages if the file
doesn’t exist.
If you look at the documentation for the open
function, you’ll see many options. Mostly, they enable you to specify
exactly what the file will be used for after it’s opened.
Let’s examine the term PROTEINFILE
, which is
called a filehandle
. With filehandles, it’s not important to understand what they really
are. They’re just things you use when you’re dealing with files. They don’t have to
have capital letters, but it’s a widely followed convention. After the open
statement assigns a filehandle, all the
interaction with a file is done by naming the filehandle.
The data is actually read in to the program with the statement:
$protein = <PROTEINFILE>;
Why is the filehandle PROTEINFILE
enclosed
within
angle brackets? These angle brackets are called input
operators; a filehandle within angle brackets is how you bring in
data from some source outside the program. Here, we’re reading the file called
NM_021964fragment.pep whose name is stored
in variable $proteinfilename
, and which has a
filehandle associated with it by the open
statement. The data is being stored in the variable $protein
and then printed out.
However, as you’ve already noticed, only the first line of this multiline file is printed out. Why? Because there are a few more things to learn about reading in files.
There are several ways to read in a whole file. Example 4-6 shows one way.
#!/usr/bin/perl -w # Reading protein sequence data from a file, take 2 # The filename of the file containing the protein sequence data $proteinfilename = 'NM_021964fragment.pep'; # First we have to "open" the file, and associate # a "filehandle" with it. We choose the filehandle # PROTEINFILE for readability. open(PROTEINFILE, $proteinfilename); # Now we do the actual reading of the protein sequence data from the file, # by using the angle brackets < and > to get the input from the # filehandle. We store the data into our variable $protein. # # Since the file has three lines, and since the read only is # returning one line, we'll read a line and print it, three times. # First line $protein = <PROTEINFILE>; # Print the protein onto the screen print "\nHere is the first line of the protein file:\n\n"; print $protein; # Second line $protein = <PROTEINFILE>; # Print the protein onto the screen print "\nHere is the second line of the protein file:\n\n"; print $protein; # Third line $protein = <PROTEINFILE>; # Print the protein onto the screen print "\nHere is the third line of the protein file:\n\n"; print $protein; # Now that we've got our data, we can close the file. close PROTEINFILE; exit;
Here’s the output of Example 4-6:
Here is the first line of the protein file: MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD Here is the second line of the protein file: SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ Here is the third line of the protein file: GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
The interesting thing about this program is that it shows how reading from a file
works. Every time you read into a scalar variable such as $protein
, the next line of the file is read. Something is remembering
where the previous read was and is picking it up from there.
On the other hand, the drawbacks of this program are obvious. Having to write a few lines of code for each line of an input file isn’t convenient. However, there are two Perl features that can handle this nicely: arrays (in the next section) and loops (in Chapter 5).
Arrays
In computer languages an array is a variable that stores multiple scalar values. The values can be numbers, strings, or, in this case, lines of an input file of protein sequence data. Let’s examine how they can be used. Example 4-7 shows how to use an array to read all the lines of an input file.
#!/usr/bin/perl -w # Reading protein sequence data from a file, take 3 # The filename of the file containing the protein sequence data $proteinfilename = 'NM_021964fragment.pep'; # First we have to "open" the file open(PROTEINFILE, $proteinfilename); # Read the protein sequence data from the file, and store it # into the array variable @protein @protein = <PROTEINFILE>; # Print the protein onto the screen print @protein; # Close the file. close PROTEINFILE; exit;
Here’s the output of Example 4-7:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
which, as you can see, is exactly the data that’s in the file. Success!
The convenience of this is clear—just one line to read all the data into the program.
Notice that the array variable starts with an at sign (@) rather than the
dollar sign ($) scalar variables begin with. Also notice that the
print
function can handle arrays as well as scalar variables. Arrays are used
a lot in Perl, so you will see plenty of array examples as the book
continues.
An array is a variable that can hold many scalar values. Each item or element is a
scalar value that can be referenced by giving its position in the array (its
subscript or offset). Let’s look at some examples of arrays and their most common
operations. We’ll define an array @bases
that
holds the four bases A, C, G, and T. Then we’ll apply some of the most common array
operators.
Here’s a piece of code that demonstrates how to initialize an array and how to use subscripts to access the individual elements of an array:
# Here's one way to declare an array, initialized with a list of four scalar values. @bases = ('A', 'C', 'G', 'T'); # Now we'll print each element of the array print "Here are the array elements:"; print "\nFirst element: "; print $bases[0]; print "\nSecond element: "; print $bases[1]; print "\nThird element: "; print $bases[2]; print "\nFourth element: "; print $bases[3];
This code snippet prints out:
First element: A Second element: C Third element: G Fourth element: T
You can print the elements one after another like this:
@bases = ('A', 'C', 'G', 'T'); print "\n\nHere are the array elements: "; print @bases;
which produces the output:
Here are the array elements: ACGT
You can also print the elements separated by spaces (notice the double quotes in
the print
statement):
@bases = ('A', 'C', 'G', 'T'); print "\n\nHere are the array elements: "; print "@bases";
which produces the output:
Here are the array elements: A C G T
You can take an element off the end of an array with pop
:
@bases = ('A', 'C', 'G', 'T'); $base1 = pop @bases; print "Here's the element removed from the end: "; print $base1, "\n\n"; print "Here's the remaining array of bases: "; print "@bases";
which produces the output:
Here's the element removed from the end: T Here's the remaining array of bases: A C G
You can take a base off of the beginning of the array with shift
:
@bases = ('A', 'C', 'G', 'T'); $base2 = shift @bases; print "Here's an element removed from the beginning: "; print $base2, "\n\n"; print "Here's the remaining array of bases: "; print "@bases";
which produces the output:
Here's an element removed from the beginning: A Here's the remaining array of bases: C G T
You can put an element at the beginning of the array with unshift
:
@bases = ('A', 'C', 'G', 'T'); $base1 = pop @bases; unshift (@bases, $base1); print "Here's the element from the end put on the beginning: "; print "@bases\n\n";
which produces the output:
Here's the element from the end put on the beginning: T A C G
You can put an element on the end of the array with push
:
@bases = ('A', 'C', 'G', 'T'); $base2 = shift @bases; push (@bases, $base2); print "Here's the element from the beginning put on the end: "; print "@bases\n\n";
which produces the output:
Here's the element from the beginning put on the end: C G T A
@bases = ('A', 'C', 'G', 'T'); @reverse = reverse @bases; print "Here's the array in reverse: "; print "@reverse\n\n";
which produces the output:
Here's the array in reverse: T G C A
You can get the length of an array:
@bases = ('A', 'C', 'G', 'T'); print "Here's the length of the array: "; print scalar @bases, "\n";
which produces the output:
Here's the length of the array: 4
Here’s how to insert an element at an arbitrary place in an
array using the Perl splice
function:
@bases = ('A', 'C', 'G', 'T'); splice ( @bases, 2, 0, 'X'); print "Here's the array with an element inserted after the 2nd element: "; print "@bases\n";
which produces the output:
Here's the array with an element inserted after the 2nd element: A C X G T
Scalar and List Context
Many Perl operations behave differently depending on the context in which they are used. Perl has scalar context and list context; both are demonstrated in Example 4-8.
#!/usr/bin/perl -w # Demonstration of "scalar context" and "list context" @bases = ('A', 'C', 'G', 'T'); print "@bases\n"; $a = @bases; print $a, "\n"; ($a) = @bases; print $a, "\n"; exit;
Here’s the output of Example 4-8:
A C G T 4 A
First, Example 4-8 declares an array
of the four bases. Then the assignment statement tries to assign an array (which is a kind of list)
to a scalar variable $a
:
$a = @bases;
In this kind of scalar context , an array evaluates to the size of the array, that is, the number of elements in the array. The scalar context is supplied by the scalar variable on the left side of the statement.
Next, Example 4-8 tries to assign an
array (to repeat, a kind of list) to another list, in this case, having just one
variable, $a
:
($a) = @bases;
In this kind of list context , an array evaluates to a list of its elements. The list context is supplied by the list in parentheses on the left side of the statement. If there aren’t enough variables on the left side to assign to, only part of the array gets assigned to variables. This behavior of Perl pops up in many situations; by design, many features of Perl behave differently depending on whether they are in scalar or list context. See Appendix B for more about scalar and list content.
Now you’ve seen the use of strings and arrays to hold sequence and file data, and learned the basic syntax of Perl, including variables, assignment, printing, and reading files. You’ve transcribed DNA to RNA and calculated the reverse complement of a strand of DNA. By the end of Chapter 5, you’ll have covered the essentials of Perl programming.
Exercises
- Exercise 4.1
Explore the sensitivity of programming languages to errors of syntax. Try removing the semicolon from the end of any statement of one of our working programs and examining the error messages that result, if any. Try changing other syntactical items: add a parenthesis or a curly brace; misspell “print” or some other reserved word; just type in, or delete, anything. Programmers get used to seeing such errors; even after getting to know the language well, it is still common to have some syntax errors as you gradually add code to a program. Notice how one error can lead to many lines of error reporting. Is Perl accurately reporting the line where the error is?
- Exercise 4.2
Write a program that stores an integer in a variable and then prints it out.
- Exercise 4.3
Write a program that prints DNA (which could be in upper- or lowercase originally) in lowercase (acgt); write another that prints the DNA in uppercase (ACGT). Use the function
tr///
.- Exercise 4.4
Do the same thing as Exercise 4.3, but use the string directives
\U
and\L
for upper- and lowercase. For instance,print
"\U$DNA"
prints the data in$DNA
in uppercase.- Exercise 4.5
Sometimes information flows from RNA to DNA. Write a program to reverse transcribe RNA to DNA.
- Exercise 4.6
Read two files of data, and print the contents of the first followed by the contents of the second.
- Exercise 4.7
This is a more difficult exercise. Write a program to read a file, and then print its lines in reverse order, the last line first. You may want to look up the functions
push
,pop
,shift
, andunshift
, and choose one or more of them to accomplish this exercise. You may want to look ahead to Chapter 5 so you can use a loop in this program, but this may not be necessary depending on the approach you take. Or, you may want to usereverse
on an array of lines.
[1] A new character encoding called Unicode, which can handle all the symbols in all the world’s languages, is becoming widely accepted and is supported by Perl as well.
[2] There are occasions when you might add curly braces during string interpolation. The extra curly braces
make sure the variable names aren’t confused with anything else in the
double-quoted string. For example, if you had variable $prefix
and tried to interpolate it into the
string I am
$prefixinterested
, Perl might not
recognize the variable, confusing it with a nonexistent variable $prefixinterested
. But the string I
am ${prefix}interested
is unambiguous to
Perl.
[3] Briefly, the coding DNA strand is the reverse complement of the other strand, which is used as a template to synthesize its reverse complement as RNA, with T’s replaced as U’s. With the two reverse complements, this is the same as the coding strand with the T→U replacement.
[4] We’re ignoring the mechanism of the splicing out of introns, obviously.
The T
stands for thymine; the U
stands for uracil.
[5] Recall the discussion in Section 4.2.4.3 about the importance of the order of the parts
in an assignment statement. Here, the value of $DNA
, that is, the DNA sequence data that has been stored in
the $DNA
variable, is being assigned to
the variable $RNA
. If you had written
$DNA = $RNA;
, the value of the
$RNA
variable (which is empty) would
have been assigned to the $DNA
variable,
in effect wiping out the DNA sequence data in that variable and leaving two
empty variables.
Get Beginning Perl for Bioinformatics now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.