Search the Catalog
Beginning Perl for Bioinformatics

Beginning Perl for Bioinformatics

James Tisdall
October 2001
0-596-00080-4, Order Number: 0804
384 pages, $39.95

Chapter 10
GenBank

GenBank (Genetic Sequence Data Bank) is a rapidly growing international repository of known genetic sequences from a variety of organisms. Its use is central to modern biology and to bioinformatics.

This chapter shows you how to write Perl programs to extract information from GenBank files and libraries. Exercises include looking for patterns; creating special libraries; and parsing the flat-file format to extract the DNA, annotation, and features. You will learn how to make a DBM database to create your own rapid-access lookups on selected data in a GenBank library.

Perl is a great tool for dealing with GenBank files. It enables you to extract and use any of the detailed data in the sequence and in the annotation, such as in the FEATURES table and elsewhere. When I first started using Perl, I wrote a program that searched GenBank for all sequence records annotated as being located on human chromosome 22. I found many genes where that information was so deeply buried within the annotation, that the major gene mapping database, Genome Database (GDB), hadn't included them in their chromosome map. I think you'll discover the same feeling of power over the information when you start applying Perl to GenBank files.

Most biologists are familiar with GenBank. Researchers can perform a search, e.g., a BLAST search on some query sequence, and collect a set of GenBank files of related sequences as a result. Because the GenBank records are maintained by the individual scientists who discovered the sequences, if you find some new sequence of interest, you can publish it in GenBank.

GenBank files have a great deal of information in them in addition to sequence data, including identifiers such as accession numbers and gene names, phylogenetic classification, and references to published literature. A GenBank file may also include a detailed FEATURES table that summarizes facts about the sequence, such as the location of the regulatory regions, the protein translation, and exons and introns.

GenBank is sometimes referred to as a databank or data store, which is different from a database. Databases typically have a relational structure imposed upon the data, including associated indices and links and a query language. GenBank in comparison is a flat file, that is, an ASCII text file that is easily readable by humans.[1]

From its humble beginnings GenBank has rapidly grown, and the flat-file format has seen signs of strain during the growth. With a quickly advancing body of knowledge, especially one that's growing as quickly as genetic data, it's difficult for the design of a databank to keep up. Several reworkings of GenBank have been done, but the flat-file format--in all its frustrating glory--still remains.

Due to a certain flexibility in the content of some sections of a GenBank record, extracting the information you're looking for can be tricky. This flexibility is good, in that it allows you to put what you think is most important into the data's annotation. It's bad, because that same flexibility makes it harder to write programs that to find and extract the desired annotations. As a result, the trend has been towards more structure in the annotations.

Since Perl's data structures and its use of regular expressions make it a good tool for manipulating flat files, Perl is especially well-suited to deal with GenBank data. Using these features in Perl and building on the skills you've developed from previous chapters, you can write programs to access the accumulated genetic knowledge of the scientific community in GenBank.

Since this is a beginning book that requires no programming experience, you should not expect to find the most finished, multipurpose software here. Instead you'll find a solid introduction to parsing and building fast lookup tables for GenBank files. If you've never done so, I strongly recommend you explore the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) (http://www.ncbi.nlm.nih.gov). While you're at it, stop by the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk and the bioinformatics arm of the European Molecular Biology Laboratory (EMBL) at http://www.embl-heidelberg.de/. These are large, heavily funded governmental bioinformatics powerhouses, and they have (and distribute) a great deal of state-of-the-art bioinformatics software.

GenBank Files

The primary repositories for genetic information are the NCBI GenBank, EMBL in Europe, and the DNA Data Bank of Japan (DDBJ). All have almost identical information due to international cooperative agreements. Each entry or record in GenBank or its mirror sites may contain identifying, descriptive, and genetic information in ASCII-format files. Each record is written in a specific standard format, organized so that both humans and computer programs can extract the desired information with reasonable ease.

Let's look at a relatively short GenBank record and at how the fields are defined, before writing any code. I'll save this information in a file called record.gb, for use in later programs.

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN      
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
      301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
      361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
      421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
      481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
      541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
      601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
      661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
      721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
      781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
      841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
      901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
      961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
     1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
     1081 cctctggatc ctgacctgta tcaggacttc tgtgcagggg cctttgatga ccatggcctg
     1141 ccctggatga gcgacacaga agagtcccca ttcctggacc ccgcgctgcg gaagagggca
     1201 gtgaaagtga agcatgtgaa gcgtcgggag aagaagtctg agaagaagaa ggaggagcga
     1261 tacaagcggc atcggcagaa gcagaagcac aaggataaat ggaaacaccc agagagggct
     1321 gatgccaagg accctgcgtc actgccccag tgcctggggc ccggctgtgt gcgccccgcc
     1381 cagcccagct ccaagtattg ctcagatgac tgtggcatga agctggcagc caaccgcatc
     1441 tacgagatcc tcccccagcg catccagcag tggcagcaga gcccttgcat tgctgaagag
     1501 cacggcaaga agctgctcga acgcattcgc cgagagcagc agagtgcccg cactcgcctt
     1561 caggaaatgg aacgccgatt ccatgagctt gaggccatca ttctacgtgc caagcagcag
     1621 gctgtgcgcg aggatgagga gagcaacgag ggtgacagtg atgacacaga cctgcagatc
     1681 ttctgtgttt cctgtgggca ccccatcaac ccacgtgttg ccttgcgcca catggagcgc
     1741 tgctacgcca agtatgagag ccagacgtcc tttgggtcca tgtaccccac acgcattgaa
     1801 ggggccacac gactcttctg tgatgtgtat aatcctcaga gcaaaacata ctgtaagcgg
     1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag tgccagctga cgaggtatgc
     1921 gggtgccccc ttgtacgtga tgtctttgag ctcacgggtg acttctgccg cctgcccaag
     1981 cgccagtgca atcgccatta ctgctgggag aagctgcggc gtgcggaagt ggacttggag
     2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc aggagcgcaa tgtgcgcaca
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

Even if you're used to seeing GenBank files, it's worth taking the time to look one over, while considering how you would write a program to extract various parts of the data. For instance, how would you extract the sequence data? What's the format of the FEATURES table and its various subfields?

There's a lot of information packed into a typical GenBank entry, and it's important to be able to separate the different parts. For instance, if you can extract the sequence, you can search for motifs, calculate statistics on the sequence, look for similarity with other sequences, and so forth. Similarly, you'll want to separate out--or parse--the various parts of the data annotation. In GenBank, this includes ID numbers, gene names, genus and species, publications, etc. The FEATURES table part of the annotation can include specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on.

The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.

gbrel.txt gives complete detail about the structure of GenBank files to help programmers, so you may want to refer to it as your searches become more complex. As a Perl programmer, you won't need all of the detail because you can parse data using regular expressions or the split function. You need to get the data out and make it available to your programs. The code that accomplishes this task can be fairly simple, as you will see in this chapter.

GenBank Libraries

GenBank is distributed as a set of libraries--flat files containing many records in succession.[2] As of GenBank release 125.0, August 2001, there are 243 files, most of which are over 200 MB in size. Altogether, GenBank contains 12,813516 loci and 13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually distributed compressed, which means you can download somewhat smaller files, but you need to uncompress them after you received them. Uncompressed, this amounts to about 50 GB of data. Since 1982, the number of sequences in GenBank has doubled about every 14 months.

GenBank libraries are further organized into divisions by the classification of the sequences they contain, either phylogenetically or by sequencing technology. Here are the divisions:

Some divisions are very large: the largest, the EST, or expressed sequence tag division, is comprised of 123 library files! A portion of human DNA is stored in the PRI division, which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. Human data is also stored in the STS, GSS, HTGS, and HTC divisions. Human data alone in GenBank makes up almost 5 million record entries with over 8 trillion bases of sequence.

The public database servers such as Entrez or BLAST at http://www.ncbi.nlm.nih.gov/ give you access to well-maintained and updated sequence data and programs, but many researchers find that they need to write their own programs to manipulate and analyze the data. The problem is, there's so much data. For many purposes, you can download a selected set of records from NCBI or other locations, but sometimes you need the whole dataset.

It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data onto your hard drive, however, is more difficult. A Perl program called mirror.pl helps to address this need. Downloading it, even with a university-standard, high-speed Internet connection can be time-consuming; downloading an entire dataset with a modem can be an exercise in frustration. The best solution is to download only the files you need, in compressed form. The EST data, for example, is about half the entire database; don't download it unless you really need to. If you need to download GenBank, I recommend contacting the help desk at NCBI. They'll help you get the most up-to-date information.

Since you're learning to program, it makes more sense to practice on a tiny, five-record library file, but the programs you'll write will work just fine on the real files.

Separating Sequence and Annotation

In previous chapters you saw how to examine the lines of a file using Perl's array operations. Usually, you do this by saving the data in an array with each appearing as an element of the array.

Let's look at two methods to extract the annotation and the DNA from a GenBank file. In the first method, you'll slurp the file into an array and look through the lines, as in previous programs. In the second, you'll put the whole GenBank record into a scalar variable and use regular expressions to parse the information. Is one approach better than the other? Not necessarily: it depends on the data. There are advantages and disadvantages to each, but both get the job done.

I've put five GenBank records in a file called library.gb. As before, you can download the file from this book's web site. You'll use this datafile and the file record.gb in the next few examples.

Using Arrays

Example 10-1 shows the first method, which operates on an array containing the lines of the GenBank record. The main program is followed by a subroutine that does the real work.

Example 10-1: Extract annotation and sequence from GenBank file

#!/usr/bin/perl
# Extract annotation and sequence from GenBank file
 
use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# declare and initialize variables
my @annotation = (  );
my $sequence = '';
my $filename = 'record.gb';
 
parse1(\@annotation, \$sequence, $filename);
 
# Print the annotation, and then
#   print the DNA in new format just to check if we got it okay.
print @annotation;
 
print_sequence($sequence, 50);
 
exit;
 
################################################################################
# Subroutine
################################################################################
 
# parse1
#
# --parse annotation and sequence from GenBank record
 
sub parse1 {
 
    my($annotation, $dna, $filename) = @_;
 
    # $annotation--reference to array
    # $dna       --reference to scalar
    # $filename  --scalar
    
    # declare and initialize variables
    my $in_sequence = 0; 
    my @GenBankFile = (  );
    
    # Get the GenBank data into an array from a file
    @GenBankFile = get_file_data($filename);
    
    # Extract all the sequence lines
    foreach my $line (@GenBankFile) {
 
        if( $line =~ /^\/\/\n/ ) { # If $line is end-of-record line //\n,
            last; #break out of the foreach loop.
        } elsif( $in_sequence) { # If we know we're in a sequence,
            $$dna .= $line; # add the current line to $$dna.
        } elsif ( $line =~ /^ORIGIN/ ) { # If $line begins a sequence,
            $in_sequence = 1; # set the $in_sequence flag.
        } else{ # Otherwise
            push( @$annotation, $line); # add the current line to @annotation.
        }
    }
    
    # remove whitespace and line numbers from DNA sequence
    $$dna =~ s/[\s0-9]//g;
}

Here's the beginning and end of Example 10-1's output of the sequence data:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
...
cggtgcccgtgtgtccgttcctccactcatctgtttctccggttctccct
gtgcccatccaccggttgaccgcccatctgcctttatcagagggactgtc
cccgtcgacatgttcagtgcctggtggggctgcggagtccactcatcctt
gcctcctctccctgggttttgttaataaaattttgaagaaaccaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

The foreach loop in subroutine parse1 in Example 10-1 moves one by one through the lines from the GenBank file stored in the array @GenBankFile. It takes advantage of the structure of a GenBank file, which begins with annotation and runs until the line:

ORIGIN

is found, after which sequence appears until the end-of-record line // is reached. The loop uses a flag variable $in_sequence to remember that it has found the ORIGIN line and is now reading sequence lines.

The foreach loop has a new feature: the Perl built-in function last, which breaks out of the nearest enclosing loop. It's triggered by the end-of-record line //, which is reached when the entire record has been seen.

A regular expression is used to find the end-of-record line. To correctly match the end-of-record (forward) slashes, you must escape them by putting a backslash in front of each one, so that Perl doesn't interpret them as prematurely ending the pattern. The regular expression also ends with a newline \/\/\n, which is then placed inside the usual delimiters: /\/\/\n/. (When you have a lot of forward slashes in a regular expression, you can use another delimiter around the regular expression and precede it with an m, thus avoiding having to backslash the forward slashes. It's done like so: m!//\n!).

An interesting point about subroutine parse1 is the order of the tests in the foreach loop that goes through the lines of the GenBank record. As you read through the lines of the record, you want to first gather the annotation lines, set a flag when the ORIGIN start-of-sequence line is found, and then collect the lines until the end-of-record // line is found.

Notice that the order of the tests is exactly the opposite. First, you test for the end-of-record line, collect the sequence if the $in_sequence flag is set, and then test for the start-of-sequence ORIGIN line. Finally, you collect the annotation.

The technique of reading lines one by one and using flag variables to mark what section of the file you're in, is a common programming technique. So, take a moment to think about how the loop would behave if you changed the order of the tests. If you collected sequence lines before testing for the end-of-record, you'd never get to the end-of-record test!

Other methods of collecting annotation and sequence lines are possible, especially if you go through the lines of the array more than once. You can scan through the array, keeping track of the start-of-sequence and end-of-record line numbers, and then go back and extract the annotation and sequence using an array splice (which was described in the parseREBASE subroutine in Example 9-2). Here's an example:

# find line numbers of ORIGIN and // in the GenBank record
 
$linenumber = 0;
foreach my $line (@GenBankFile) {
     if ( $line =~ /^//\n/ ) {  # end-of-record // line
         $end = $linenumber;
         last;
     } elsif ( $line =~ /^ORIGIN/ ) { # end annotation, begin sequence
         $origin = $linenumber;
     }
     $linenumber++;
}
 
# extract annotation and sequence with "array splice"
 
@annotation = @GenBankFile[0..($origin-1)];
@sequence   = @GenBankFile[($origin+1)..($end-1)];
</programlisting>

Using Scalars

A second way to separate annotations from sequences in GenBank records is to read the entire record into a scalar variable and operate on it with regular expressions. For some kinds of data, this can be a more convenient way to parse the input (compared to scanning through an array, as in Example 10-1).

Usually string data is stored one line per scalar variable with its newlines, if any, at the end of the string. Sometimes, however, you store several lines concatenated together in one string that is, in turn, stored in a single scalar variable. These multiline strings aren't uncommon; you used them to gather the sequence from a FASTA file in Examples and . Regular expressions have pattern modifiers that can be used to make multiline strings with their embedded newlines easy to use.

Pattern modifiers

The pattern modifiers we've used so far are /g, for global matching, and /i, for case-insensitive matching. Let's take a look at two more that affect the way regular expressions interact with the newlines in scalars.

Recall that previous regular expressions have used the caret (^), dot (.), and dollar sign ($) metacharacters. The ^ anchors a regular expression to the beginning of a string, by default, so that /^THE BEGUINE/ matches a string that begins with "THE BEGUINE". Similarly, $ anchors an expression to the end of the string, and the dot (.) matches any character except a newline.

The following pattern modifiers affect these three metacharacters:

Examples of pattern modifiers

Here's an example of the default behavior of caret (^), dot (.), and dollar sign ($):

use warnings;
"AAC\nGTT" =~ /^.*$/;
print $&, "\n";

This demonstrates the default behavior without the /m or /s modifiers and prints the warning:

Use of uninitialized value in print statement at line 3.

The print statement tries to print $&, a special variable that is always set to the last successful pattern match. This time, since the pattern doesn't match, the variable $& isn't set, and you get a warning message for attempting to print an uninitialized value.

Why doesn't the match succeed? First, let's examine the ^.*$ pattern. It begins with a ^, which means it must match from the beginning of the string. It ends with a $, which means it must also match at the end of the string (the end of the string may contain a single newline, but no other newlines are allowed). The .* means it must match zero or more (*) of any characters (.) except the newline. So, in other words, the pattern ^.*$ matches any string that doesn't contain a newline except for a possible single newline as the last character. But since the string in question, "ACC\nGTT" does contain an embedded newline \n that isn't the last character, the pattern match fails.

In the next examples, the pattern modifiers /m and /s change the default behaviors for the metacharacters ^, and $, and the dot:

"AAC\nGTT" =~ /^.*$/m;
print $&, "\n";

This snippet prints out AAC and demonstrates the /m modifier. The /m extends the meaning of the ^ and the $ so they also match around embedded newlines. Here, the pattern matches from the beginning of the string up to the first embedded newline.

The next snippet of code demonstrates the /s modifier:

"AAC\nGTT" =~ /^.*$/s;
print $&, "\n";

which produces the output:

AAC
GTT

The /s modifier changes the meaning of the dot metacharacter so that it matches any character including newlines. With the /s modifier, the pattern matches everything from the beginning of the string to the end of the string, including the newline. Notice when it prints, it prints the embedded newline.

Separating annotations from sequence

Now that you've met the pattern-matching modifiers and regular expressions that will be your main tools for parsing a GenBank file as a scalar, let's try separating the annotations from the sequence.

The first step is to get the GenBank record stored as a scalar variable. Recall that a GenBank record starts with a line beginning with the word "LOCUS" and ends with the end-of-record separator: a line containing two forward slashes.

First you want to read a GenBank record and store it in a scalar variable. There's a device called an input record separator denoted by the special variable $/ that lets you do exactly that. The input record separator is usually set to a newline, so each call to read a scalar from a filehandle gets one line. Set it to the GenBank end-of-record separator like so:

$/ = "//\n";

A call to read a scalar from a filehandle takes all the data up to the GenBank end-of-record separator. So the line $record = <GBFILE> in Example 10-2 stores the multiline GenBank record into the scalar variable $record. Later, you'll see that you can keep repeating this call in order to read in successive GenBank records from a GenBank library file.

After reading in the record, you'll parse it into the annotation and sequence parts making use of /s and /m pattern modifiers. Extracting the annotation and sequence is the easy part; parsing the annotation will occupy most of the remainder of the chapter.

Example 10-2: Extract annotation and sequence from Genbank record

#!/usr/bin/perl
# Extract the annotation and sequence sections from the first
#   record of a GenBank library
 
use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# Declare and initialize variables
my $annotation = '';
my $dna = '';
my $record = '';
my $filename = 'record.gb';
my $save_input_separator = $/;
 
# Open GenBank library file
unless (open(GBFILE, $filename)) {
    print "Cannot open GenBank file \"$filename\"\n\n";
    exit;
}
 
# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";
 
$record = <GBFILE>;
 
# reset input separator 
$/ = $save_input_separator;
 
# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
 
# Print the two pieces, which should give us the same as the
#  original GenBank file, minus the // at the end
print $annotation, $dna;
 
exit;

The output from this program is the same as the GenBank file listed previously, minus the last line, which is the end-of-record separator //.

Let's focus on the regular expression that parses the annotation and sequence out of the $record variable. This is the most complicated regular expression so far:

$record = /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s.

There are two pairs of parentheses in the regular expression: (LOCUS.*ORIGIN\s*\n) and (.*). The parentheses are metacharacters whose purpose is to remember the parts of the data that match the pattern within the parentheses, namely, the annotation and the sequence. Also note that the pattern match returns an array whose elements are the matched parenthetical patterns. After you match the annotation and the sequence within the pairs of parentheses in the regular expression, you simply assign the matched patterns to the two variables $annotation and $dna, like so:

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

Notice that at the end of the pattern, we've added the /s pattern matching modifier, which, as you've seen earlier, allows a dot to match any character including an embedded newline. (Of course, since we've got a whole GenBank record in the $record scalar, there are a lot of embedded newlines.)

Next, look at the first pair of parentheses:

(LOCUS.*ORIGIN\s*\n)

This whole expression is anchored at the beginning of the string by preceding it with a ^ metacharacter. (/s doesn't change the meaning of the ^ character in a regular expression.)

Inside the parentheses, you match from where the string LOCUS appears at the beginning of the GenBank record, followed by any number of characters including newlines with .*, followed by the string ORIGIN, followed by possibly some whitespace with \s*, followed by a newline \n. This matches the annotation part of the GenBank record.

Now, look at the second parentheses and the remainder:

(.*)\/\/\n

This is easier. The .* matches any character, including newlines because of the /s pattern modifier at the end of the pattern match. The parentheses are followed by the end-of-record line, //, including the newline at the end, with the slashes preceded by backslashes to show that you want to match them exactly. They're not delimiters of the pattern matching operator. The end result is the GenBank record with the annotation and the sequence separated into the variables $annotation and $sequence. Although the regular expression I used requires a bit of explanation, the attractive thing about this approach is that it took only one line of Perl code to extract both annotation and sequence.

Parsing Annotations

Now that you've successfully extracted the sequence, let's look at parsing the annotations of a GenBank file.

Looking at a GenBank record, it's interesting to think about how to extract the useful information. The FEATURES table is certainly a key part of the story. It has considerable structure: what should be preserved, and what is unnecessary? For instance, sometimes you just want to see if a word such as "endonuclease" appears anywhere in the record. For this, you just need a subroutine that searches for any regular expression in the annotation. Sometimes this is enough, but when detailed surgery is necessary, Perl has the necessary tools to make the operation successful.

Using Arrays

Example 10-3 parses a few pieces of information from the annotations in a GenBank file. It does this using the data in the form of an array.

Example 10-3: Parsing GenBank annotations using arrays

#!/usr/bin/perl
# Parsing GenBank annotations using arrays
 
use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# Declare and initialize variables
my @genbank = (  );
my $locus = '';
my $accession = '';
my $organism = '';
 
# Get GenBank file data
@genbank = get_file_data('record.gb');
 
# Let's start with something simple.  Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.
 
for my $line (@genbank) {
  if($line =~ /^LOCUS/) {
    $line =~ s/^LOCUS\s*//;
    $locus = $line;
  }elsif($line =~ /^ACCESSION/) {
    $line =~ s/^ACCESSION\s*//;
    $accession = $line;
  }elsif($line =~ /^  ORGANISM/) {
    $line =~ s/^\s*ORGANISM\s*//;
    $organism = $line;
  }
}
 
print "*** LOCUS ***\n";
print $locus;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;
 
exit;

Here's the output from Example 10-3:

*** LOCUS ***
AB031069     2487 bp    mRNA            PRI       27-MAY-2000
*** ACCESSION ***
AB031069
*** ORGANISM ***
Homo sapiens

Now let's slightly extend that program to handle the DEFINITION field. Notice that the DEFINITION field can extend over more than one line. To collect that field, use a trick you've already seen in Example 10-1: set a flag when you're in the "state" of collecting a definition. The flag variable is called, unsurprisingly, $flag.

Example 10-4: Parsing GenBank annotations using arrays, take 2

#!/usr/bin/perl
# Parsing GenBank annotations using arrays, take 2
 
use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# Declare and initialize variables
my @genbank = (  );
my $locus = '';
my $accession = '';
my $organism = '';
my $definition = '';
my $flag = 0;
 
# Get GenBank file data
@genbank = get_file_data('record.gb');
 
# Let's start with something simple.  Let's get some of the identifying
# information, let's say the locus and accession number (here the same
# thing) and the definition and the organism.
 
for my $line (@genbank) {
  if($line =~ /^LOCUS/) {
    $line =~ s/^LOCUS\s*//;
    $locus = $line;
  }elsif($line =~ /^DEFINITION/) {
    $line =~ s/^DEFINITION\s*//;
    $definition = $line;
    $flag = 1;
  }elsif($line =~ /^ACCESSION/) {
    $line =~ s/^ACCESSION\s*//;
    $accession = $line;
    $flag = 0;
  }elsif($flag) {
    chomp($definition);
    $definition .= $line;
  }elsif($line =~ /^  ORGANISM/) {
    $line =~ s/^\s*ORGANISM\s*//;
    $organism = $line;
  }
}
 
print "*** LOCUS ***\n";
print $locus;
print "*** DEFINITION ***\n";
print $definition;
print "*** ACCESSION ***\n";
print $accession;
print "*** ORGANISM ***\n";
print $organism;
 
exit;

Example 10-4 outputs:

*** LOCUS ***
AB031069     2487 bp    mRNA            PRI       27-MAY-2000
*** DEFINITION ***
Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
*** ACCESSION ***
AB031069
*** ORGANISM ***
Homo sapiens

This use of flags to remember which part of the file you're in, from one iteration of a loop to the next, is a common technique when extracting information from files that have multiline sections. As the files and their fields get more complex, the code must keep track of many flags at a time to remember which part of the file it's in and what information needs to be extracted. It works, but as the files become more complex, so does the code. It becomes hard to read and hard to modify. So let's look at regular expressions as a vehicle for parsing annotations.

When to Use Regular Expressions

We've used two methods to parse GenBank files: regular expressions and looping through arrays of lines and setting flags. We used both methods to separate the annotation from the sequence in a previous section of this chapter. Both methods were equally well suited, since in GenBank files, the annotation is followed by the sequence, clearly delimited by an ORIGIN line: a simple structure. However, parsing the annotations seems a bit more complicated; therefore, let's try to use regular expressions to accomplish the task.

To begin, let's wrap the code we've been working on into some convenient subroutines to focus on parsing the annotations. You'll want to fetch GenBank records one at a time from a library (a file containing one or more GenBank records), extract the annotations and the sequence, and then if desired parse the annotations. This would be useful if, say, you were looking for some motif in a GenBank library. Then you can search for the motif, and, if found, you can parse the annotations to look for additional information about the sequence.

As mentioned previously, we'll use the file library.gb, which you can download from this book's web site.

Since dealing with annotation data is somewhat complex, let's take a minute to break our tasks into convenient subroutines. Here's the pseudocode:

sub open_file
    given the filename, return the filehandle
 
sub get_next_record
    given the filehandle, get the record
    (we can get the offset by first calling "tell")
 
sub get_annotation_and_dna
    given a record, split it into annotation and cleaned-up sequence
 
sub search_sequence
    given a sequence and a regular expression,
      return array of locations of hits
 
sub search_annotation
    given a GenBank annotation and a regular expression,
      return array of locations of hits
 
sub parse_annotation
    separate out the fields of the annotation in a convenient form
 
sub parse_features
    given the features field, separate out the components

The idea is to make a subroutine for each important task you want to accomplish and then combine them into useful programs. Some of these can be combined into other subroutines: for instance, perhaps you want to open a file and get the record from it, all in one subroutine call.

You're designing these subroutines to work with library files, that is, files with multiple GenBank records. You pass the filehandle into the subroutines as an argument, so that your subroutines can access open library files as represented by the filehandles. Doing so enables you to have a get_next_record function, which is handy in a loop. Using the Perl function tell also allows you to save the byte offset of any record of interest, and then return later and extract the record at that byte offset very quickly. (A byte offset is just the number of characters into the file where the information of interest lies.) The operating system supports Perl in letting you go immediately to any byte offset location in even huge files, thus bypassing the usual way of opening the file and reading from the beginning until you get where you want to be.

Using a byte offset is important when you're dealing with large files. Perl gives you built-in functions such as seek that allow you, on an open file, to go immediately to any location in the file. The idea is that when you find something in a file, you can save the byte offset using the Perl function tell. Then, when you want to return to that point in the file, you can just call the Perl function seek with the byte offset as an argument. You'll see this later in this chapter when you build a DBM file to look up records based on their accession numbers. But the main point is that with a 250-MB file, it takes too long to find something by searching from the beginning, and there are ways of getting around it.

The parsing of the data is done in three steps, according to the design:

  1. You'll separate out the annotation and the sequence (which you'll clean up by removing whitespace, etc., and making it a simple string of sequence). Even at this step, you can search for motifs in the sequence, as well as look for text in the annotation.
  2. Extract out the fields.
  3. Parse the features table.

These steps seem natural, and, depending on what you want to do, allow you to parse to whatever depth is needed.

Here's a main program in pseudocode that shows how to use those subroutines:

open_file 
 
while ( get_next_record )
 
    get_annotation_and_dna
 
    if ( search_sequence for a motif AND 
         search_annotation for chromosome 22 )
 
        parse_annotation
 
        parse_features to get sizes of exons, look for small sizes
    }
}
 
return accession numbers of records meeting the criteria

This example shows how to use subroutines to answer a question such as: what are the genes on chromosome 22 that contain a given motif and have small exons?

Main Program

Let's test these subroutines with Example 10-5, which has some subroutine definitions that will be added to the BeginPerlBioinfo.pm module:

Example 10-5: GenBank library subroutines

#!/usr/bin/perl
#  - test program of GenBank library subroutines
 
use strict;
use warnings;
# Don't use BeginPerlBioinfo
# Since all subroutines defined in this file
# use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# Declare and initialize variables
my $fh; # variable to store filehandle
my $record;
my $dna;
my $annotation;
my $offset;
my $library = 'library.gb';
 
# Perform some standard subroutines for test
$fh = open_file($library);
 
$offset = tell($fh);
 
while( $record = get_next_record($fh) ) {
 
    ($annotation, $dna) = get_annotation_and_dna($record);
 
    if( search_sequence($dna, 'AAA[CG].')) {
        print "Sequence found in record at offset $offset\n";
    }
    if( search_annotation($annotation, 'homo sapiens')) {
        print "Annotation found in record at offset $offset\n";
    }
 
    $offset = tell($fh);
}
 
exit;
 
################################################################################
# Subroutines
################################################################################
 
# open_file
#
#   - given filename, set filehandle
 
sub open_file {
 
    my($filename) = @_;
    my $fh;
 
    unless(open($fh, $filename)) {
        print "Cannot open file $filename\n";
        exit;
    }
    return $fh;
}
 
# get_next_record
#
#   - given GenBank record, get annotation and DNA
 
sub get_next_record {
 
    my($fh) = @_;
 
    my($offset);
    my($record) = '';
    my($save_input_separator) = $/;
 
    $/ = "//\n";
 
    $record = <$fh>;
 
    $/ = $save_input_separator;
 
    return $record;
}
 
# get_annotation_and_dna
#
#   - given GenBank record, get annotation and DNA
 
sub get_annotation_and_dna {
 
    my($record) = @_;
 
    my($annotation) = '';
    my($dna) = '';
 
    # Now separate the annotation from the sequence data
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
 
    # clean the sequence of any whitespace or / characters 
    #  (the / has to be written \/ in the character class, because
    #   / is a metacharacter, so it must be "escaped" with \)
    $dna =~ s/[\s\/]//g;
 
    return($annotation, $dna)
}
 
# search_sequence
#
#   - search sequence with regular expression
 
sub search_sequence {
 
    my($sequence, $regularexpression) = @_;
 
    my(@locations) = (  );
 
    while( $sequence =~ /$regularexpression/ig ) {
        push( @locations, pos );
    }
 
    return (@locations);
}
 
# search_annotation
#
#   - search annotation with regular expression
 
sub search_annotation {
 
    my($annotation, $regularexpression) = @_;
 
    my(@locations) = (  );
 
    # note the /s modifier--. matches any character including newline
    while( $annotation =~ /$regularexpression/isg ) {
        push( @locations, pos );
    }
 
    return (@locations);
}

Example 10-5 generates the following output on our little GenBank library:

Sequence found in record at offset 0
Annotation found in record at offset 0
Sequence found in record at offset 6256
Annotation found in record at offset 6256
Sequence found in record at offset 12366
Annotation found in record at offset 12366
Sequence found in record at offset 17730
Annotation found in record at offset 17730
Sequence found in record at offset 22340
Annotation found in record at offset 22340

The tell function reports the byte offset of the file up to the point where it's been read; so you want to first call tell and then read the record to get the proper offset associated with the beginning of the record.

Parsing Annotations at the Top Level

Now let's parse the annotations.

There is a document from NCBI we mentioned earlier that gives the details of the structure of a GenBank record. This file is gbrel.txt and is part of the GenBank release, available at the NCBI web site and their FTP site. It's updated with every release (every two months at present), and it includes notices of changes to the format. If you program with GenBank records, you should read this document and keep a copy around for reference use, and check periodically for announced changes in the GenBank record format.

If you look back at the complete GenBank record earlier in this chapter, you'll see that the annotations have a certain structure. You have some fields, such as LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, REFERENCE, FEATURES, and BASE COUNT that start at the beginning of a line. Some of these fields have subfields, especially the FEATURES field, which has a fairly complex structure.

But for now, let's just extract the top-level fields. You will need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.

Here's a regular expression that matches our definition of a field:

/^[A-Z].*\n(^\s.*\n)*/m

What does this regular expression say? First of all, it has the /m pattern matching modifier, which means the caret ^ and the dollar sign $ also match around embedded newlines (not just at the beginning and end of the entire string, which is the default behavior).

The first part of the regular expression:

^[A-Z].*\n

matches a capital letter at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. That's a good description of the first lines of the fields you're trying to match.

The second part of the regular expression:

(^\s.*\n)*

matches a space or tab \s at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. This is surrounded by parentheses and followed by a *, which means 0 or more such lines. This matches succeeding lines in a field, lines that start with whitespace. A field may have no extra lines of this sort or several such lines.

So, the two parts of the regular expression combined match the fields with their optional additional lines.

Example 10-6 shows a subroutine that, given the annotations section of a GenBank record stored in a scalar variable, returns a hash with keys equal to the names of the top-level fields and values equal to the contents of those fields.

Example 10-6: Parsing Genbank annotation

#!/usr/bin/perl
#  - test program for parse_annotation subroutine
 
use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my $library = 'library.gb';
 
# Open library and read a record
$fh = open_file($library);
 
$record = get_next_record($fh);
 
# Parse the sequence and annotation
($annotation, $dna) = get_annotation_and_dna($record);
 
# Extract the fields of the annotation
%fields = parse_annotation($annotation);
 
# Print the fields
foreach my $key (keys %fields) {
    print "******** $key *********\n";
    print $fields{$key};
}
 
exit;
 
################################################################################
# Subroutine
################################################################################
 
# parse_annotation
#
#  given a GenBank annotation, returns a hash  with
#   keys: the field names
#   values: the fields
 
sub parse_annotation {
 
    my($annotation) = @_; 
    my(%results) = (  );
 
    while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
        my $value = $&;
        (my $key = $value) =~ s/^([A-Z]+).*/$1/s;
        $results{$key} = $value;
    }
 
    return %results;
}

In the subroutine parse_annotation, note how the variables $key and $value are scoped within the while block. One benefit of this is that you don't have to reinitialize the variables each time through the loop. Also note that the key is the name of the field, and the value is the whole field.

You should take the time to understand the regular expression that extracts the field name for the key:

(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

This first assigns $key the value $value. It then replaces everything in $key (note the /s modifier for embedded newlines) with $1, which is a special variable pattern between the first pair of parentheses ([A-Z]+). This pattern is one or more capital letters (anchored to the beginning of the string, i.e., the field name), so it sets $key to the value of the first word in the field name.

You get the following output from Example 10-6 (the test just fetches the first record in the GenBank library):

******** SOURCE *********
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
******** DEFINITION *********
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
******** KEYWORDS *********
KEYWORDS    .
******** VERSION *********
VERSION     AB031069.1  GI:8100074
******** FEATURES *********
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
******** REFERENCE *********
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
******** ACCESSION *********
ACCESSION   AB031069
******** LOCUS *********
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
******** ORIGIN *********
ORIGIN      
******** BASE *********
BASE COUNT      564 a    715 c    768 g    440 t

As you see, the method is working, and apart from the difficulty of reading the regular expressions (which will become easier with practice), the code is very straightforward, just a few short subroutines.

Parsing the FEATURES Table

Let's take this one step further and parse the features table to its next level, composed of the source, gene, and CDS features keys. (See later in this section for a more complete list of these features keys.) In the exercises at the end of the chapter, you'll be challenged to descend further into the FEATURES table.

To study the FEATURES table, you should first look over the NCBI gbrel.txt document mentioned previously. Then you should study the most complete documentation for the FEATURES table, available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html.

Features

Although our GenBank entry is fairly simple and includes only three features, there are actually quite a few of them. Notice that the parsing code will find all of them, because it's just looking at the structure of the document, not for specific features.

The following is a list of the features defined for GenBank records. Although lengthy, I think it's important to read through it to get an idea of the range of information that may be present in a GenBank record.

allele
Obsolete; see variation feature key

attenuator
Sequence related to transcription termination

C_region
Span of the C immunological feature

CAAT_signal
CAAT box in eukaryotic promoters

CDS
Sequence coding for amino acids in protein (includes stop codon)

conflict
Independent sequence determinations differ

D-loop
Displacement loop

D_segment
Span of the D immunological feature

enhancer
Cis-acting enhancer of promoter function

exon
Region that codes for part of spliced mRNA

gene
Region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned

GC_signal
GC box in eukaryotic promoters

iDNA
Intervening DNA eliminated by recombination

intron
Transcribed region excised by mRNA splicing

J_region
Span of the J immunological feature

LTR
Long terminal repeat

mat_peptide
Mature peptide coding region (doesn't include stop codon)

misc_binding
Miscellaneous binding site

misc_difference
Miscellaneous difference feature

misc_feature
Region of biological significance that can't be described by any other feature

misc_recomb
Miscellaneous recombination feature

misc_RNA
Miscellaneous transcript feature not defined by other RNA keys

misc_signal
Miscellaneous signal

misc_structure
Miscellaneous DNA or RNA structure

modified_base
The indicated base is a modified nucleotide

mRNA
Messenger RNA

mutation
Obsolete: see variation feature key

N_region
Span of the N immunological feature

old_sequence
Presented sequence revises a previous version

polyA_signal
Signal for cleavage and polyadenylation

polyA_site
Site at which polyadenine is added to mRNA

precursor_RNA
Any RNA species that isn't yet the mature RNA product

prim_transcript
Primary (unprocessed) transcript

primer
Primer binding region used with PCR

primer_bind
Noncovalent primer binding site

promoter
A region involved in transcription initiation

protein_bind
Noncovalent protein binding site on DNA or RNA

RBS
Ribosome binding site

rep_origin
Replication origin for duplex DNA

repeat_region
Sequence containing repeated subsequences

repeat_unit
One repeated unit of a repeat_region

rRNA
Ribosomal RNA

S_region
Span of the S immunological feature

satellite
Satellite repeated sequence

scRNA
Small cytoplasmic RNA

sig_peptide
Signal peptide coding region

snRNA
Small nuclear RNA

source
Biological source of the sequence data represented by a GenBank record; mandatory feature, one or more per record; for organisms that have been incorporated within the NCBI taxonomy database, an associated /db_xref="taxon:NNNN" qualifier will be present (where NNNNN is the numeric identifier assigned to the organism within the NCBI taxonomy database)

stem_loop
Hairpin loop structure in DNA or RNA

STS
Sequence Tagged Site: operationally unique sequence that identifies the combination of primer spans used in a PCR assay

TATA_signal
TATA box in eukaryotic promoters

terminator
Sequence causing transcription termination

transit_peptide
Transit peptide coding region

transposon
Transposable element (TN)

tRNA
Transfer RNA

unsure
Authors are unsure about the sequence in this region

V_region
Span of the V immunological feature

variation
A related population contains stable mutation

-
Placeholder (hyphen)

-10_signal
Pribnow box in prokaryotic promoters

-35_signal
-35 box in prokaryotic promoters

3'clip
3'-most region of a precursor transcript removed in processing

3'UTR
3' untranslated region (trailer)

5'clip
5'-most region of a precursor transcript removed in processing

5'UTR
5' untranslated region (leader)

These feature keys can have their own additional features, which you'll see here and in the exercises.

Parsing

Example 10-8 finds whatever features are present and returns an array populated with them. It doesn't look for the complete list of features as presented in the last section; it finds just the features that are actually present in the GenBank record and returns them for further use.

It's often the case that there are multiple instances of the same feature in a record. For instance, there may be several exons specified in the FEATURES table of a GenBank record. For this reason we'll store the features as elements in an array, rather than in a hash keyed on the feature name (as this allows you to store, for instance, only one instance of an exon).

Example 10-7: Testing subroutine parse_features

#!/usr/bin/perl
#  - main program to test parse_features
 
use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my @features;
my $library = 'library.gb';
 
# Get the fields from the first GenBank record in a library
$fh = open_file($library);
 
$record = get_next_record($fh);
 
($annotation, $dna) = get_annotation_and_dna($record);
 
%fields = parse_annotation($annotation);
 
# Extract the features from the FEATURES table
@features = parse_features($fields{'FEATURES'});
 
# Print out the features
foreach my $feature (@features) {
 
     # extract the name of the feature (or "feature key")
     my($featurename) = ($feature =~ /^ {5}(\S+)/);
 
     print "******** $featurename *********\n";
     print $feature;
}
 
exit;
 
############################################################################ 
# Subroutine
############################################################################ 
 
# parse_features
#
#  extract the features from the FEATURES field of a GenBank record
 
sub parse_features {
 
     my($features) = @_;   # entire FEATURES field in a scalar variable
 
     # Declare and initialize variables
     my(@features) = ();   # used to store the individual features
 
     # Extract the features
     while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
         my $feature = $&;
     push(@features, $feature);
     }
 
     return @features;
}

Example 10-8 gives the output:

******** source *********
      source          1..2487
                      /organism="Homo sapiens"
                      /db_xref="taxon:9606"
                      /sex="male"
                      /cell_line="HuS-L12"
                      /cell_type="lung fibroblast"
                      /dev_stage="embryo"
******** gene *********
      gene            229..2199
                      /gene="PCCX1"
******** CDS *********
      CDS             229..2199
                      /gene="PCCX1"
                      /note="a nuclear protein carrying a PHD finger and a CXXC
                      domain"
                      /codon_start=1
                      /product="protein containing CXXC domain 1"
                      /protein_id="BAA96307.1"
                      /db_xref="GI:8100075"
                      /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                      NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                      RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                      QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                      FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                      EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                      KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                      DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                      FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                      YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                      PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                      AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR

In subroutine parse_features of Example 10-8, the regular expression that extracts the features is much like the regular expression used in Example 10-6 to parse the top level of the annotations. Let's look at the essential parsing code of Example 10-8:

     while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {

On the whole, and in brief, this regular expression finds features formatted with the first lines beginning with 5 spaces, and optional continuation lines beginning with 21 spaces.

First, note that the pattern modifier /m enables the ^ metacharacter to match after embedded newlines. Also, the {5} and {21} are quantifiers that specify there should be exactly 5, or 21, of the preceding item, which in both cases is a space.

The regular expression is in two parts, corresponding to the first line and optional continuation lines of the feature. The first part ^ {5}\S.*\n means that the beginning of a line (^) has 5 spaces ({5}), followed by a non-whitespace character (\S) followed by any number of non-newlines (.*) followed by a newline (\n). The second part of the regular expression, (^ {21}\S.*\n)* means the beginning of a line (^) has 21 spaces ({21}) followed by a non-whitespace character (\S) followed by any number of non-newlines (.*) followed by a newline (\n); and there may be 0 or more such lines, indicated by the ()* around the whole expression.

The main program has a short regular expression along similar lines to extract the feature name (also called the feature key) from the feature.

So, again, success. The FEATURES table is now decomposed or "parsed" in some detail, down to the level of separating the individual features. The next stage in parsing the FEATURES table is to extract the detailed information for each feature. This includes the location (given on the same line as the feature name, and possibly on additional lines); and the qualifiers indicated by a slash, a qualifier name, and if applicable, an equals sign and additional information of various kinds, possibly continued on additional lines.

I'll leave this final step for the exercises. It's a fairly straightforward extension of the approach we've been using to parse the features. You will want to consult the documentation from the NCBI web site for complete details about the structure of the FEATURES table before trying to parse the location and qualifiers from a feature.

The method I've used to parse the FEATURES table maintains the structure of the information. However, sometimes you just want to see if some word such as "endonulease" appears anywhere in the record. For this, recall that you created a search_annotation subroutine in Example 10-5 that searches for any regular expression in the entire annotation; very often, this is all you really need. As you've now seen, however, when you really need to dig into the FEATURES table, Perl has its own features that make the job possible and even fairly easy.

Indexing GenBank with DBM

DBM stands for Database Management. Perl provides a set of built-in functions that give Perl programmers access to DBM files.

DBM Essentials

When you open a DBM file, you access it like a hash: you give it keys and it returns values, and you can add and delete key-value pairs. What's useful about DBM is that it saves the key-value data in a permanent disk file on your computer. It can thus save information between the times you run your program; it can also serve as a way to share information between different programs that need the same data. A DBM file can get very big without killing the main memory on your computer and making your program--and everything else--slow to a crawl.

There are two functions, dbmopen and dbmclose, that "tie" a hash to a DBM file; then you just use the hash. As you've seen, with a hash, lookups are easy, as are definitions. You can get a list of all the keys from a hash called %my_hash by typing keys %my_hash. You then can get a list of all values by typing values %my_hash. For large DBM files, you may not want to do this; the Perl function each allows you to read key-value pairs one at a time, thus saving the memory of your running program. There is also a delete function to remove the definitions of keys:

delete $my_hash{'DNA'}

entirely removes that key from the hash.

DBM files are a very simple database. They don't have the power of a relational database such as MySQL, Oracle, or PostgreSQL; however, it's remarkable how often a simple database is all that a problem really needs. When you have a set of key-value data (or several such sets), consider using DBM. It's certainly easy to use with Perl.

The main wrinkle to using DBM is that there are several, slightly different DBM implementations--NDBM, GDBM, SDBM, and Berkeley DB. The differences are small but real; but for most purposes, the implementations are interchangeable. Newer versions of Perl give you Berkeley DB by default, and it's easy to get it and install it for your Perl if you want. If you don't have really long keys or values, it's not a problem. Some older DBMs require you to add null bytes to keys and delete them from values:

$value = $my_hash{"$key\0"};
chop $value;

Chances are good that you won't have to do that. Berkeley DB handles long strings well (some of the other DBM implementations have limits), and because you have some potentially long strings in biology, I recommend installing Berkeley DB if you don't have it.

A DBM Database for GenBank

You've seen how to extract information from a GenBank record or from a library of GenBank records. You've just seen how DBM files can save your hash data on your hard disk between program runs. You've also seen the use of tell and seek to quickly access a location in a file.

Now let's combine the three ideas and use DBM to build a database of information about a GenBank library. It'll be something simple: you'll extract the accession numbers for the keys and store the byte offsets in the GenBank library of records for the values. You'll add some code that, given a library and an offset, returns the record at that offset, and write a main program that allows the user to interactively request GenBank records by accession number. When complete, your program should very quickly return a GenBank record if given its accession number.

This general idea is extended in the exercises at the end of the chapter to a considerable extent; you may want to glance ahead at them now to get an idea of the potential power of the technique I'm about to present.

With just the appropriate amount of further ado, here is a code fragment that opens (creating if necessary) a DBM file:

unless(dbmopen(%my_hash, 'DBNAME', 0644)) {
 
    print "Cannot open DBM file DBNAME with mode 0644\n";
    exit;
}

%my_hash is like any other hash in Perl, but it will be tied to the DBM file with this statement. DBNAME is the basename of the actual DBM files that will be created. Some DBM versions create one file of exactly that name; others create two files with file extensions .dir and .pag.

Another parameter is called the mode. Unix or Linux users will be familiar with file permissions in this form. Many possibilities exist; here are the most common ones:

0644
You can read and write; others can just read.

0600
Only you can read or write.

0666
Anyone can read or write.

0444
Anyone can read (nobody can write).

0400
Only you can read (nobody else can do anything).

The dbmopen call fails if you try to open a file with a mode that assumes there are more permissions than were conferred on the DBM file when it was created. Usually, the mode 0644 is declared by the owner if only the owner should be allowed to write, and 0444 is declared by readers. Mode 0666 is declared by the owner and others if the file is meant to be read or written by anyone.

That's pretty much it; DBM files are that simple. Example 10-8 displays a DBM file that stores key-value pairs of accession numbers of GenBank records for keys, and byte offsets of the records as values.

Example 10-8: A DBM index of a GenBank library

#!/usr/bin/perl
#  - make a DBM index of a GenBank library,
#     and demonstrate its use interactively
 
use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module
 
# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my %dbm;
my $answer;
my $offset;
my $library = 'library.gb';
 
# open DBM file, creating if necessary
unless(dbmopen(%dbm, 'GB', 0644)) {
    print "Cannot open DBM file GB with mode 0644\n";
    exit;
}
 
# Parse GenBank library, saving accession number and offset in DBM file
$fh = open_file($library);
 
$offset = tell($fh);
 
while ( $record = get_next_record($fh) ) {
 
    # Get accession field for this record.
    ($annotation, $dna) = get_annotation_and_dna($record);
 
    %fields = parse_annotation($annotation);
 
    my $accession = $fields{'ACCESSION'};
 
    # extract just the accession number from the accession field
    # --remove any trailing spaces
    $accession =~ s/^ACCESSION\s*//;
 
    $accession =~ s/\s*$//;
 
    # store the key/value of  accession/offset
    $dbm{$accession} = $offset;
 
    # get offset for next record
    $offset = tell($fh);
}
 
# Now interactively query the DBM database with accession numbers
#  to see associated records
 
print "Here are the available accession numbers:\n";
 
print join ( "\n", keys %dbm ), "\n";
 
print "Enter accession number (or quit): ";
 
while( $answer = <STDIN> ) {
    chomp $answer;
    if($answer =~ /^\s*q/) {
        last;
    }
    $offset = $dbm{$answer};
 
    if ($offset) {
        seek($fh, $offset, 0);
        $record = get_next_record($fh);
        print $record;
    }else{
        print "Do not have an entry for accession number $answer\n";
    }
 
    print "\nEnter accession number (or quit): ";
}
 
dbmclose(%dbm);
 
close($fh);
 
exit;

Here's the truncated output of Example 10-8:

Here are the available accession numbers:
XM_006271
NM_021964
XM_009873
AB031069
XM_006269
Enter accession number (or quit): NM_021964
LOCUS       NM_021964    3032 bp    mRNA            PRI       14-MAR-2001
DEFINITION  Homo sapiens zinc finger protein 148 (pHZ-52) (ZNF148), mRNA.
...
//
 
Enter accession number (or quit): q

Exercises

Exercise 10.1
Go to the NCBI, EMBL, and EBI web sites and become familiar with their use.

Exercise 10.2
Read the GenBank format documentation, gbrel.txt.

Exercise 10.3
Write a subroutine that passes a hash by value. Now rewrite it to pass the hash by reference.

Exercise 10.4
Design a module of subroutines to handle the following kinds of data: a flat file containing records consisting of gene names on a line and extra information of any sort on succeeding lines, followed by a blank line. Your subroutines should be able to read in the data and then do a fast lookup on the information associated with a gene name. You should also be able to add new records to the flat file. Now reuse this module to build an address book program.

Exercise 10.5
Descend further into the FEATURES table. Parse the features in the table into their next level by parsing the feature names, locations, and qualifiers. Check the document gbrel.txt for definitions of the structures of the fields.

Exercise 10.6
Write a program that takes a long DNA sequence as input and outputs the counts of all four-base subsequences (256 of them in all), sorted by frequency. A four-base subsequence starts at each location 1, 2, 3, and so on. (This kind of word-frequency analysis is common to many fields of study, including linguistics, computer science, and music.)

Exercise 10.7
Extend the program in Exercise 10.6 to count all the sequences in a GenBank library.

Exercise 10.8
Given an amino acid, find the frequency of occurrence of the adjacent amino acids coded in a DNA sequence; or in a GenBank library.

Exercise 10.10
Extract all the words (excluding words like "the" or other unnecessary words) from the annotation of a library of GenBank records. For each word found, add the offset of the GenBank record in the library to a DBM file that has keys equal to the words, and values that are strings with offsets separated by spaces. In other words, one key can have a space-separated list of offsets for a value. Then you can quickly find all records containing a word like "fibroblast" with a simple lookup, followed by extracting the offsets and seeking into the library with those offsets. How big is your DBM file compared to the GenBank library? What might be involved in constructing a search engine for the annotations in all of GenBank? For human DNA only?

Exercise 10.10
Write a program to make a custom library of oncogenes from the GBPRI division of GenBank.


1. GenBank is also distributed in ASN.1 format, for which you need specialized tools, provided by NCBI.

2. The data is also distributed in the ASN.1 format.

Back to: Beginning Perl for Bioinformatics


oreilly.com Home | O'Reilly Bookstores | How to Order | O'Reilly Contacts
International | About O'Reilly | Affiliated Companies | Privacy Policy

© 2001, O'Reilly & Associates, Inc.
webmaster@oreilly.com