Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases
By Scott Markel, Darryl León
January 2003
Pages: 302

Table of Contents

Chapter 1: FASTA Format
The most common sequence format you'll encounter is FASTA. This format is quite simple. The first line of a sequence entry consists of ">", followed by an identifier that contains no whitespace. The identifier can be followed by whitespace and a comment or description. This first line is referred to as the comment or description line. One or more sequence data lines follow. The sequence data lines need not all have the same length; common line lengths are 60, 70, 72, and 80. For details, see Section 1.3 at the end of this chapter. Example 1-1 contains a sample FASTA entry.
Example 1-1. Sample FASTA entry
>gi|29848|emb|X61622.1|HSCDK2MR H.sapiens CDK2 mRNA
ATGGAGAACTTCCAAAAGGTGGAAAAGATCGGAGAGGGCACGTACGGAGTTGTGTACAAAGCCAGAAACA
AGTTGACGGGAGAGGTGGTGGCGCTTAAGAAAATCCGCCTGGACACTGAGACTGAGGGTGTGCCCAGTAC
TGCCATCCGAGAGATCTCTCTGCTTAAGGAGCTTAACCATCCTAATATTGTCAAGCTGCTGGATGTCATT
CACACAGAAAATAAACTCTACCTGGTTTTTGAATTTCTGCACCAAGATCTCAAGAAATTCATGGATGCCT
CTGCTCTCACTGGCATTCCTCTTCCCCTCATCAAGAGCTATCTGTTCCAGCTGCTCCAGGGCCTAGCTTT
CTGCCATTCTCATCGGGTCCTCCACCGAGACCTTAAACCTCAGAATCTGCTTATTAACACAGAGGGGGCC
ATCAAGCTAGCAGACTTTGGACTAGCCAGAGCTTTTGGAGTCCCTGTTCGTACTTACACCCATGAGGTGG
TGACCCTGTGGTACCGAGCTCCTGAAATCCTCCTGGGCTCGAAATATTATTCCACAGCTGTGGACATCTG
GAGCCTGGGCTGCATCTTTGCTGAGATGGTGACTCGCCGGGCCCTGTTCCCTGGAGATTCTGAGATTGAC
CAGCTCTTCCGGATCTTTCGGACTCTGGGGACCCCAGATGAGGTGGTGTGGCCAGGAGTTACTTCTATGC
CTGATTACAAGCCAAGTTTCCCCAAGTGGGCCCGGCAAGATTTTAGTAAAGTTGTACCTCCCCTGGATGA
AGATGGACGGAGCTTGTTATCGCAAATGCTGCACTACGACCCTAACAAGCGGATTTCGGCCAAGGCAGCC
CTGGCTCACCCTTTCTTCCAGGATGTGACCAAGCCAGTACCCCATCTTCGACTCTGATAGCCTTCTTGAA
GCCCCCGACCCTAATCGGCTCACCCTCTCCTCCAGTGTGGGCTTGACCAGCTTGGCCTTGGGCTATTTGG
ACTCAGGTGGGCCCTCTGAACTTGCCTTAAACACTCACCTTCTAGTCTTAACCAGCCAACTCTGGGAATA
CAGGGGTGAAAGGGGGGAACCAGTGAAAATGAAAGGAAGTTTCAGTATTAGATGCACTTAAGTTAGCCTC
CACCACCCTTTCCCCCTTCTCTTAGTTATTGCTGAAGAGGGTTGGTATAAAAATAATTTTAAAAAAGCCT
TCCTACACGTTAGATTTGCCGTACCAATCTCTGAATGCCCCATAATTATTATTTCCAGTGTTTGGGATGA
CCAGGATCCCAAGCCTCCTGCTGCCACAATGTTTATAAAGGCCAAATGATAGCGGGGGCTAAGTTGGTGC
TTTTGAGAATTAAGTAAAACAAAACCACTGGGAGGAGTCTATTTTAAAGAATTCGGTTAAAAAATAGATC
CAATCAGTTTATACCCTAGTTAGTGTTTTCCTCACCTAATAGGCTGGGAGACTGAAGACTCAGCCCGGGT
GGGGGT
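The layout described above is simple enough to read with a few lines of code. The following Python function is a minimal sketch, not a reference implementation: the name parse_fasta is hypothetical, and the sketch assumes the whole file fits in memory and that everything up to the first whitespace on the ">" line is the identifier.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each entry in FASTA text."""
    identifier = None
    description = ""
    chunks = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous entry, if any.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence data lines may have any length; just concatenate them.
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)
```

Applied to Example 1-1, the identifier would be gi|29848|emb|X61622.1|HSCDK2MR and the description would be "H.sapiens CDK2 mRNA".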
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
NCBI's Sequence Identifier Syntax
The National Center for Biotechnology Information (NCBI) uses the following syntax for its BLAST server. NCBI is part of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The following (including the table) is NCBI's description. See ftp://ftp.ncbi.nih.gov/blast/db/README for details.
The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table below lists the identifiers for the databases from which the sequences were derived.
Database name                   Identifier syntax
GenBank                         gb|accession|locus
EMBL Data Library               emb|accession|locus
DDBJ, DNA Database of Japan     dbj|accession|locus
NBRF PIR                        pir||entry
Protein Research Foundation     prf||name
SWISS-PROT                      sp|accession|entry name
Brookhaven Protein Data Bank    pdb|entry|chain
Patents                         pat|country|number
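To see how the bar-delimited syntax can be taken apart, here is a hedged Python sketch. The names FIELDS and parse_identifier are hypothetical. The field names come from the table above; the "gi" tag is an assumption based on the header of Example 1-1, since it is not among the rows shown here.

```python
from typing import Dict, List, Tuple

# Field names for each database tag, in the order given in the table.
# An empty name marks a position that the syntax leaves blank (pir||entry).
# "gi" is an assumption taken from the header line of Example 1-1.
FIELDS: Dict[str, Tuple[str, ...]] = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": ("", "entry"),
    "prf": ("", "name"),
    "sp": ("accession", "entry name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "gi": ("number",),
}


def parse_identifier(identifier: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split an identifier into (tag, {field: value}) pairs."""
    tokens = identifier.split("|")
    result = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in FIELDS:
            raise ValueError(f"unknown database tag: {tag!r}")
        names = FIELDS[tag]
        values = tokens[i + 1 : i + 1 + len(names)]
        values += [""] * (len(names) - len(values))  # pad missing trailing fields
        record = {name: value for name, value in zip(names, values) if name}
        result.append((tag, record))
        i += 1 + len(names)
    return result
```

For the identifier in Example 1-1, this sketch would return a "gi" record followed by an "emb" record whose accession is X61622.1 and whose locus is HSCDK2MR.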
NCBI's Non-Redundant Database Syntax
You should be aware of one additional syntax that's used by the NCBI for their non-redundant database. Since the whole point of the database is to have sequence entries listed only once, the description line syntax allows for more than one set of identifier and description. The sets are delimited by Ctrl-A characters. Here's what NCBI has to say about this.
These files are all non-redundant; identical sequences are merged into one entry. To be merged two sequences must have identical lengths and every residue (or basepair) at every position must be the same. The FASTA deflines for the different entries that belong to one sequence are separated by control-A's (^A). In the following example, both entries gi|1469284 and gi|1477453 have the same sequence, in every respect.
>gi|1469284 (U05042) afuC gene product [Actinobacillus 
pleuropneumoniae]^Agi|1477453 (U04954) afuC gene product [Actinobacillus 
pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
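Splitting a merged description line back into its individual entries is a one-step operation once you know the delimiter. The sketch below is an illustration only: split_nr_defline is a hypothetical name, and it assumes the description line occupies a single physical line in the file (the example above is wrapped for display) and contains a literal control-A character (byte 0x01) where the example shows ^A.

```python
from typing import List, Tuple

CTRL_A = "\x01"  # the control-A character, displayed as ^A in the example


def split_nr_defline(defline: str) -> List[Tuple[str, str]]:
    """Split a non-redundant description line into (identifier, description) pairs."""
    if defline.startswith(">"):
        defline = defline[1:]
    pairs = []
    for part in defline.split(CTRL_A):
        fields = part.strip().split(None, 1)
        if not fields:
            continue  # skip empty segments
        identifier = fields[0]
        description = fields[1] if len(fields) > 1 else ""
        pairs.append((identifier, description))
    return pairs
```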
References
  • Pearson, W. R., and D. J. Lipman. 1988. Improved Tools for Biological Sequence Comparison. Proceedings of the National Academy of Sciences 85:2444-2448.
    NCBI Sequence Identifier Syntax
    ftp://ftp.ncbi.nih.gov/blast/db/README
    Non-redundant database
    ftp://ftp.ncbi.nih.gov/blast/db/README
Chapter 2: GenBank/EMBL/DDBJ
GenBank is maintained by the National Center for Biotechnology Information (NCBI). It is joined by the DNA Data Bank of Japan (DDBJ, in Mishima, Japan) and the European Molecular Biology Laboratory (EMBL, in Heidelberg, Germany) nucleotide database from the European Bioinformatics Institute (EBI, in Hinxton, UK) to form the International Nucleotide Sequence Database Collaboration. Although the three repositories have separate sites for data submission, they share sequence data and allow daily downloads of sequence files by the public. We're using GenBank Release 132, EMBL Release 72, and DDBJ Release 51.
Example Flat Files
Many software tools use sequence flat files. GenBank, DDBJ, and EMBL each have their own flat file format. Flat files from each of these databases are shown in the next several sections, and these examples are used to illustrate the field definitions and the feature table sections for each repository. The sequence for cyclin-dependent kinase 2 (CDK2) is used as the example for all of the sequence flat file entries and for the FASTA file.
GenBank Example Flat File
Example 2-1 contains a sample sequence entry from GenBank. This entry contains terms from the GenBank Field Definitions and the DDBJ/EMBL/GenBank Feature Table, discussed later in this chapter.
Example 2-1. Sample GenBank entry
LOCUS       HSCDK2MR                1476 bp    mRNA    linear   PRI 15-JAN-1992
DEFINITION  H.sapiens CDK2 mRNA.
ACCESSION   X61622
VERSION     X61622.1  GI:29848
KEYWORDS    CDK2 gene; cell cycle regulation protein; cyclin A binding; protein
            kinase.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1476)
  AUTHORS   Elledge,S.J. and Spottswood,M.R.
  TITLE     A new human p34 protein kinase, CDK2, identified by complementation
            of a cdc28 mutation in Saccharomyces cerevisiae, is a homolog of
            Xenopus Eg1
  JOURNAL   EMBO J. 10 (9), 2653-2659 (1991)
  MEDLINE   91330891
REFERENCE   2  (bases 1 to 1476)
  AUTHORS   Elledge,S.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-NOV-1991) S.J. Elledge, Dept. of Biochemistry, Baylor
            College of Medicine, 1 Baylor Place, Houston, TX 77030, USA
FEATURES             Location/Qualifiers
     source          1..1476
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="pSE1000"
                     /cell_line="EBV transformed Human peripheral lymphocyte
                     (B-cell)"
                     /clone_lib="lambda YES-R cDNA library"
     gene            1..1476
                     /gene="CDK2"
     CDS             1..897
                     /gene="CDK2"
                     /function="protein kinase"
                     /note="cell division kinase. CDC2 homolog"
                     /codon_start=1
                     /protein_id="CAA43807.1"
                     /db_xref="GI:29849"
                     /db_xref="SWISS-PROT:P24941"
                     /translation="MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGV
                     PSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLP
                     LIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYT
                     HEVVTLWYRAPEILLGSKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRT
                     LGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRIS
                     AKAALAHPFFQDVTKPVPHLRL"
BASE COUNT      368 a    372 c    351 g    385 t
ORIGIN      
        1 atggagaact tccaaaaggt ggaaaagatc ggagagggca cgtacggagt tgtgtacaaa
       61 gccagaaaca agttgacggg agaggtggtg gcgcttaaga aaatccgcct ggacactgag
      121 actgagggtg tgcccagtac tgccatccga gagatctctc tgcttaagga gcttaaccat
      181 cctaatattg tcaagctgct ggatgtcatt cacacagaaa ataaactcta cctggttttt
      241 gaatttctgc accaagatct caagaaattc atggatgcct ctgctctcac tggcattcct
      301 cttcccctca tcaagagcta tctgttccag ctgctccagg gcctagcttt ctgccattct
      361 catcgggtcc tccaccgaga ccttaaacct cagaatctgc ttattaacac agagggggcc
      421 atcaagctag cagactttgg actagccaga gcttttggag tccctgttcg tacttacacc
      481 catgaggtgg tgaccctgtg gtaccgagct cctgaaatcc tcctgggctc gaaatattat
      541 tccacagctg tggacatctg gagcctgggc tgcatctttg ctgagatggt gactcgccgg
      601 gccctgttcc ctggagattc tgagattgac cagctcttcc ggatctttcg gactctgggg
      661 accccagatg aggtggtgtg gccaggagtt acttctatgc ctgattacaa gccaagtttc
      721 cccaagtggg cccggcaaga ttttagtaaa gttgtacctc ccctggatga agatggacgg
      781 agcttgttat cgcaaatgct gcactacgac cctaacaagc ggatttcggc caaggcagcc
      841 ctggctcacc ctttcttcca ggatgtgacc aagccagtac cccatcttcg actctgatag
      901 ccttcttgaa gcccccgacc ctaatcggct caccctctcc tccagtgtgg gcttgaccag
      961 cttggccttg ggctatttgg actcaggtgg gccctctgaa cttgccttaa acactcacct
     1021 tctagtctta accagccaac tctgggaata caggggtgaa aggggggaac cagtgaaaat
     1081 gaaaggaagt ttcagtatta gatgcactta agttagcctc caccaccctt tcccccttct
     1141 cttagttatt gctgaagagg gttggtataa aaataatttt aaaaaagcct tcctacacgt
     1201 tagatttgcc gtaccaatct ctgaatgccc cataattatt atttccagtg tttgggatga
     1261 ccaggatccc aagcctcctg ctgccacaat gtttataaag gccaaatgat agcgggggct
     1321 aagttggtgc ttttgagaat taagtaaaac aaaaccactg ggaggagtct attttaaaga
     1381 attcggttaa aaaatagatc caatcagttt ataccctagt tagtgttttc ctcacctaat
     1441 aggctgggag actgaagact cagcccgggt gggggt

//
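A common first task with a GenBank flat file is to recover the raw sequence and confirm that it matches the length stated on the LOCUS line. The following Python sketch shows one way to do that. The function names genbank_sequence and locus_length are hypothetical, and the sketch assumes the layout seen in Example 2-1: a LOCUS line carrying "N bp", and sequence lines between ORIGIN and the terminating "//".

```python
import re


def genbank_sequence(entry: str) -> str:
    """Return the sequence found between the ORIGIN line and the // terminator."""
    in_origin = False
    pieces = []
    for line in entry.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            pieces.append(re.sub(r"[\d\s]", "", line))
    return "".join(pieces)


def locus_length(entry: str) -> int:
    """Return the length in bp stated on the LOCUS line."""
    for line in entry.splitlines():
        if line.startswith("LOCUS"):
            match = re.search(r"(\d+)\s+bp", line)
            if match:
                return int(match.group(1))
    raise ValueError("no LOCUS line with a length in bp")
```

For a well-formed entry, len(genbank_sequence(entry)) should equal locus_length(entry); a mismatch suggests a truncated or damaged file.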
DDBJ Example Flat File
Example 2-2 contains a sample sequence entry from DDBJ. This entry contains terms from the DDBJ Field Definitions and the DDBJ/EMBL/GenBank Feature Table, discussed later in this chapter.
Example 2-2. Sample DDBJ entry
LOCUS       HSCDK2MR                1476 bp    RNA     linear   HUM 15-JAN-1992
DEFINITION  H.sapiens CDK2 mRNA.
ACCESSION   X61622
VERSION     X61622.1
KEYWORDS    CDK2 gene; cell cycle regulation protein; cyclin A binding; protein 
            kinase. 
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE   1  (bases 1 to 1476)
  AUTHORS   Elledge,S.J. and Spottswood,M.R. 
  TITLE     A new human p34 protein kinase, CDK2, identified by complementation 
            of a cdc28 mutation in Saccharomyces cerevisiae, is a homolog of 
            Xenopus Eg1 
  JOURNAL   EMBO J. 10, 2653-2659(1991). 
  MEDLINE   91330891
REFERENCE   2  (bases 1 to 1476)
  AUTHORS   Elledge,S.J. 
  JOURNAL   Submitted (28-NOV-1991) to the EMBL/GenBank/DDBJ databases. S.J. 
            Elledge, Dept. of Biochemistry, Baylor College of Medicine, 1 Baylor
            Place, Houston, TX 77030, USA 
FEATURES             Location/Qualifiers
     source          1..1476
                     /db_xref="taxon:9606"
                     /organism="Homo sapiens"
                     /cell_line="EBV transformed Human peripheral lymphocyte
                     (B-cell)"
                     /clone_lib="lambda YES-R cDNA library"
                     /clone="pSE1000"
     CDS             1..897
                     /db_xref="SWISS-PROT:P24941"
                     /note="cell division kinase. CDC2 homolog"
                     /gene="CDK2"
                     /function="protein kinase"
                     /protein_id="CAA43807.1"
                     /translation="MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVP
                     STAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLI
                     KSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEV
                     VTLWYRAPEILLGSKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTP
                     DEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAAL
                     AHPFFQDVTKPVPHLRL"
BASE COUNT      368 a    372 c    351 g    385 t
ORIGIN
        1 atggagaact tccaaaaggt ggaaaagatc ggagagggca cgtacggagt tgtgtacaaa
       61 gccagaaaca agttgacggg agaggtggtg gcgcttaaga aaatccgcct ggacactgag
      121 actgagggtg tgcccagtac tgccatccga gagatctctc tgcttaagga gcttaaccat
      181 cctaatattg tcaagctgct ggatgtcatt cacacagaaa ataaactcta cctggttttt
      241 gaatttctgc accaagatct caagaaattc atggatgcct ctgctctcac tggcattcct
      301 cttcccctca tcaagagcta tctgttccag ctgctccagg gcctagcttt ctgccattct
      361 catcgggtcc tccaccgaga ccttaaacct cagaatctgc ttattaacac agagggggcc
      421 atcaagctag cagactttgg actagccaga gcttttggag tccctgttcg tacttacacc
      481 catgaggtgg tgaccctgtg gtaccgagct cctgaaatcc tcctgggctc gaaatattat
      541 tccacagctg tggacatctg gagcctgggc tgcatctttg ctgagatggt gactcgccgg
      601 gccctgttcc ctggagattc tgagattgac cagctcttcc ggatctttcg gactctgggg
      661 accccagatg aggtggtgtg gccaggagtt acttctatgc ctgattacaa gccaagtttc
      721 cccaagtggg cccggcaaga ttttagtaaa gttgtacctc ccctggatga agatggacgg
      781 agcttgttat cgcaaatgct gcactacgac cctaacaagc ggatttcggc caaggcagcc
      841 ctggctcacc ctttcttcca ggatgtgacc aagccagtac cccatcttcg actctgatag
      901 ccttcttgaa gcccccgacc ctaatcggct caccctctcc tccagtgtgg gcttgaccag
      961 cttggccttg ggctatttgg actcaggtgg gccctctgaa cttgccttaa acactcacct
     1021 tctagtctta accagccaac tctgggaata caggggtgaa aggggggaac cagtgaaaat
     1081 gaaaggaagt ttcagtatta gatgcactta agttagcctc caccaccctt tcccccttct
     1141 cttagttatt gctgaagagg gttggtataa aaataatttt aaaaaagcct tcctacacgt
     1201 tagatttgcc gtaccaatct ctgaatgccc cataattatt atttccagtg tttgggatga
     1261 ccaggatccc aagcctcctg ctgccacaat gtttataaag gccaaatgat agcgggggct
     1321 aagttggtgc ttttgagaat taagtaaaac aaaaccactg ggaggagtct attttaaaga
     1381 attcggttaa aaaatagatc caatcagttt ataccctagt tagtgttttc ctcacctaat
     1441 aggctgggag actgaagact cagcccgggt gggggt                          
//
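The BASE COUNT line gives you a second consistency check: in the example above the four counts (368, 372, 351, and 385) sum to 1476, the length on the LOCUS line. The sketch below compares the stated counts with counts taken from the sequence itself. The function names are hypothetical, and the code assumes the BASE COUNT and ORIGIN layout shown in Examples 2-1 and 2-2.

```python
import re
from collections import Counter
from typing import Dict


def stated_base_count(entry: str) -> Dict[str, int]:
    """Read the per-base counts from the BASE COUNT line."""
    for line in entry.splitlines():
        if line.startswith("BASE COUNT"):
            pairs = re.findall(r"(\d+)\s+([a-z])", line)
            return {base: int(number) for number, base in pairs}
    raise ValueError("no BASE COUNT line")


def observed_base_count(entry: str) -> Dict[str, int]:
    """Count the bases that actually appear between ORIGIN and //."""
    counts: Counter = Counter()
    in_origin = False
    for line in entry.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            counts.update(ch for ch in line if ch.isalpha())
    return dict(counts)
```

If stated_base_count(entry) and observed_base_count(entry) differ, the sequence block and its summary line are out of step.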
GenBank/DDBJ Field Definitions
The field terms found in GenBank/DDBJ sequence flat files help organize the information for human readability and machine parsing. A sequence flat file contains several field terms, and the two repositories share the same field definitions. Table 2-1 summarizes each of the field definitions.
Table 2-1: GenBank/DDBJ field definitions
Field         Description
LOCUS         A short mnemonic name for the entry, chosen to suggest the sequence's definition. Mandatory keyword/exactly one record.
DEFINITION    A concise description of the sequence. Mandatory keyword/one or more records.
ACCESSION     The primary accession number is a unique, unchanging code assigned to each entry. Mandatory keyword/one or more records.
VERSION       A compound identifier consisting of the primary accession number and a numeric version number associated with the current version of the sequence data in the record. This is followed by an integer key (a "GI") assigned to the sequence by NCBI. Mandatory keyword/exactly one record.
NID           An alternative method of presenting the NCBI GI identifier (described above). The NID is obsolete and was removed from the GenBank flat file format in December 1999.
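The compound structure of the VERSION field is easy to take apart in code. The following Python sketch is illustrative only: the Version type and parse_version_line function are hypothetical names. It treats the GI as optional because the DDBJ entry in Example 2-2 shows a VERSION line without one.

```python
from typing import NamedTuple, Optional


class Version(NamedTuple):
    accession: str      # primary accession number, e.g. X61622
    version: int        # numeric version of the sequence data
    gi: Optional[int]   # NCBI GI number, or None when the line has none


def parse_version_line(line: str) -> Version:
    """Parse a line such as 'VERSION     X61622.1  GI:29848'."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "VERSION":
        raise ValueError("not a VERSION line")
    # The accession and version number are joined by the last dot.
    accession, _, number = fields[1].rpartition(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return Version(accession, int(number), gi)
```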
EMBL Example Flat File
Example 2-3 contains a sample sequence entry from EMBL. This entry contains terms from the EMBL Field Definitions and the DDBJ/EMBL/GenBank Feature Table, discussed later in this chapter.
Example 2-3. Sample EMBL entry
ID   HSCDK2MR   standard; RNA; HUM; 1476 BP.
XX
AC   X61622;
XX
SV   X61622.1
XX
DT   15-JAN-1992 (Rel. 30, Created)
DT   15-JAN-1992 (Rel. 30, Last updated, Version 1)
XX
DE   H.sapiens CDK2 mRNA
XX
KW   CDK2 gene; cell cycle regulation protein; cyclin A binding; protein kinase.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-1476
RX   MEDLINE; 91330891.
RA   Elledge S.J., Spottswood M.R.;
RT   "A new human p34 protein kinase, CDK2, identified by complementation of a
RT   cdc28 mutation in Saccharomyces cerevisiae, is a homolog of Xenopus Eg1";
RL   EMBO J. 10:2653-2659(1991).
XX
RN   [2]
RP   1-1476
RA   Elledge S.J.;
RT   ;
RL   Submitted (28-NOV-1991) to the EMBL/GenBank/DDBJ databases.
RL   S.J. Elledge, Dept. of Biochemistry, Baylor College of Medicine, 1 Baylor
RL   Place, Houston, TX 77030, USA
XX
DR   GDB; 128984; CDK2.
DR   SWISS-PROT; P24941; CDK2_HUMAN.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1476
FT                   /db_xref="taxon:9606"
FT                   /organism="Homo sapiens"
FT                   /cell_line="EBV transformed Human peripheral lymphocyte
FT                   (B-cell)"
FT                   /clone_lib="lambda YES-R cDNA library"
FT                   /clone="pSE1000"
FT   CDS             1..897
FT                   /db_xref="SWISS-PROT:P24941"
FT                   /note="cell division kinase. CDC2 homolog"
FT                   /gene="CDK2"
FT                   /function="protein kinase"
FT                   /protein_id="CAA43807.1"
FT                   /translation="MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVP
FT                   STAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLI
FT                   KSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEV
FT                   VTLWYRAPEILLGSKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTP
FT                   DEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAAL
FT                   AHPFFQDVTKPVPHLRL"
XX
SQ   Sequence 1476 BP; 368 A; 372 C; 351 G; 385 T; 0 other;
     atggagaact tccaaaaggt ggaaaagatc ggagagggca cgtacggagt tgtgtacaaa        60
     gccagaaaca agttgacggg agaggtggtg gcgcttaaga aaatccgcct ggacactgag       120
     actgagggtg tgcccagtac tgccatccga gagatctctc tgcttaagga gcttaaccat       180
     cctaatattg tcaagctgct ggatgtcatt cacacagaaa ataaactcta cctggttttt       240
     gaatttctgc accaagatct caagaaattc atggatgcct ctgctctcac tggcattcct       300
     cttcccctca tcaagagcta tctgttccag ctgctccagg gcctagcttt ctgccattct       360
     catcgggtcc tccaccgaga ccttaaacct cagaatctgc ttattaacac agagggggcc       420
     atcaagctag cagactttgg actagccaga gcttttggag tccctgttcg tacttacacc       480
     catgaggtgg tgaccctgtg gtaccgagct cctgaaatcc tcctgggctc gaaatattat       540
     tccacagctg tggacatctg gagcctgggc tgcatctttg ctgagatggt gactcgccgg       600
     gccctgttcc ctggagattc tgagattgac cagctcttcc ggatctttcg gactctgggg       660
     accccagatg aggtggtgtg gccaggagtt acttctatgc ctgattacaa gccaagtttc       720
     cccaagtggg cccggcaaga ttttagtaaa gttgtacctc ccctggatga agatggacgg       780
     agcttgttat cgcaaatgct gcactacgac cctaacaagc ggatttcggc caaggcagcc       840
     ctggctcacc ctttcttcca ggatgtgacc aagccagtac cccatcttcg actctgatag       900
     ccttcttgaa gcccccgacc ctaatcggct caccctctcc tccagtgtgg gcttgaccag       960
     cttggccttg ggctatttgg actcaggtgg gccctctgaa cttgccttaa acactcacct      1020
     tctagtctta accagccaac tctgggaata caggggtgaa aggggggaac cagtgaaaat      1080
     gaaaggaagt ttcagtatta gatgcactta agttagcctc caccaccctt tcccccttct      1140
     cttagttatt gctgaagagg gttggtataa aaataatttt aaaaaagcct tcctacacgt      1200
     tagatttgcc gtaccaatct ctgaatgccc cataattatt atttccagtg tttgggatga      1260
     ccaggatccc aagcctcctg ctgccacaat gtttataaag gccaaatgat agcgggggct      1320
     aagttggtgc ttttgagaat taagtaaaac aaaaccactg ggaggagtct attttaaaga      1380
     attcggttaa aaaatagatc caatcagttt ataccctagt tagtgttttc ctcacctaat      1440
     aggctgggag actgaagact cagcccgggt gggggt                                1476
//
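Unlike the GenBank layout, an EMBL entry puts the position numbers at the end of each sequence line. A small Python sketch for pulling out the sequence follows. The name embl_sequence is hypothetical, and the code assumes the layout shown in Example 2-3: sequence lines sit between the SQ line and the terminating "//".

```python
def embl_sequence(entry: str) -> str:
    """Return the sequence found between the SQ line and the // terminator."""
    in_sequence = False
    pieces = []
    for line in entry.splitlines():
        if line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Keep only the letters; this drops spaces and the trailing position.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)
```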
EMBL Field Definitions
The field codes found in EMBL sequence flat files are used to help organize the information for human readability and machine-based parsing. There are several field codes found in an EMBL sequence flat file, and they are designated with a two-letter abbreviation. Table 2-2 summarizes the content of each field code.
Table 2-2: EMBL field definitions
Line code    Content
ID           Identification
AC           Accession number(s)
SV           New sequence identifier
DT           Date
DE           Description
KW           Keyword
OS           Organism species
OC           Organism classification
OG           Organelle
RN           Reference number
RC           Reference comment(s)
RP
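Because every annotation line in an EMBL entry begins with a two-letter code, grouping lines by code is a natural first parsing step. The sketch below is one hedged way to do it. The name group_by_line_code is hypothetical, and the code assumes the layout of Example 2-3: the code occupies the first two columns, the content starts at the sixth column, and XX lines serve only as separators.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_line_code(entry: str) -> Dict[str, List[str]]:
    """Collect the content of each line under its two-letter line code."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in entry.splitlines():
        if line.startswith("//"):
            break  # end of entry
        code, content = line[:2], line[5:]
        # Sequence lines start with spaces and XX lines are separators; skip both.
        if code.strip() and code != "XX":
            groups[code].append(content.rstrip())
    return dict(groups)
```

Fields that span several lines, such as OC, come back as a list with one item per line, so you can join them as needed.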
DDBJ/EMBL/GenBank Feature Table
In February 1986, GenBank and EMBL (joined by DDBJ in 1987) started a collaborative effort to create a common feature table format. The overall objective of the feature table was to supply an in-depth vocabulary for describing nucleotide (and protein) features. We're using Version 4 of the feature table.
A feature is a single word or abbreviation indicating a functional role or region associated with a sequence. A list of DDBJ/EMBL/GenBank features is presented in Table 2-3. In the Definition column of the table, the appropriate qualifiers for each feature are in brackets. Mandatory qualifiers are highlighted in bold.
Table 2-3: DDBJ/EMBL/GenBank feature key table
Feature Key    Definition
attenuator     1) region of DNA at which regulation of termination of transcription occurs, which controls the expression of some bacterial operons.
               2) sequence segment located between the promoter and the first structural gene that causes partial termination of transcription.
               [citation, db_xref, evidence, gene, label, map, note, phenotype, usedin]
C_region       Constant region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; includes one or more exons depending on the particular chain.
               [citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin]
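To show how feature keys, locations, and qualifiers fit together in a flat file, here is a rough Python sketch that reads GenBank-style feature lines like those in Example 2-1. The name parse_features is hypothetical. The code assumes the column layout seen in that example (feature key within the first 21 columns, location and qualifiers from column 22), and it does not handle locations that wrap onto a second line. For EMBL entries you would first replace the leading "FT" with spaces.

```python
from typing import Dict, List, Optional, Tuple

Feature = Tuple[str, str, Dict[str, List[str]]]


def parse_features(lines: List[str]) -> List[Feature]:
    """Parse feature lines into (key, location, {qualifier: [values]}) tuples."""
    features: List[Feature] = []
    current: Optional[str] = None  # qualifier whose value may continue
    for line in lines:
        key, rest = line[:21].strip(), line[21:].strip()
        if key:
            # A feature key in the left column starts a new feature.
            features.append((key, rest, {}))
            current = None
        elif not features or not rest:
            continue
        elif rest.startswith("/"):
            name, _, value = rest[1:].partition("=")
            features[-1][2].setdefault(name, []).append(value.strip('"'))
            current = name
        elif current is not None:
            # Continuation of a wrapped value; translations join without spaces.
            sep = "" if current == "translation" else " "
            values = features[-1][2][current]
            values[-1] = (values[-1] + sep + rest).rstrip('"')
    return features
```

Each qualifier maps to a list because a feature can carry the same qualifier more than once, as the CDS feature in Example 2-1 does with db_xref.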
References
  • Tateno, Y., T. Imanishi, S. Miyazaki, K. Fukami-Kobayashi, N. Saitou, H. Sugawara, and T. Gojobori. 2002. DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Research 30 (1):27-30.
    Main site
    http://www.ddbj.nig.ac.jp/
    Release notes
    http://www.ddbj.nig.ac.jp/ddbjnew/ddbj_relnote.html
    Download
    ftp://ftp.ddbj.nig.ac.jp/database/ddbj/
  • Stoesser, G., W. Baker, A. van den Broek, E. Camon, M. Garcia-Pastor, C. Kanz, T. Kulikova, R. Leinonen, Q. Lin, V. Lombard, R. Lopez, N. Redaschi, P. Stoehr, M. A. Tuli, K. Tzouvara, and R. Vaughan. 2002. The EMBL Nucleotide Sequence Database. Nucleic Acids Research 30 (1):21-26.
    Main page
Chapter 3: SWISS-PROT
SWISS-PROT is an annotated protein sequence database that was started in 1986. It is currently overseen by the Swiss Institute of Bioinformatics (SIB) in association with the European Bioinformatics Institute (EBI). SWISS-PROT is the preferred protein sequence database for most bioinformaticians because many of the sequence annotations are curated by scientists. TrEMBL, another sequence database, is a computer-annotated supplement that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT. It has essentially the same sequence flat file format as SWISS-PROT. We're using SWISS-PROT Release 40.
SWISS-PROT Example Flat File
Example 3-1 contains a sequence entry from SWISS-PROT. This entry contains terms from the SWISS-PROT Field Definitions and Feature Table types, discussed later in this chapter.
Example 3-1. Sample SWISS-PROT sequence entry
ID   CDK2_HUMAN     STANDARD;      PRT;   298 AA.
AC   P24941;
DT   01-MAR-1992 (Rel. 21, Created)
DT   01-AUG-1992 (Rel. 23, Last sequence update)
DT   15-JUN-2002 (Rel. 41, Last annotation update)
DE   Cell division protein kinase 2 (EC 2.7.1.-) (p33 protein kinase).
GN   CDK2.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=91330891; PubMed=1714386;
RA   Elledge S.J., Spottswood M.R.;
RT   "A new human p34 protein kinase, CDK2, identified by complementation
RT   of a cdc28 mutation in Saccharomyces cerevisiae, is a homolog of
RT   Xenopus Eg1.";
RL   EMBO J. 10:2653-2659(1991).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=91367262; PubMed=1653904;
RA   Tsai L.-H., Harlow E., Meyerson M.;
RT   "Isolation of the human cdk2 gene that encodes the cyclin A- and
RT   adenovirus E1A-associated p33 kinase.";
RL   Nature 353:174-177(1991).
RN   [3]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=92020980; PubMed=1717994;
RA   Ninomiya-Tsuji J., Nomoto S., Yasuda H., Reed S.I., Matsumoto K.;
RT   "Cloning of a human cDNA encoding a CDC2-related kinase by
RT   complementation of a budding yeast cdc28 mutation.";
RL   Proc. Natl. Acad. Sci. U.S.A. 88:9006-9010(1991).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Placenta;
RA   Strausberg R.;
RL   Submitted (FEB-2001) to the EMBL/GenBank/DDBJ databases.
RN   [5]
RP   PHOSPHORYLATION SITES.
RX   MEDLINE=93010995; PubMed=1396589;
RA   Gu Y., Rosenblatt J., O'Morgan D.O.;
RT   "Cell cycle regulation of CDK2 activity by phosphorylation of Thr160
RT   and Tyr15.";
RL   EMBO J. 11:3995-4005(1992).
RN   [6]
RP   X-RAY CRYSTALLOGRAPHY (2.4 ANGSTROMS).
RX   MEDLINE=93288132; PubMed=8510751;
RA   de Bondt H.L., Rosenblatt J., Jancarik J., Jones H.D.,
RA   Morgan D.O., Kim S.-H.;
RT   "Crystal structure of cyclin-dependent kinase 2.";
RL   Nature 363:595-602(1993).
RN   [7]
RP   X-RAY CRYSTALLOGRAPHY (2.3 ANGSTROMS) OF COMPLEX WITH CYCLIN A.
RX   MEDLINE=95356811; PubMed=7630397;
RA   Jeffrey P.D., Russo A.A., Polyak K., Gibbs E., Hurwitz J.,
RA   Massague J., Pavletich N.P.;
RT   "Mechanism of CDK activation revealed by the structure of a
RT   cyclinA-CDK2 complex.";
RL   Nature 376:313-320(1995).
RN   [8]
RP   X-RAY CRYSTALLOGRAPHY (2.33 ANGSTROMS) OF COMPLEX WITH L868276.
RX   MEDLINE=96181476; PubMed=8610110;
RA   de Azevedo W.F. Jr., Muleer-Dieckmann H.-J., Schulze-Gahmen U.,
RA   Worland P.J., Sausville E., Kim S.-H.;
RT   "Structural basis for specificity and potency of a flavonoid
RT   inhibitor of human CDK2, a cell cycle kinase.";
RL   Proc. Natl. Acad. Sci. U.S.A. 93:2735-2740(1996).
RN   [9]
RP   X-RAY CRYSTALLOGRAPHY (2.3 ANGSTROMS) OF COMPLEX WITH CG2A AND KIP1.
RX   MEDLINE=96300318; PubMed=8684460;
RA   Russo A.A., Jeffrey P.D., Patten A.K., Massague J., Pavletich N.P.;
RT   "Crystal structure of the p27Kip1 cyclin-dependent-kinase inhibitor
RT   bound to the cyclin A-Cdk2 complex.";
RL   Nature 382:325-331(1996).
RN   [10]
RP   X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS) OF COMPLEX WITH CG2A.
RX   MEDLINE=96313126; PubMed=8756328;
RA   Russo A.A., Jeffrey P.D., Pavletich N.P.;
RT   "Structural basis of cyclin-dependent kinase activation by
RT   phosphorylation.";
RL   Nat. Struct. Biol. 3:696-700(1996).
RN   [11]
RP   X-RAY CRYSTALLOGRAPHY (1.9 ANGSTROMS).
RX   MEDLINE=97075215; PubMed=8917641;
RA   Schulze-Gahmen U., de Bondt H.L., Kim S.-H.;
RT   "High-resolution crystal structures of human cyclin-dependent kinase
RT   2 with and without ATP: bound waters and natural ligand as guides for
RT   inhibitor design.";
RL   J. Med. Chem. 39:4540-4546(1996).
RN   [12]
RP   X-RAY CRYSTALLOGRAPHY (2.0 ANGSTROMS).
RX   MEDLINE=97475219; PubMed=9334743;
RA   Lawrie A.M., Noble M.E.M., Tunnah P., Brown N.R., Johnson L.N.,
RA   Endicott J.A.;
RT   "Protein kinase inhibition by staurosporine revealed in details of
RT   the molecular interaction with CDK2.";
RL   Nat. Struct. Biol. 4:796-801(1997).
RN   [13]
RP   X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS) OF COMPLEX WITH CKS1.
RX   MEDLINE=96182647; PubMed=8601310;
RA   Bourne Y., Watson M.H., Hickey M.J., Holmes W., Rocque W., Reed S.I.,
RA   Tainer J.A.;
RT   "Crystal structure and mutational analysis of the human CDK2 kinase
RT   complex with cell cycle-regulatory protein CksHs1.";
RL   Cell 84:863-874(1996).
RN   [14]
RP   X-RAY CRYSTALLOGRAPHY (2.05 ANGSTROMS).
RX   MEDLINE=98342369; PubMed=9677190;
RA   Gray N.S., Wodicka L., Thunnissen A.-M.W.H., Norman T.C., Kwon S.,
RA   Espinoza F.H., Morgan D.O., Barnes G., Leclerc S., Meijer L.,
RA   Kim S.H., Lockhart D.J., Schultz P.G.;
RT   "Exploiting chemical libraries, structure, and genomics in the search
RT   for kinase inhibitors.";
RL   Science 281:533-538(1998).
CC   -!- FUNCTION: PROBABLY INVOLVED IN THE CONTROL OF THE CELL CYCLE.
CC       INTERACTS WITH CYCLINS A, D, OR E. ACTIVITY OF CDK2 IS MAXIMAL
CC       DURING S PHASE AND G2.
CC   -!- ENZYME REGULATION: PHOSPHORYLATION AT THR-14 OR TYR-15 INACTIVATES
CC       THE ENZYME, WHILE PHOSPHORYLATION AT THR-160 ACTIVATES IT.
CC   -!- SIMILARITY: BELONGS TO THE SER/THR FAMILY OF PROTEIN KINASES.
CC       CDC2/CDKX SUBFAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
CC   the European Bioinformatics Institute.  There are no  restrictions on  its
CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
CC   modified and this statement is not removed.  Usage  by  and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; X61622; CAA43807.1; -.
DR   EMBL; X62071; CAA43985.1; -.
DR   EMBL; M68520; AAA35667.1; -.
DR   EMBL; BC003065; AAH03065.1; -.
DR   PIR; A41227; A41227.
DR   PIR; S16520; S16520.
DR   PIR; S17873; S17873.
DR   PDB; 1FIN; 27-JAN-97.
DR   PDB; 1HCK; 07-DEC-96.
DR   PDB; 1HCL; 07-DEC-96.
DR   PDB; 1AQ1; 12-NOV-97.
DR   PDB; 1JST; 11-JAN-97.
DR   PDB; 1JSU; 29-JUL-97.
DR   PDB; 1BUH; 09-SEP-98.
DR   PDB; 1B38; 23-DEC-98.
DR   PDB; 1B39; 23-DEC-98.
DR   PDB; 1CKP; 13-JAN-99.
DR   Genew; HGNC:1771; CDK2.
DR   MIM; 116953; -.
DR   InterPro; IPR000719; Euk_pkinase.
DR   InterPro; IPR002290; Ser_thr_pkinase.
DR   Pfam; PF00069; pkinase; 1.
DR   ProDom; PD000001; Euk_pkinase; 1.
DR   SMART; SM00220; S_TKc; 1.
DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
DR   PROSITE; PS00108; PROTEIN_KINASE_ST; 1.
DR   PROSITE; PS50011; PROTEIN_KINASE_DOM; 1.
KW   Transferase; Serine/threonine-protein kinase; ATP-binding;
KW   Cell cycle; Cell division; Mitosis; Phosphorylation; 3D-structure.
FT   DOMAIN        4    286       PROTEIN KINASE.
FT   NP_BIND      10     18       ATP (BY SIMILARITY).
FT   BINDING      33     33       ATP (BY SIMILARITY).
FT   ACT_SITE    127    127       BY SIMILARITY.
FT   MOD_RES      14     14       PHOSPHORYLATION.
FT   MOD_RES      15     15       PHOSPHORYLATION.
FT   MOD_RES     160    160       PHOSPHORYLATION (BY CAK).
FT   MUTAGEN      14     14       T->A: INCREASE ACTIVITY 2 FOLD.
FT   MUTAGEN      15     15       Y->F: INCREASE ACTIVITY 2 FOLD.
FT   MUTAGEN     160    160       T->A: ABOLISHES ACTIVITY.
FT   TURN          2      3
FT   STRAND        4     12
FT   STRAND       17     23
FT   TURN         24     26
FT   STRAND       29     35
FT   HELIX        46     55
FT   TURN         56     57
FT   TURN         61     62
FT   STRAND       63     63
FT   STRAND       66     72
FT   TURN         73     74
FT   STRAND       75     81
FT   STRAND       85     86
FT   HELIX        87     93
FT   TURN         94     97
FT   HELIX       101    120
FT   TURN        121    122
FT   HELIX       130    132
FT   STRAND      133    135
FT   TURN        137    138
FT   STRAND      141    143
FT   TURN        146    147
FT   HELIX       148    151
FT   STRAND      157    157
FT   TURN        159    160
FT   STRAND      163    163
FT   TURN        167    168
FT   HELIX       171    174
FT   TURN        175    176
FT   TURN        182    182
FT   HELIX       183    198
FT   HELIX       208    219
FT   TURN        224    226
FT   TURN        228    229
FT   HELIX       230    232
FT   TURN        234    235
FT   TURN        238    239
FT   HELIX       248    251
FT   TURN        253    254
FT   HELIX       257    266
FT   TURN        267    267
FT   TURN        271    273
FT   HELIX       277    280
FT   TURN        281    282
FT   HELIX       284    286
FT   TURN        287    288
SQ   SEQUENCE   298 AA;  33929 MW;  F90A0F4E70910B51 CRC64;
     MENFQKVEKI GEGTYGVVYK ARNKLTGEVV ALKKIRLDTE TEGVPSTAIR EISLLKELNH
     PNIVKLLDVI HTENKLYLVF EFLHQDLKKF MDASALTGIP LPLIKSYLFQ LLQGLAFCHS
     HRVLHRDLKP QNLLINTEGA IKLADFGLAR AFGVPVRTYT HEVVTLWYRA PEILLGCKYY
     STAVDIWSLG CIFAEMVTRR ALFPGDSEID QLFRIFRTLG TPDEVVWPGV TSMPDYKPSF
     PKWARQDFSK VVPPLDEDGR SLLSQMLHYD PNKRISAKAA LAHPFFQDVT KPVPHLRL

//
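As a quick consistency check on an entry like Example 3-1, the sequence block can be read back and compared with the length declared on the SQ line. The following is a minimal Python sketch, not an official SWISS-PROT parser; the function name read_sequence and the variable CDK2_TAIL are our own, and the sketch assumes that every line between the SQ line and the terminating // holds only residues and spaces, as in the example.

```python
import re


def read_sequence(entry_text):
    """Return (declared_length, sequence) for one flat-file entry.

    Assumes the layout seen in Example 3-1: an SQ header line, then
    residue lines in space-separated blocks, then a closing // line.
    """
    declared = None
    residues = []
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("SQ"):
            # Header looks like: SQ   SEQUENCE   298 AA;  33929 MW;  ... CRC64;
            match = re.search(r"SEQUENCE\s+(\d+)\s+AA", line)
            declared = int(match.group(1)) if match else None
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Drop the leading indent and the spaces between blocks.
            residues.append("".join(line.split()))
    return declared, "".join(residues)


# The ID line and sequence block of Example 3-1, copied verbatim.
CDK2_TAIL = """\
ID   CDK2_HUMAN     STANDARD;      PRT;   298 AA.
SQ   SEQUENCE   298 AA;  33929 MW;  F90A0F4E70910B51 CRC64;
     MENFQKVEKI GEGTYGVVYK ARNKLTGEVV ALKKIRLDTE TEGVPSTAIR EISLLKELNH
     PNIVKLLDVI HTENKLYLVF EFLHQDLKKF MDASALTGIP LPLIKSYLFQ LLQGLAFCHS
     HRVLHRDLKP QNLLINTEGA IKLADFGLAR AFGVPVRTYT HEVVTLWYRA PEILLGCKYY
     STAVDIWSLG CIFAEMVTRR ALFPGDSEID QLFRIFRTLG TPDEVVWPGV TSMPDYKPSF
     PKWARQDFSK VVPPLDEDGR SLLSQMLHYD PNKRISAKAA LAHPFFQDVT KPVPHLRL

//
"""

declared_length, sequence = read_sequence(CDK2_TAIL)
print(declared_length, len(sequence))
```

If the two printed numbers differ, the entry was probably truncated or altered when it was copied.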
SWISS-PROT Field Definitions
The field codes in a SWISS-PROT (or TrEMBL) sequence flat file organize the information for both human readability and machine-based parsing. Each field code is a two-letter abbreviation. Table 3-1 summarizes the contents of each field code.
Table 3-1: SWISS-PROT field definitions
Line code   Content
ID          Identification
AC          Accession number(s)
DT          Date
DE          Description
GN          Gene name(s)
OS          Organism species
OG          Organelle
OC          Organism classification
OX          Taxonomy cross-reference(s)
RN          Reference number
RP          Reference position
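Because every annotation line starts with one of these two-letter codes, a first pass at machine parsing can simply group lines by code. The sketch below is one possible approach in Python, not part of the SWISS-PROT distribution; group_by_line_code and SAMPLE_LINES are hypothetical names, and the sketch assumes the layout seen in Example 3-1, where the code occupies the first two columns and the data starts in column 6.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Map each two-letter line code to the list of its data strings."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        code, data = line[:2], line[5:]
        # Sequence lines and blank lines carry no code; skip them.
        if code.strip():
            fields[code].append(data.rstrip())
    return dict(fields)


# A few lines taken from Example 3-1.
SAMPLE_LINES = """\
ID   CDK2_HUMAN     STANDARD;      PRT;   298 AA.
AC   P24941;
DE   Cell division protein kinase 2 (EC 2.7.1.-) (p33 protein kinase).
GN   CDK2.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
//
"""

fields = group_by_line_code(SAMPLE_LINES)
print(fields["AC"])
print(" ".join(fields["OC"]))
```

Keeping a list per code preserves line order, so multi-line fields such as OC can be joined back together afterward.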
SWISS-PROT Feature Table
A feature is a single word or abbreviation indicating a functional role or region associated with a sequence. The SWISS-PROT features are listed below, organized by feature type. Each is followed by an example showing how it describes a sequence location or region.
CONFLICT
Different papers report differing sequences:
FT   CONFLICT    304    304       MISSING (IN REF. 3).
MUTAGEN
Indicates an experimentally altered site:
FT   MUTAGEN      65     65       H->F: 100% ACTIVITY LOSS.
VARIANT
Authors report that sequence variants exist:
FT   VARIANT     136    136       M -> I.
VARSPLIC
Describes sequence variants produced by alternative splicing:
FT   VARSPLIC     33     49       MISSING (IN SHORT ISOFORM).
BINDING
Binding site for chemical group (co-enzyme, prosthetic group, etc.):
FT   BINDING      14     14       HEME (COVALENT).
CARBOHYD
Glycosylation site:
FT   CARBOHYD     53     53       N-LINKED (GLCNAC...) (POTENTIAL).
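The FT examples above share one shape: a feature key, a start position, an end position, and an optional description. As an illustration only, the Python sketch below splits such a line on whitespace. The names Feature and parse_ft_line are our own. The sketch assumes both positions are plain integers, as in the examples shown here, and it does not handle descriptions that continue onto a following FT line.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str
    start: int
    end: int
    description: str


def parse_ft_line(line):
    """Split one FT line into key, start, end, and description."""
    # At most four splits, so spaces inside the description survive.
    parts = line.split(None, 4)
    if len(parts) < 4 or parts[0] != "FT":
        raise ValueError(f"not a feature line: {line!r}")
    description = parts[4].strip() if len(parts) == 5 else ""
    return Feature(parts[1], int(parts[2]), int(parts[3]), description)


feature = parse_ft_line("FT   MUTAGEN      65     65       H->F: 100% ACTIVITY LOSS.")
print(feature.key, feature.start, feature.end)
print(feature.description)
```

Splitting on whitespace is a simplification chosen for brevity; a parser meant for whole releases would more likely follow the column positions given in the SWISS-PROT user manual.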
References
  • Bairoch, A., and R. Apweiler. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research 28:45-48.
    Main page
    http://us.expasy.org/sprot/
    Release notes
    http://us.expasy.org/sprot/relnotes/
    User manual
    http://us.expasy.org/sprot/userman.html
    Download
    ftp://us.expasy.org/databases/swiss-prot
Chapter 4: Pfam
While many databases are dedicated to organizing protein families and protein domains, Pfam is our preferred database for predicting the function of newly discovered proteins. Pfam is unique in that it is a manually curated database of protein families derived from protein multiple sequence alignments and profile hidden Markov models. Pfam is a key database for understanding protein function and structure. It is used in many methods, including phylogenetic analysis, secondary structure prediction, and sequence annotation. We're using Pfam Release 7.8.
Example 4-1 shows a Pfam flat file. This entry contains terms from the Pfam Field Definitions, discussed later in this chapter.
Example 4-1. Sample Pfam entry
# STOCKHOLM 1.0
#=GF ID   14-3-3
#=GF AC   PF00244
#=GF DE   14-3-3 proteins
#=GF AU   Finn RD
#=GF AL   Clustalw
#=GF SE   Prosite
#=GF GA   25 25
#=GF TC   35.40 35.40
#=GF NC   19.10 19.10
#=GF BM   hmmbuild -f HMM SEED
#=GF BM   hmmcalibrate --seed 0 HMM
#=GF RN   [1]
#=GF RM   95327195
#=GF RT   Structure of a 14-3-3 protein and implications for
#=GF RT   coordination of multiple signalling pathways. 
#=GF RA   Xiao B, Smerdon SJ, Jones DH, Dodson GG, Soneji Y, Aitken
#=GF RA   A, Gamblin SJ; 
#=GF RL   Nature 1995;376:188-191.
#=GF RN   [2]
#=GF RM   95327196
#=GF RT   Crystal structure of the zeta isoform of the 14-3-3
#=GF RT   protein. 
#=GF RA   Liu D, Bienkowska J, Petosa C, Collier RJ, Fu H, Liddington
#=GF RA   R; 
#=GF RL   Nature 1995;376:191-194.
#=GF RN   [3]
#=GF RM   96182649
#=GF RT   Interaction of 14-3-3 with signaling proteins is mediated
#=GF RT   by the recognition of phosphoserine. 
#=GF RA   Muslin AJ, Tanner JW, Allen PM, Shaw AS; 
#=GF RL   Cell 1996;84:889-897.
#=GF RN   [4]
#=GF RM   97424374
#=GF RT   The 14-3-3 protein binds its target proteins with a common
#=GF RT   site located towards the C-terminus. 
#=GF RA   Ichimura T, Ito M, Itagaki C, Takahashi M, Horigome T,
#=GF RA   Omata S, Ohno S, Isobe T 
#=GF RL   FEBS Lett 1997;413:273-276.
#=GF RN   [5]
#=GF RM   96394689
#=GF RT   Molecular evolution of the 14-3-3 protein family. 
#=GF RA   Wang W, Shakes DC 
#=GF RL   J Mol Evol 1996;43:384-398.
#=GF RN   [6]
#=GF RM   96300316
#=GF RT   Function of 14-3-3 proteins. 
#=GF RA   Jin DY, Lyu MS, Kozak CA, Jeang KT 
#=GF RL   Nature 1996;382:308-308.
#=GF DR   PROSITE; PDOC00633;
#=GF DR   SMART; 14_3_3;
#=GF DR   PRINTS; PR00305;
#=GF DR   SCOP; 1a4o; fa;
#=GF DR   PDB; 1a37 A; 3; 228;
#=GF DR   PDB; 1a37 B; 3; 228;
#=GF DR   PDB; 1a38 A; 3; 228;
#=GF DR   PDB; 1a38 B; 3; 228;
#=GF DR   PDB; 1a4o A; 3; 228;
#=GF DR   PDB; 1a4o B; 3; 228;
#=GF DR   PDB; 1a4o C; 3; 228;
#=GF DR   PDB; 1a4o D; 3; 228;
#=GF DR   PDB; 1qja B; 3; 229;
#=GF DR   PDB; 1qja A; 3; 230;
#=GF DR   PDB; 1qjb A; 3; 232;
#=GF DR   PDB; 1qjb B; 3; 232;
#=GF DR   INTERPRO; IPR000308;
#=GF SQ   148
#=GS O61131/11-251      AC O61131
<deleted for brevity>
#=GS 143Z_HUMAN/3-236 DR PDB; 1qjb B; 3; 232;
O61131/11-251                RSDCTYRSKLAEQAERYDEMADAMRTLVEQCVnn.......
dkdELTVEERNLLSVAYKNAVGARRASWRIISSVEQKEMSKA.NVHNKNIAATYRKKVEEELNNIC.QDILN.
LLTKKLIPNT..SESESKVFYYKMKGDYYRYISEFS.CDE.
GKKEASNFAQEAYQKATDIAENELPSTHPIRLGLALNYSVFFY..EILNQPHQACEMAKRAF...DDAIT