Mastering Python for Bioinformatics

Book description

Life scientists today urgently need training in bioinformatics skills. Too many bioinformatics programs are poorly written and barely maintained, often because they were created by students and researchers who never learned basic programming skills. This practical guide shows postdoc bioinformatics professionals and students how to exploit the best parts of Python to solve problems in biology while creating documented, tested, reproducible software.

Ken Youens-Clark, author of Tiny Python Projects (Manning), demonstrates not only how to write effective Python code but also how to use tests to write and refactor scientific programs. You'll learn the latest Python features and tools including linters, formatters, type checkers, and tests to create documented and tested programs. You'll also tackle 14 challenges in Rosalind, a problem-solving platform for learning bioinformatics and programming.

  • Create command-line Python programs to document and validate parameters
  • Write tests to verify and refactor programs and confirm they're correct
  • Address bioinformatics ideas using Python data structures and modules such as Biopython
  • Create reproducible shortcuts and workflows using makefiles
  • Parse essential bioinformatics file formats such as FASTA and FASTQ
  • Find patterns of text using regular expressions
  • Use higher-order functions in Python like filter(), map(), and reduce()

Table of contents

  1. Preface
    1. Who Should Read This?
    2. Programming Style: Why I Avoid OOP and Exceptions
    3. Structure
    4. Test-Driven Development
    5. Using the Command Line and Installing Python
    6. Getting the Code and Tests
    7. Installing Modules
    8. Installing the new.py Program
    9. Why Did I Write This Book?
    10. Conventions Used in This Book
    11. Using Code Examples
    12. O’Reilly Online Learning
    13. How to Contact Us
    14. Acknowledgments
  2. I. The Rosalind.info Challenges
  3. 1. Tetranucleotide Frequency: Counting Things
    1. Getting Started
      1. Creating the Program Using new.py
      2. Using argparse
      3. Tools for Finding Errors in the Code
      4. Introducing Named Tuples
      5. Adding Types to Named Tuples
      6. Representing the Arguments with a NamedTuple
      7. Reading Input from the Command Line or a File
      8. Testing Your Program
      9. Running the Program to Test the Output
    2. Solution 1: Iterating and Counting the Characters in a String
      1. Counting the Nucleotides
      2. Writing and Verifying a Solution
    3. Additional Solutions
      1. Solution 2: Creating a count() Function and Adding a Unit Test
      2. Solution 3: Using str.count()
      3. Solution 4: Using a Dictionary to Count All the Characters
      4. Solution 5: Counting Only the Desired Bases
      5. Solution 6: Using collections.defaultdict()
      6. Solution 7: Using collections.Counter()
    4. Going Further
    5. Review
  4. 2. Transcribing DNA into mRNA: Mutating Strings, Reading and Writing Files
    1. Getting Started
      1. Defining the Program’s Parameters
      2. Defining an Optional Parameter
      3. Defining One or More Required Positional Parameters
      4. Using nargs to Define the Number of Arguments
      5. Using argparse.FileType() to Validate File Arguments
      6. Defining the Args Class
      7. Outlining the Program Using Pseudocode
      8. Iterating the Input Files
      9. Creating the Output Filenames
      10. Opening the Output Files
      11. Writing the Output Sequences
      12. Printing the Status Report
      13. Using the Test Suite
    2. Solutions
      1. Solution 1: Using str.replace()
      2. Solution 2: Using re.sub()
    3. Benchmarking
    4. Going Further
    5. Review
  5. 3. Reverse Complement of DNA: String Manipulation
    1. Getting Started
      1. Iterating Over a Reversed String
      2. Creating a Decision Tree
      3. Refactoring
    2. Solutions
      1. Solution 1: Using a for Loop and Decision Tree
      2. Solution 2: Using a Dictionary Lookup
      3. Solution 3: Using a List Comprehension
      4. Solution 4: Using str.translate()
      5. Solution 5: Using Bio.Seq
    3. Review
  6. 4. Creating the Fibonacci Sequence: Writing, Testing, and Benchmarking Algorithms
    1. Getting Started
      1. An Imperative Approach
    2. Solutions
      1. Solution 1: An Imperative Solution Using a List as a Stack
      2. Solution 2: Creating a Generator Function
      3. Solution 3: Using Recursion and Memoization
    3. Benchmarking the Solutions
    4. Testing the Good, the Bad, and the Ugly
    5. Running the Test Suite on All the Solutions
    6. Going Further
    7. Review
  7. 5. Computing GC Content: Parsing FASTA and Analyzing Sequences
    1. Getting Started
      1. Get Parsing FASTA Using Biopython
      2. Iterating the Sequences Using a for Loop
    2. Solutions
      1. Solution 1: Using a List
      2. Solution 2: Type Annotations and Unit Tests
      3. Solution 3: Keeping a Running Max Variable
      4. Solution 4: Using a List Comprehension with a Guard
      5. Solution 5: Using the filter() Function
      6. Solution 6: Using the map() Function and Summing Booleans
      7. Solution 7: Using Regular Expressions to Find Patterns
      8. Solution 8: A More Complex find_gc() Function
    3. Benchmarking
    4. Going Further
    5. Review
  8. 6. Finding the Hamming Distance: Counting Point Mutations
    1. Getting Started
      1. Iterating the Characters of Two Strings
    2. Solutions
      1. Solution 1: Iterating and Counting
      2. Solution 2: Creating a Unit Test
      3. Solution 3: Using the zip() Function
      4. Solution 4: Using the zip_longest() Function
      5. Solution 5: Using a List Comprehension
      6. Solution 6: Using the filter() Function
      7. Solution 7: Using the map() Function with zip_longest()
      8. Solution 8: Using the starmap() and operator.ne() Functions
    3. Going Further
    4. Review
  9. 7. Translating mRNA into Protein: More Functional Programming
    1. Getting Started
      1. K-mers and Codons
      2. Translating Codons
    2. Solutions
      1. Solution 1: Using a for Loop
      2. Solution 2: Adding Unit Tests
      3. Solution 3: Another Function and a List Comprehension
      4. Solution 4: Functional Programming with the map(), partial(), and takewhile() Functions
      5. Solution 5: Using Bio.Seq.translate()
    3. Benchmarking
    4. Going Further
    5. Review
  10. 8. Find a Motif in DNA: Exploring Sequence Similarity
    1. Getting Started
      1. Finding Subsequences
    2. Solutions
      1. Solution 1: Using the str.find() Method
      2. Solution 2: Using the str.index() Method
      3. Solution 3: A Purely Functional Approach
      4. Solution 4: Using K-mers
      5. Solution 5: Finding Overlapping Patterns Using Regular Expressions
    3. Benchmarking
    4. Going Further
    5. Review
  11. 9. Overlap Graphs: Sequence Assembly Using Shared K-mers
    1. Getting Started
      1. Managing Runtime Messages with STDOUT, STDERR, and Logging
      2. Finding Overlaps
      3. Grouping Sequences by the Overlap
    2. Solutions
      1. Solution 1: Using Set Intersections to Find Overlaps
      2. Solution 2: Using a Graph to Find All Paths
    3. Going Further
    4. Review
  12. 10. Finding the Longest Shared Subsequence: Finding K-mers, Writing Functions, and Using Binary Search
    1. Getting Started
      1. Finding the Shortest Sequence in a FASTA File
      2. Extracting K-mers from a Sequence
    2. Solutions
      1. Solution 1: Counting Frequencies of K-mers
      2. Solution 2: Speeding Things Up with a Binary Search
    3. Going Further
    4. Review
  13. 11. Finding a Protein Motif: Fetching Data and Using Regular Expressions
    1. Getting Started
      1. Downloading Sequences Files on the Command Line
      2. Downloading Sequences Files with Python
      3. Writing a Regular Expression to Find the Motif
    2. Solutions
      1. Solution 1: Using a Regular Expression
      2. Solution 2: Writing a Manual Solution
    3. Going Further
    4. Review
  14. 12. Inferring mRNA from Protein: Products and Reductions of Lists
    1. Getting Started
      1. Creating the Product of Lists
      2. Avoiding Overflow with Modular Multiplication
    2. Solutions
      1. Solution 1: Using a Dictionary for the RNA Codon Table
      2. Solution 2: Turn the Beat Around
      3. Solution 3: Encoding the Minimal Information
    3. Going Further
    4. Review
  15. 13. Location Restriction Sites: Using, Testing, and Sharing Code
    1. Getting Started
      1. Finding All Subsequences Using K-mers
      2. Finding All Reverse Complements
      3. Putting It All Together
    2. Solutions
      1. Solution 1: Using the zip() and enumerate() Functions
      2. Solution 2: Using the operator.eq() Function
      3. Solution 3: Writing a revp() Function
    3. Testing the Program
    4. Going Further
    5. Review
  16. 14. Finding Open Reading Frames
    1. Getting Started
      1. Translating Proteins Inside Each Frame
      2. Finding the ORFs in a Protein Sequence
    2. Solutions
      1. Solution 1: Using the str.index() Function
      2. Solution 2: Using the str.partition() Function
      3. Solution 3: Using a Regular Expression
    3. Going Further
    4. Review
  17. II. Other Programs
  18. 15. Seqmagique: Creating and Formatting Reports
    1. Using Seqmagick to Analyze Sequence Files
    2. Checking Files Using MD5 Hashes
    3. Getting Started
      1. Formatting Text Tables Using tabulate()
    4. Solutions
      1. Solution 1: Formatting with tabulate()
      2. Solution 2: Formatting with rich
    5. Going Further
    6. Review
  19. 16. FASTX grep: Creating a Utility Program to Select Sequences
    1. Finding Lines in a File Using grep
    2. The Structure of a FASTQ Record
    3. Getting Started
      1. Guessing the File Format
    4. Solution
    5. Going Further
    6. Review
  20. 17. DNA Synthesizer: Creating Synthetic Data with Markov Chains
    1. Understanding Markov Chains
    2. Getting Started
      1. Understanding Random Seeds
      2. Reading the Training Files
      3. Generating the Sequences
      4. Structuring the Program
    3. Solution
    4. Going Further
    5. Review
  21. 18. FASTX Sampler: Randomly Subsampling Sequence Files
    1. Getting Started
      1. Reviewing the Program Parameters
      2. Defining the Parameters
      3. Nondeterministic Sampling
      4. Structuring the Program
    2. Solutions
      1. Solution 1: Reading Regular Files
      2. Solution 2: Reading a Large Number of Compressed Files
    3. Going Further
    4. Review
  22. 19. Blastomatic: Parsing Delimited Text Files
    1. Introduction to BLAST
    2. Using csvkit and csvchk
    3. Getting Started
      1. Defining the Arguments
      2. Parsing Delimited Text Files Using the csv Module
      3. Parsing Delimited Text Files Using the pandas Module
    4. Solutions
      1. Solution 1: Manually Joining the Tables Using Dictionaries
      2. Solution 2: Writing the Output File with csv.DictWriter()
      3. Solution 3: Reading and Writing Files Using pandas
      4. Solution 4: Joining Files Using pandas
    5. Going Further
    6. Review
  23. A. Documenting Commands and Creating Workflows with make
    1. Makefiles Are Recipes
    2. Running a Specific Target
    3. Running with No Target
    4. Makefiles Create DAGs
    5. Using make to Compile a C Program
    6. Using make for a Shortcut
    7. Defining Variables
    8. Writing a Workflow
    9. Other Workflow Managers
    10. Further Reading
  24. B. Understanding $PATH and Installing Command-Line Programs
  25. Epilogue
  26. Index
  27. About the Author

Product information

  • Title: Mastering Python for Bioinformatics
  • Author(s): Ken Youens-Clark
  • Release date: May 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098100889