Genomics in the Azure Cloud

Book description

This practical guide bridges the gap between general cloud computing architecture in Microsoft Azure and scientific computing for bioinformatics and genomics. You'll get a solid understanding of the architecture patterns and services offered in Azure and how they can be used in your bioinformatics practice. You'll also get code examples that you can reuse for your own needs, along with concrete examples that show how a given service is used in a bioinformatics context.

You'll also get valuable advice on how to:

  • Use enterprise platform services to easily scale your bioinformatics workloads
  • Organize, query, and analyze genomic data at scale
  • Build a genomics data lake and accompanying data warehouse
  • Use Azure Machine Learning to scale your model training, track model performance, and deploy winning models
  • Orchestrate and automate processing pipelines using Azure Data Factory and Databricks
  • Cloudify your organization's existing bioinformatics pipelines by moving your workflows to Azure high-performance compute services
  • And more

Table of contents

  1. Preface
    1. Who Should Read This Book
    2. How the Book Is Organized
    3. Software and Hardware Requirements
    4. Code Conventions and Downloads
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
  2. 1. Essentials of Cloud Architecture
    1. Cloud Horsepower
      1. Considerations for the Cloud
      2. Three Benefits of the Cloud
    2. Types of Cloud Services
      1. Infrastructure Services
      2. Platform Services
      3. Software Services
    3. Azure Environment Organization
    4. Getting an Azure Account
    5. Welcome to the Azure Portal
      1. Setting Up a Resource Group
      2. Creating Resources
      3. Free Services
    6. Basics of the Bioinformatics Workflow
      1. Primary Analysis
      2. Secondary Analysis
      3. Tertiary Analysis
      4. Other Analyses
      5. Other File Formats
  3. 2. Organizing Genomics Data with Data Lakes
    1. Organizing Your Genomics Data
      1. Going for Bronze, Silver, and Gold
      2. Letting Your Bioinformatics Workflow Dictate Your Data Lake Organization
      3. Planning for -omics and Non-omics Data Together
    2. Creating a Data Lake with Azure Storage
      1. Blob Storage Versus Data Lake Storage
    3. Balancing Costs Versus Performance in Data Storage
      1. The Goldilocks Method of Storage Tiers
      2. Genomics Data Lifecycle
    4. Managing Access Inside the Lake
      1. Role-Based Access Control
      2. Access-Control Lists
    5. Azure Open Datasets for Genomics
  4. 3. Querying Variant Data in SQL
    1. Building a Genomics Data Warehouse
      1. Example: Lab Results
      2. Data Warehouse Architecture for Genomics
    2. Azure Synapse Analytics
      1. Creating an Azure Synapse Analytics Workspace
      2. Registering Services in Subscriptions
      3. Getting to Work in the Synapse Workspace
      4. Using Open Row Sets
      5. Creating External Tables
      6. Did Someone Say “Pool Party”?
    3. Connecting to More Data Sources
    4. Azure SQL DB
      1. Creating a Database in Azure SQL DB
    5. Relaxing at Your Genomics Data Lakehouse
      1. Efficient File Formats
  5. 4. Orchestrating Data Movement and Transformation
    1. Creating Your Data Factory
    2. Getting Started with Data Movement
      1. Getting Data into Your Data Lake Using the Copy Data Tool
      2. Linking to NCBI’s FTP Server
      3. Transforming Data Using Data Flows
      4. Building and Triggering Pipelines for Automation
  6. 5. Azure Databricks (and Apache Spark)
    1. Introduction to Apache Spark and Databricks
    2. Setting Up an Azure Databricks Workspace
      1. Connecting Databricks to Your Data Lake
    3. Processing Variant Data with the Glow Package
      1. Exploring DataFrames
    4. Automating Variant Data Processing
      1. Orchestrating a Databricks Notebook from Data Factory
      2. A Brief Interlude About Distributed File Formats
    5. Using Other Tools in Databricks
      1. Single-Node Bioinformatics Tools
      2. Koalas
      3. Hail
  7. 6. Azure Machine Learning
    1. How to Scale Machine Learning Tasks
    2. Creating an Azure Machine Learning Workspace
    3. Training a Drug Sensitivity Model
      1. Creating a Compute Instance in Azure Machine Learning Studio
      2. Datastores and Datasets
      3. Experimenting with Cluster-Based Training
    4. Automating Model Training with AutoML
      1. Explainable Machine Learning
    5. Using Azure Machine Learning Not for Machine Learning
      1. Performing Alignment in a Notebook
      2. Custom Docker Images for Bioinformatics
  8. 7. High-Performance Computing and Other Compute Services
    1. Bring Your Own Pipeline (BYOP)
      1. Why Azure for HPC?
    2. Azure Batch
      1. Scaling Workloads with Cromwell
    3. Azure CycleCloud
      1. Setting Up CycleCloud Clusters
    4. Microsoft Genomics
      1. Alignment and Variant Calling with the msgen Package
  9. 8. Deployment, Security, Compliance, and Potpourri
    1. Automating the Deployment of Cloud Resources
      1. Dev, Staging, and Prod
      2. Lifting Your Deployment with ARMs and Biceps
    2. Security Planning
      1. Azure Active Directory
      2. Role-Based Access Controls and Access-Control Lists
    3. Compliance
      1. HIPAA, HITECH, and HITRUST
      2. Azure Blueprints
    4. Cost Considerations
      1. Azure Pricing Calculator
      2. Retail Pricing Versus Enterprise Agreements
      3. Budgeting Examples
    5. Quota Problems
      1. Please, Sir, Can I Have Some More (vCPUs)?
    6. Getting General Support
  10. Conclusion
    1. Looking Backward
      1. Baby Azure
    2. What Else?
      1. Using Other Web-Based Bioinformatics Platforms
    3. Looking Forward
      1. Cheaper Sequencing = More Data
  11. Index
  12. About the Author

Product information

  • Title: Genomics in the Azure Cloud
  • Author(s): Colby T. Ford
  • Release date: November 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098139049