Chapter 5. Azure Databricks (and Apache Spark)

In genomics, we often have to perform computationally intensive tasks such as sequence alignment, variant calling, and machine learning. Traditionally, we run these tasks on our local workstations or, if we’re really fancy, on a physical computing cluster. The trouble with both options is that they limit our scale or our access right when we need it most.

Physical workstations are quite convenient, but they won’t scale to workloads that exceed the machine’s memory or disk space. Plus, they usually have only a small number of cores, which limits how much parallelization we can achieve when running code.

High-performance computing (HPC) clusters are often a pain to use because you usually need to submit your job to a queue. This means that you can’t really run your code interactively, which makes debugging very difficult. Plus, job queue purgatory is awful when you’re racing against a deadline and really need your results fast.

Luckily, the Azure cloud has quite a few options for flexibly running various types of tasks at scale. In the next few chapters, I’ll cover several computational services and how they can make your life easier when analyzing data. In this chapter, we’ll walk through how Spark-based cluster computing can elevate our data analysis capabilities and how Azure Databricks in the cloud alleviates most of our struggles that we’ve previously ...
