Preface
If cloud technology is the future of biomedical science, then for genomics, the future is already here.
Genomics is the first biomedical discipline to move en masse to the cloud. This was perhaps inevitable, given that it was the first to experience explosive growth in data generation, leading to rapidly escalating compute and storage requirements that a cloud infrastructure is ideally positioned to address. Major genomic datasets and their derived resources are now available in the cloud, and many tools, like the industry-leading Genome Analysis Toolkit (GATK) produced by the Broad Institute, are now offered in forms optimized to run efficiently on a cloud infrastructure. As a result, many researchers who use genomic data and related analysis tools are now, or soon will be, confronted with the need to learn to use cloud resources, which can be a huge challenge. Meanwhile, many informatics and bioinformatics support staff are being pulled in to help researchers make this transition, sometimes with minimal or no training relevant to the science of genomics. Taken together, these two populations form a continuum of people who need to get on the same page and work together to solve the challenges they face.
Purpose, Scope, and Intended Audience of This Book
With this book, we aim to provide a hands-on orientation tour of major tools, mechanisms, and processes involved in performing genomic analysis in the cloud that can serve as a middle ground for the majority of people on this spectrum. We try to assume as little prior knowledge as possible, and we provide two primer-style chapters, one focused on genomics and one on technology, to ensure that everyone has a firm grounding in the fundamental concepts we rely on from both domains. In addition, we deliberately chose a particular open source technology stack—GATK, Workflow Description Language (WDL), Terra, Docker, and Google Cloud Platform—that provides end-to-end functionality and is backed by robust user support systems in order to guarantee a successful educational experience.
To be clear, this book is not intended to be comprehensive, either in terms of tooling options or the scientific scope of genomic analyses. Our operational definition of genomics, centered on variant discovery and immediately related analyses, is intentionally narrow; and for every step of the processes we describe, there often exist several, if not many, alternative tools that you could substitute for those we chose to showcase. However, we designed the topics and exercises presented here to provide patterns and takeaways that are largely transferable and extensible to other tools and analyses in order to maximize their long-term value to readers. In addition, we plan to release a series of companion blog posts and other online materials that will show complementary approaches using different platforms and technologies; see the book’s GitHub repository and its companion website.
What You Will Learn from This Book
The very idea of doing genomics in the cloud might seem intimidating on first approach, especially if you’re new to either one or both, but it’s not as complicated as you might think. Throughout this book, we walk you through all of the important pieces of the puzzle, step by step. You’ll have the opportunity to run genomic analyses involving the GATK, selected for their broad appeal and interesting computational approaches. You’ll do so first through the “bare” services provided by the Google Cloud Platform (GCP) and then on Terra, a scalable platform for biomedical research codeveloped by the Broad Institute and Verily, an Alphabet company, on top of GCP.
By the end of the book, you should expect to have learned or achieved the following:
- Fundamentals of computational infrastructure and processes
- Fundamentals of genomics including biological underpinnings, formats, and conventions
- Beginner- to intermediate-level hands-on usage of the core technology stack:
  - GATK, WDL, Terra, Docker, and Google Cloud
  - GATK Best Practices for variant discovery as formulated by the GATK development team at the Broad Institute, covering germline short variants, somatic short variants, and somatic copy-number alterations
  - Reading, authoring, and interpreting analysis workflows, first in a sandbox environment and then at scale through several modes of execution (from a standalone command-line package to a fully managed system)
  - Managing data and workflow execution in a workspace environment
  - Performing interactive analysis using Jupyter Notebooks
  - Tying it all together: achieving computational reproducibility in publications through the use of cloud data storage, synthetic data generation, portable workflows, and containerized tools
- Secondary goals
  - Increased familiarity with computational concepts such as scaling and optimization approaches
  - Practical experience with several bioinformatics command-line packages, common commands, and file formats
What Computational Experience Is Needed for the Exercises?
For the exercises in Chapter 4 through Chapter 10, we assume that you are already somewhat familiar with command-line fundamentals, including the basics of navigating directories and interacting with text files in a Bash shell; composing and running simple commands; and the concepts of environment variables, path, and working directory. For Chapter 8 through Chapter 11 and Chapter 13, we assume that you are familiar with the concept of writing scripts, though we do not require you to have practical experience doing so. For Chapter 12 and Chapter 14, we assume that you have heard of the programming languages R and Python, and you will find it easier to understand the more complex examples if you have some familiarity with their syntax, though it is not required.
If at any point during the exercises you feel out of your depth in terms of the computational tooling and terminology, we recommend that you check out the lessons provided by the Software Carpentry organization, which are specifically designed for research scientists who have not had formal computational training. The lessons on the Unix shell can be particularly helpful if you don’t have any prior command-line experience. They also have sets of lessons on Python and on R as well as other topics relevant to the book like version control with Git. These lessons are all open source and developed by volunteers in the community who understand the everyday challenges faced by researchers, so they’re a truly fantastic resource.
Conventions Used in This Book
The following typographical conventions are used in this book:
- Italic: Indicates new terms, URLs, email addresses, filenames, file extensions, table names and components, and workflows.
- Constant width: Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
- Constant width bold: Shows text that should be typed literally by the user.
- Constant width italic: Shows text that should be replaced with user-supplied values or by values determined by context.
- $ before code: Indicates a command run in the VM shell.
- # before code: Indicates a command run in the Docker container.
Note
This element signifies a note.
Using Code Examples
Supplemental material (code examples, exercises, full-size color figures, etc.) is available for download on GitHub.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Genomics in the Cloud by Geraldine A. Van der Auwera and Brian D. O’Connor (O’Reilly). Copyright 2020 The Broad Institute, Inc. and Brian O’Connor, 978-1-491-97519-0.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
O’Reilly Online Learning
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
- O’Reilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/genomics-cloud.
Email bookquestions@oreilly.com to comment or ask technical questions about this book.
To learn more about our books, courses, and news, visit http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
We would like to thank our countless colleagues at the Broad Institute and at the University of California, Santa Cruz (UCSC), who contributed in so many ways to making this book a reality.
We are hugely indebted to all the past and present members of the frontline support and education teams in the Data Sciences Platform at the Broad Institute who developed and maintain the original educational materials and resources on which we based many of the hands-on exercises presented in this book. Within the education team led by Robert Majovski, we’d like to highlight the work of Soo Hee Lee, whose thoroughness and exacting attention to detail produced some of the deepest resources available about GATK tools; Allie Hajian and Anton Kovalsky, who are tasked with the Herculean feat of documenting how to use Terra even as it wriggles and evolves from underneath them; and Kate Noblett, who wrote much of the original WDL documentation and now coordinates GATK, WDL, and Terra workshops with an iron hand. Within the frontline support team led by Tiffany Miller, we’d like to highlight the work of Beri Shifaw, who maintains the gatk-workflows pipelines on GitHub and in Dockstore as well as the featured workspaces in Terra; and Bhanu Gandham, who has so enthusiastically taken on the responsibility of obsessing about the well-being of the GATK user community. Other contributing members from these two teams, past and present, include Derek Caetano-Anolles, Sushma Chaluvadi, Sheila Chandran, Elizabeth Kiernan, David Kling, Ron Levine, and Adelaide Rhodes.
We also recognize and appreciate the growing role played by the Broad DSP Field Engineering team led by Alexander Baumann in this arena. Star among the stars, Yvonne Blanco swooped in from the User Experience team to improve key diagrams and illustrations with her impeccable design mojo.
We are eternally grateful to the many members of the GATK development team who have provided critical input to educational resources and lent their expertise in GATK workshops across the globe. There are too many of them to enumerate here, but within that team, we would like to highlight the invaluable support of Eric Banks, Laura Gauthier, Yossi Farjoun, and Lee Lichtenstein; the seemingly endless patience of David Benjamin and Sam Lee; the unflappable aplomb of David Roazen and jovial fatalism of Louis Bergelson; the quiet expertise of Mark “Duplicates” Fleharty and the cheerful expertise of Megan Shand. Special shout-out also to Chris Norman for his work on the Barclay library, which powers the GATK documentation system.
On a more personal level, Geraldine would like to thank Mauricio Carneiro and Mark De Pristo, past member and founder of the original GATK team, respectively, for taking a chance and hiring a confused microbiologist all those years ago.
Speaking of too many to count, we could not begin to name everyone involved in the development of the chapters on WDL, Cromwell, and Terra, but we’d like to put in a special mention for Adrian “Notebooks Guy” Sharma, William Disman, Ruchi Munshi, and Kyle Vernest, who all contributed helpful insights and put up with our constant badgering about issues we hoped to see addressed before the book came out. On that note, we owe a big thank you to Chris Llanwarne and Adam Nichols for patching womtool just in time for Chapter 9 to make a lot more sense than it would have otherwise. And speaking of badgering, our deepest apologies go to Eric Karofsky and Jerôme Chadel from the User Experience team, who had to endure a constant barrage of questions about what elements of the Terra interface would change next and on what timeline. We’re deeply grateful to Matthieu J. Miossec for collaborating with us to develop the project we present in Chapter 14.
Within the UCSC GI, we want to thank the Computational Genomics Platform (CGP) team whose members work on a variety of projects that leverage Terra and other cloud-based analysis ecosystem components we present in this book. Contributors include Jesse Brennan, Amar Jandu, Natan Lao, Melaina Legaspi, Geryl Pelayo, Charles Reid, Hannes Schmidt, and Daniel Sotirhos. Within CGP, the Lighthouse Point team—Michael Baumann (now at the Broad Institute), Lon Blauvelt, Brian Hannafious, and Ash O’Farrell, led by Beth Sheets—deserves special recognition for their role in writing excellent research tutorials that helped inspire sections of the book.
We also want to thank the Dockstore teams at both UCSC and the Ontario Institute for Cancer Research (OICR) for their feedback on this effort and support building a platform for workflow sharing that contributes to the Terra ecosystem. Charles Overbeck leads the technical team at UCSC, and we are grateful for contributions by Louise Cabansay, Abraham Chavez, Andy Chen, Trevor Heathorn, Nneka Olunwa, Kevin Osborn, Natalie Perez, Walter Shands, Emily Soth, Cricket Sloan, and David Steinberg. Denis Yuen leads the technical team at OICR with Lincoln Stein as the PI and contributions from Ryan Bautista, Kitty Cao, Andy Chen, Vincent Chung, Andrew Duncan, Victor Liu, Gary Luu, Shreya Radesh, and Jennifer Wu.
None of this would have been possible without the support of our respective leadership teams. At the Broad Institute, we would like to thank Eric Lander, Lee McGuire, and the Data Sciences Platform leaders, particularly Anthony Philippakis, Eric Banks again, and Danielle Ciofani for keeping the faith that this book would eventually materialize. At UCSC, we thank the Genomics Institute (GI) leadership including Benedict Paten and the institute director, David Haussler, for their support along with Greta Martin, whose organizational skills are unrivaled, and Nadine Gassner, who keeps us funded so that we can work on cool projects.
We are forever grateful to the reviewers who took the time to read through early draft versions in order to help us identify what didn’t work reliably and to understand what could be improved. The book you see before you is very different from what we originally gave them to evaluate, for the better. In this category, we salute Titus Brown, Aaron Chevalier, Jeff Gentry, Sean Horgan, Lynn Langit, Lee Lichtenstein, Jessica Maia, David Mohs, Andrew Moschetti, Anubhav Shelat, and Jonn Smith.
We are also incredibly grateful to the editorial team at O’Reilly who performed the truly magical feat of turning our manuscript—a loose conglomerate of Google Docs—into an actual book. In particular, we thank our development editor, Michele Cronin, for shepherding us from early drafts to the finished product. It took a lot of cajoling and a few stern reminders about deadlines to get us there.
Last but most definitely not least, we would like to thank our loved ones for their patience and support during the more than two years that it took us to produce this book. Geraldine hopes that her lovely wife, Jessica, and daughters, Gabrielle and Melanie, will be suitably impressed and somehow forget her many late nights, obsessive behavior, and general inability to complete any home-improvement projects during that time period. Meanwhile, Brian thanks his partner Dhawal for his infinite patience, understanding, and encouragement to finish the book, along with his mom (Patty) and dad (Jim) for providing the occasional and appreciated push to “get it done!”