Chapter 12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks
I've waited until the penultimate chapter of this book to share a regrettable fact: everyday bioinformatics work often involves a great deal of tedious data processing. Bioinformaticians regularly need to run a sequence of commands on not just one file, but dozens (sometimes even hundreds) of files. Consequently, a large part of bioinformatics is patching together various processing steps into a pipeline, and then repeatedly applying this pipeline to many files. This isn't exciting scientific work, but it's a necessary hurdle before tackling more exciting analyses.
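The "apply one pipeline to many files" pattern above is most often expressed as a shell loop. The sketch below is a minimal, hedged illustration: the `data/*.fastq` layout and the commented-out trimming step are hypothetical placeholders, not commands from this book.

```shell
#!/bin/bash
# A minimal sketch of running the same processing step over many files.
# The directory layout (data/, results/) and the trimming step are
# hypothetical placeholders for whatever your pipeline actually does.
for fastq in data/*.fastq; do
    # strip the directory and extension to build the output name
    base=$(basename "$fastq" .fastq)
    # hypothetical processing step, one output file per input file:
    # trim_adapters "$fastq" > "results/${base}_trimmed.fastq"
    echo "would process $fastq -> results/${base}_trimmed.fastq"
done
```

The key idea is that the loop body is written once and the shell supplies each filename in turn, so adding a hundredth input file requires no new code.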
While writing pipelines is a daily burden for bioinformaticians, it's essential that pipelines are written to be robust and reproducible. Pipelines must be robust to problems that might occur during data processing. When we execute a series of commands on data directly in the shell, we usually see clearly if something goes awry: output files are empty when they should contain data, or programs exit with an error. However, when we run data through a processing pipeline, we sacrifice the careful attention we paid to each step's output to gain the ability to automate processing of numerous files. The catch is that not only are errors still likely to occur, they're *more* likely to occur, because we're automating processing over more data files and using more steps. For these reasons, it's critical to construct robust pipelines. ...
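One concrete way Bash scripts are commonly made robust to the silent failures described above is a set of defensive options at the top of the script. This is a general Bash idiom rather than a prescription from this chapter, and the commented-out pipeline is a hypothetical example:

```shell
#!/bin/bash
# A minimal sketch of defensive settings for a pipeline script.
set -e          # exit immediately if any command exits with a nonzero status
set -u          # treat expansion of unset variables as an error
set -o pipefail # a pipeline fails if *any* command in it fails,
                # not just the last one

# Hypothetical example of why pipefail matters: without it, a failure in
# the first command is masked by the success of the commands downstream.
# samtools view -h aln.bam | grep -v "^@" | sort > aln_body_sorted.sam
```

Without `pipefail`, a pipeline's exit status is that of its last command, so an upstream failure can go unnoticed and the script happily continues with truncated or empty data.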