Book description
Learn the data skills necessary for turning large sequencing datasets into reproducible and robust biological findings. With this practical guide, you’ll learn how to use freely available open source tools to extract meaning from large, complex biological datasets.
At no other point in human history has our ability to understand life’s complexities been so dependent on our skills to work with and analyze data. This intermediate-level book teaches the general computational and data skills you need to analyze biological data. If you have experience with a scripting language like Python, you’re ready to get started.
- Go from handling small problems with messy scripts to tackling large problems with clever methods and tools
- Process bioinformatics data with powerful Unix pipelines and data tools
- Learn how to use exploratory data analysis techniques in the R language
- Use efficient methods to work with genomic range data and range operations
- Work with common genomics data file formats like FASTA, FASTQ, SAM, and BAM
- Manage your bioinformatics project with the Git version control system
- Tackle tedious data processing tasks with Bash scripts and Makefiles
Table of contents
- Preface
- The Approach of This Book
- Why This Book Focuses on Sequencing Data
- Audience
- The Difficulty Level of Bioinformatics Data Skills
- Assumptions This Book Makes
- Supplementary Material on GitHub
- Computing Resources and Setup
- Organization of This Book
- Code Conventions
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- I. Ideology: Data Skills for Robust and Reproducible Bioinformatics
- 1. How to Learn Bioinformatics
- Why Bioinformatics? Biology’s Growing Data
- Learning Data Skills to Learn Bioinformatics
- New Challenges for Reproducible and Robust Research
- Reproducible Research
- Robust Research and the Golden Rule of Bioinformatics
- Adopting Robust and Reproducible Practices Will Make Your Life Easier, Too
- Recommendations for Robust Research
- Pay Attention to Experimental Design
- Write Code for Humans, Write Data for Computers
- Let Your Computer Do the Work For You
- Make Assertions and Be Loud, in Code and in Your Methods
- Test Code, or Better Yet, Let Code Test Code
- Use Existing Libraries Whenever Possible
- Treat Data as Read-Only
- Spend Time Developing Frequently Used Scripts into Tools
- Let Data Prove That It’s High Quality
- Recommendations for Reproducible Research
- Continually Improving Your Bioinformatics Data Skills
- II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project
- 2. Setting Up and Managing a Bioinformatics Project
- 3. Remedial Unix Shell
- 4. Working with Remote Machines
- 5. Git for Scientists
- Why Git Is Necessary in Bioinformatics Projects
- Installing Git
- Basic Git: Creating Repositories, Tracking Files, and Staging and Committing Changes
- Git Setup: Telling Git Who You Are
- git init and git clone: Creating Repositories
- Tracking Files in Git: git add and git status Part I
- Staging Files in Git: git add and git status Part II
- git commit: Taking a Snapshot of Your Project
- Seeing File Differences: git diff
- Seeing Your Commit History: git log
- Moving and Removing Files: git mv and git rm
- Telling Git What to Ignore: .gitignore
- Undoing a Stage: git reset
- Collaborating with Git: Git Remotes, git push, and git pull
- Creating a Shared Central Repository with GitHub
- Authenticating with Git Remotes
- Connecting with Git Remotes: git remote
- Pushing Commits to a Remote Repository with git push
- Pulling Commits from a Remote Repository with git pull
- Working with Your Collaborators: Pushing and Pulling
- Merge Conflicts
- More GitHub Workflows: Forking and Pull Requests
- Using Git to Make Life Easier: Working with Past Commits
- Working with Branches
- Continuing Your Git Education
- 6. Bioinformatics Data
- III. Practice: Bioinformatics Data Skills
- 7. Unix Data Tools
- Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls
- When to Use the Unix Pipeline Approach and How to Use It Safely
- Inspecting and Manipulating Text Data with Unix Tools
- Inspecting Data with Head and Tail
- less
- Plain-Text Data Summary Information with wc, ls, and awk
- Working with Column Data with cut and Columns
- Formatting Tabular Data with column
- The All-Powerful Grep
- Decoding Plain-Text Data: hexdump
- Sorting Plain-Text Data with Sort
- Finding Unique Values in Uniq
- Join
- Text Processing with Awk
- Bioawk: An Awk for Biological Formats
- Stream Editing with Sed
- Advanced Shell Tricks
- The Unix Philosophy Revisited
- 8. A Rapid Introduction to the R Language
- Getting Started with R and RStudio
- R Language Basics
- Working with and Visualizing Data in R
- Loading Data into R
- Exploring and Transforming Dataframes
- Exploring Data Through Slicing and Dicing: Subsetting Dataframes
- Exploring Data Visually with ggplot2 I: Scatterplots and Densities
- Exploring Data Visually with ggplot2 II: Smoothing
- Binning Data with cut() and Bar Plots with ggplot2
- Merging and Combining Data: Matching Vectors and Merging Dataframes
- Using ggplot2 Facets
- More R Data Structures: Lists
- Writing and Applying Functions to Lists with lapply() and sapply()
- Working with the Split-Apply-Combine Pattern
- Exploring Dataframes with dplyr
- Working with Strings
- Developing Workflows with R Scripts
- Further R Directions and Resources
- 9. Working with Range Data
- A Crash Course in Genomic Ranges and Coordinate Systems
- An Interactive Introduction to Range Data with GenomicRanges
- Installing and Working with Bioconductor Packages
- Storing Generic Ranges with IRanges
- Basic Range Operations: Arithmetic, Transformations, and Set Operations
- Finding Overlapping Ranges
- Finding Nearest Ranges and Calculating Distance
- Run Length Encoding and Views
- Storing Genomic Ranges with GenomicRanges
- Grouping Data with GRangesList
- Working with Annotation Data: GenomicFeatures and rtracklayer
- Retrieving Promoter Regions: Flank and Promoters
- Retrieving Promoter Sequence: Connection GenomicRanges with Sequence Data
- Getting Intergenic and Intronic Regions: Gaps, Reduce, and Setdiffs in Practice
- Finding and Working with Overlapping Ranges
- Calculating Coverage of GRanges Objects
- Working with Ranges Data on the Command Line with BEDTools
- 10. Working with Sequence Data
- 11. Working with Alignment Data
- 12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks
- 13. Out-of-Memory Approaches: Tabix and SQLite
- Fast Access to Indexed Tab-Delimited Files with BGZF and Tabix
- Introducing Relational Databases Through SQLite
- When to Use Relational Databases in Bioinformatics
- Installing SQLite
- Exploring SQLite Databases with the Command-Line Interface
- Querying Out Data: The Almighty SELECT Command
- SQLite Functions
- SQLite Aggregate Functions
- Subqueries
- Organizing Relational Databases and Joins
- Writing to Databases
- Dropping Tables and Deleting Databases
- Interacting with SQLite from Python
- Dumping Databases
- 14. Conclusion
- Glossary
- Bibliography
- Index
Product information
- Title: Bioinformatics Data Skills
- Author(s):
- Release date: July 2015
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449367503