Chapter 13. Out-of-Memory Approaches: Tabix and SQLite
In this chapter, we’ll look at out-of-memory approaches: computational strategies built around storing and working with data kept out of memory on the disk. Reading data from a disk is much, much slower than working with data in memory (see “The Almighty Unix Pipe: Speed and Beauty in One”), but in many cases this is the approach we have to take when in-memory approaches (e.g., loading the entire dataset into R) or streaming approaches (e.g., using Unix pipes, as we did in Chapter 7) aren’t appropriate. Specifically, we’ll look at two tools to work with data out of memory: Tabix and SQLite databases.
Fast Access to Indexed Tab-Delimited Files with BGZF and Tabix
BGZF and Tabix solve a really important problem in genomics: we often need fast read-only random access to data linked to a genomic location or range. For the scale of data we encounter in genomics, retrieving this type of data is not trivial for a few reasons. First, the data may not fit entirely in memory, requiring an approach where data is kept out of memory (in other words, on a slow disk). Second, even powerful relational database systems can be sluggish when querying out millions of entries that overlap a specific region, an incredibly common operation in genomics. The tools we’ll see in this section are specially designed to get around these limitations, allowing fast random access to tab-delimited genome position data.
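To make the idea of indexed random access concrete, here is a toy Python sketch. It is not Tabix's actual index or file format (Tabix works on BGZF-compressed files); it only illustrates the general strategy of keeping a small index of byte offsets in memory and seeking directly to the relevant part of a sorted, tab-delimited file on disk. The function names `build_index` and `query`, the `step` parameter, and the three-column demo data are all hypothetical, and the sketch assumes each record has a single position and that the file is sorted by chromosome and then position.

```python
import bisect
import os
import tempfile


def build_index(path, step=2):
    """Map chrom -> (positions, offsets), sampling every `step`-th record.

    Assumes the file is sorted by chromosome, then by position.
    """
    index = {}
    counts = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # byte offset where this line begins
            line = handle.readline()
            if not line:
                break
            chrom, pos = line.split(b"\t")[:2]
            chrom = chrom.decode()
            n = counts.get(chrom, 0)
            if n % step == 0:  # the first record of each chromosome is always kept
                positions, offsets = index.setdefault(chrom, ([], []))
                positions.append(int(pos))
                offsets.append(offset)
            counts[chrom] = n + 1
    return index


def query(path, index, chrom, start, end):
    """Return records on `chrom` with start <= position <= end."""
    if chrom not in index:
        return []
    positions, offsets = index[chrom]
    # Jump to the last sampled record strictly before `start` (or the first
    # record of the chromosome), then scan forward from there.
    i = max(bisect.bisect_left(positions, start) - 1, 0)
    hits = []
    with open(path, "rb") as handle:
        handle.seek(offsets[i])
        for line in handle:
            fields = line.decode().rstrip("\n").split("\t")
            if fields[0] != chrom or int(fields[1]) > end:
                break  # past the requested region; stop reading
            if int(fields[1]) >= start:
                hits.append(fields)
    return hits


# Made-up demo data: chromosome, position, label.
demo = "chr1\t100\tA\nchr1\t250\tB\nchr1\t400\tC\nchr1\t900\tD\nchr2\t50\tE\nchr2\t300\tF\n"
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write(demo)
    demo_path = tmp.name

demo_index = build_index(demo_path)
print(query(demo_path, demo_index, "chr1", 200, 500))
os.remove(demo_path)
```

With the demo data, the query for chr1 positions 200 through 500 returns the records at positions 250 and 400. Only the lines between the nearest sampled offset and the end of the region are read, which is the property that makes this style of access practical for files too large to load into memory. Handling features that span ranges (rather than single positions) needs a more careful index, which is part of what Tabix provides.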
In the chapter on alignment, we saw ...