Chapter 2. Organizing Genomics Data with Data Lakes

The concept of a data lake may seem quite foreign at first. In the business world, you may be used to storing data in structured locations like a database. In the research world (or in academia), you may have put your data on a network storage device or an FTP server. And in your personal life, you’ve probably used cloud storage services like Dropbox, Google Drive, or Microsoft OneDrive. There are concepts from each of these types of data storage that will make data lakes seem more familiar in time.

You can think of a data lake as a general location to store your data. It is organized in directories, much like the files on your hard drive or in your personal cloud storage. It supports structured data, like CSV files, and unstructured data, like images. Plus, a data lake in Azure lets you reach your data from any of the other Azure services you want to use. Finally, you can share or secure different parts of your lake so that not everyone can swim wherever they please.

Here are some benefits of data lakes:

  • Data is kept at each stage of processing and stored in “zones” organized by level of maturity, cleanliness, and utility.

  • Data lakes are more flexible than databases or data warehouses because they can store virtually any type of data, whether structured, semistructured, or unstructured.

  • Users can browse a data lake in much the same way as they’d browse files on their computer. ...
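
To make the idea of zones and directory-style browsing concrete, here is a minimal Python sketch. It is not an Azure API; it only models lake paths as plain strings. The zone names (raw, cleaned, curated), the project name, the filenames, and the helpers lake_path and list_zone are all hypothetical, chosen purely for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical zone names, ordered from least to most processed.
ZONES = ("raw", "cleaned", "curated")


def lake_path(zone: str, project: str, filename: str) -> str:
    """Build a directory-style path inside one zone of the lake."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return str(PurePosixPath(zone) / project / filename)


def list_zone(paths, zone):
    """Return the paths under one zone, like browsing a folder."""
    prefix = zone + "/"
    return sorted(p for p in paths if p.startswith(prefix))


# A toy "lake" holding structured (CSV) and unstructured (PNG) files.
lake = [
    lake_path("raw", "project-a", "sample1.fastq.gz"),
    lake_path("raw", "project-a", "gel-image.png"),
    lake_path("cleaned", "project-a", "sample1.bam"),
    lake_path("curated", "project-a", "variants.csv"),
]

for path in list_zone(lake, "raw"):
    print(path)
```

Keeping the zone as the first path segment means that sharing or securing a zone can be reasoned about one top-level folder at a time, which is one plausible way to read the "zones" idea in the list above.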
