Chapter 29. Embracing Data Silos

Bin Fan and Amelia Wong

Working in the big data and machine learning space, we frequently hear from data engineers that the biggest obstacle to extracting value from data is accessing that data efficiently. Data silos, isolated islands of data, are often viewed by data engineers as the key culprit. Over the years, there have been many attempts to resolve the challenges caused by data silos, but those attempts have often resulted in even more data silos. Rather than attempting to eliminate data silos, we believe the right approach is to embrace them.

Why Data Silos Exist

Data silos exist for three main reasons. First, within any organization there is data with varying characteristics (Internet of Things data, behavioral data, transactional data, etc.) that is intended for different uses, and some of that data will be more business-critical than the rest. This drives the need for disparate storage systems. Second, history has shown that every 5 to 10 years a new wave of storage technologies produces storage systems that are faster, cheaper, or better designed for certain types of data. Organizations also want to avoid vendor lock-in and, as a result, will diversify their data storage. Third, regulations mandate the siloing of data. ...
