Chapter 47. Metadata ≥ Data

Jonathan Seidman

My first real experience in the big data world was helping to deploy Apache Hadoop clusters at Orbitz Worldwide, a heavily trafficked online travel site. One of the first things we did was deploy Apache Hive on our clusters and provide access to our developers to start building applications and analyses on top of this infrastructure.

This was all great, in that it allowed us to unlock tons of value from all of this data we were collecting. After a while, though, we noticed that we ended up with numerous Hive tables that basically represented the same entities. From a resource standpoint, this wasn’t that awful, since even in the dark ages of the aughts, storage was pretty cheap. However, our users’ time was not cheap, so all the time they spent creating new Hive tables, or searching our existing tables to find the data they needed, was time they weren’t spending on getting insights from the data.

The lesson we learned at Orbitz was that it’s a mistake to treat data management planning as an afterthought. Instead, it’s best to start planning your data management strategy early, ideally in parallel with any new data initiative or project.

Having a data management infrastructure that includes things like metadata management isn’t only critical for allowing users to perform data discovery and make optimal use of your data. It’s also crucial ...
