Chapter 1. Overview

Organizations today are bursting at the seams with data, including data in existing databases, output from applications, and streaming data from ecommerce, social media, apps, and connected devices on the Internet of Things (IoT).

We are all well versed in the data warehouse, which is designed to capture the essence of the business from other enterprise systems, such as customer relationship management (CRM), inventory, and sales transaction systems, and which allows analysts and business users to gain insight and make important business decisions from that data.

But new technologies, including mobile, social platforms, and IoT, are driving much greater data volumes, higher expectations from users, and a rapid globalization of economies.

Organizations are realizing that traditional technologies can’t meet their new business needs.

As a result, many organizations are turning to scale-out architectures such as data lakes, using Apache Hadoop and other big data technologies. However, despite growing investment in data lakes and big data technology ($150.8 billion in 2017, an increase of 12.4% over 2016 [1]), just 14% of organizations report ultimately deploying their big data proof-of-concept (PoC) project into production [2].

One reason for this discrepancy is that many organizations do not see a return on their initial investment in big data technology and infrastructure. This is usually because those organizations fail to do data lakes right, falling short in designing the data lake properly and in managing the data within it effectively. Ultimately, these organizations create data “swamps” that are useful only for ad hoc exploratory use cases.

For those organizations that do move beyond a PoC, many are doing so by merging the flexibility of the data lake with some of the governance and control of a traditional data warehouse. This is the key to deriving significant ROI on big data technology investments.

Succeeding with Big Data

The first step to ensure success with your data lake is to design it with future growth in mind. The data lake stack can be complex, and requires decisions around storage, processing, data management, and analytics tools.

The next step is to address management and governance of the data within the data lake, also with the future in mind. How you manage and govern data in a discovery sandbox might not matter much, but how you manage and govern data in a production data lake environment, with multiple types of users and use cases, is critical. Enterprises need a clear view of lineage and quality for all their data.

It is critical to have a robust set of capabilities to ingest and manage the data, to store and organize it, to prepare and analyze it, and to secure and govern it. This is essential no matter which underlying platform you choose (streaming, batch, object storage, flash, in-memory, or file), and you need to provide these capabilities consistently through all the evolutions the data lake will undergo over the next few years.

The key takeaway? Organizations seeing success with big data are not just dumping data into cheap storage. They are designing and deploying data lakes for scale, with robust, metadata-driven data management platforms, which give them the transparency and control needed to benefit from a scalable, modern data architecture.

Definition of a Data Lake

There are numerous views out there on what constitutes a data lake, many of which are overly simplistic. At its core, a data lake is a central location in which to store all your data, regardless of its source or format. It is typically built using Hadoop or another scale-out architecture (such as the cloud) that enables you to cost-effectively store significant volumes of data.

The data can be structured or unstructured. You can then use a variety of processing tools—typically new tools from the extended big data ecosystem—to extract value quickly and inform key organizational decisions.

Because all data is welcome, data lakes are a powerful alternative to a traditional data warehouse and the data integration challenges it presents, especially as organizations turn to mobile and cloud-based applications and the IoT.

Some of the technical benefits of a data lake include the following:

  • The kinds of data from which you can derive value are unlimited. You can store all types of structured and unstructured data in a data lake, from CRM data to social media posts.

  • You don’t need to have all the answers upfront. Simply store the raw data; you can refine it as your understanding and insight improve.

  • You have no limits on how you can query the data. You can use a variety of tools to gain insight into what the data means.

  • You don’t create more silos. You can access a single, unified view of data across the organization.

The Differences Between Data Warehouses and Data Lakes

The differences between data warehouses and data lakes are significant. A data warehouse is fed data from a broad variety of enterprise applications. Naturally, each application’s data has its own schema. The data thus needs to be transformed to be compatible with the data warehouse’s own predefined schema.

Designed to collect only data that is controlled for quality and that conforms to an enterprise data model, the data warehouse is thus capable of answering only a limited number of questions. However, it is eminently suitable for enterprise-wide use.

Data lakes, on the other hand, are fed information in its native form. Little or no processing is performed to adapt the structure to an enterprise schema. The structure of the data is therefore not known when it is fed into the data lake; it is discovered only when the data is read.

The biggest advantage of data lakes is flexibility. By allowing the data to remain in its native format, a far greater—and timelier—stream of data is available for analysis. Table 1-1 shows the major differences between data warehouses and data lakes.

Table 1-1. Differences between data warehouses and data lakes
Schema
  Data warehouse: Schema-on-write
  Data lake: Schema-on-read

Scale
  Data warehouse: Scales to moderate-to-large volumes, at moderate cost
  Data lake: Scales to huge volumes, at low cost

Access Methods
  Data warehouse: Accessed through standardized SQL and BI tools
  Data lake: Accessed through SQL-like systems and developer-written programs; also supports big data analytics tools

Workload
  Data warehouse: Supports batch processing as well as thousands of concurrent users performing interactive analytics
  Data lake: Supports batch and stream processing, and handles big data queries from users better than a data warehouse does

Data
  Data warehouse: Cleansed
  Data lake: Raw and refined

Data Complexity
  Data warehouse: Complex integrations
  Data lake: Complex processing

Cost/Efficiency
  Data warehouse: Uses CPU and I/O efficiently, but storage and processing costs are high
  Data lake: Uses storage and processing capabilities efficiently, at very low cost

Benefits
  Data warehouse:
    • Transform once, use many
    • Easy to consume data
    • Fast response times
    • Mature governance
    • Provides a single enterprise-wide view of data from multiple sources
    • Clean, safe, secure data
    • High concurrency
    • Operational integration
  Data lake:
    • Transforms the economics of storing large amounts of data
    • Scales to execute on tens of thousands of servers
    • Allows use of any tool
    • Enables analysis to begin as soon as the data arrives
    • Allows use of structured and unstructured content from a single source
    • Supports agile modeling by letting users change models, applications, and queries
    • Supports analytics and big data analytics

Drawbacks
  Data warehouse:
    • Time consuming
    • Expensive
    • Difficult to conduct ad hoc and exploratory analytics
    • Only structured data
  Data lake:
    • Complexity of the big data ecosystem
    • Lack of visibility if data is not managed and organized
    • Big data skills gap

The Business Case for Data Lakes

We’ve discussed the tactical, architectural benefits of a data lake; now let’s discuss the business benefits it provides. Enterprise data warehouses have been most organizations’ primary mechanism for performing complex analytics, reporting, and operations. But they are too rigid to work in the era of big data, where large data volumes and broad data variety are the norm. It is challenging to change data warehouse data models, and field-to-field integration mappings are rigid. Data warehouses are also expensive.

Perhaps more important, most data warehouses require that business users rely on IT to do any manipulation or enrichment of data, largely because of the inflexible design, system complexity, and intolerance for human error in data warehouses. This slows down business innovation.

Data lakes can solve these challenges, and more. As a result, almost every industry has a potential data lake use case. For example, almost any organization would benefit from a more complete and nuanced view of its customers and can use data lakes to capture 360-degree views of those customers. With data lakes, whether used to augment the data warehouse or replace it altogether, organizations can finally unleash big data’s potential across industries.

Let’s look at a few business benefits that are derived from a data lake.

Freedom from the rigidity of a single data model

Because data can be unstructured as well as structured, you can store everything from blog postings to product reviews. And the data doesn’t need to be consistent to be stored in a data lake. For example, you might have the same type of information in very different data formats, depending on who is providing the data. This would be problematic in a data warehouse; in a data lake, however, you can put all sorts of data into a single repository without worrying about schemas that define the integration points between different data sets.
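To make this concrete, here is a minimal sketch in Python of landing heterogeneous files in a single raw zone exactly as they arrive. The paths, source-system names, and filenames are hypothetical placeholders, not a prescribed layout:

    # Land files of any format in the raw zone, untouched.
    import shutil
    from datetime import date
    from pathlib import Path

    LAKE_RAW = Path("/data/lake/raw")  # assumed raw-zone root

    def land_raw(source_file: str, source_system: str) -> Path:
        """Copy a file into the raw zone as-is, partitioned by source and date."""
        target_dir = LAKE_RAW / source_system / date.today().isoformat()
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / Path(source_file).name
        shutil.copy2(source_file, target)  # no parsing, no schema enforcement
        return target

    # JSON reviews, CSV exports, and sensor images all land side by side.
    for f, system in [("reviews.json", "web"), ("orders.csv", "erp"), ("scan.png", "iot")]:
        land_raw(f, system)

Because nothing is parsed at ingest, an unexpected format can never block the load; interpretation is deferred until someone reads the data.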

Ability to handle streaming data

Today’s data world is a streaming world. Streaming data has evolved from rare use cases, such as sensor data from the IoT and stock market feeds, to commonplace everyday sources, such as social media.
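As an illustration, the sketch below uses Spark Structured Streaming to land a social media feed in the lake continuously. The Kafka topic, broker address, and lake paths are assumptions, and the spark-sql-kafka connector is assumed to be on the classpath:

    # Continuously land a social media stream in the lake as Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-to-lake").getOrCreate()

    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
        .option("subscribe", "social_posts")               # assumed topic
        .load())

    # Persist the raw payload; parsing can happen later, on read.
    query = (events.selectExpr("CAST(value AS STRING) AS raw_event", "timestamp")
        .writeStream
        .format("parquet")
        .option("path", "/data/lake/raw/social_posts")
        .option("checkpointLocation", "/data/lake/_checkpoints/social_posts")
        .start())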

Fitting the task to the tool

A data warehouse works well for certain kinds of analytics. But when you are using Spark, MapReduce, or other new processing models, preparing data for analysis in a data warehouse can take more time than performing the analysis itself. In a data lake, these newer tools can process data efficiently without excessive preparation work. Integrating data involves fewer steps because data lakes don’t enforce a rigid metadata schema. Schema-on-read lets users apply a custom schema to the data at query time.
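The short PySpark sketch below shows schema-on-read in practice. The field names and path are illustrative assumptions; the point is that two users can read the same raw files with different schemas, each chosen at query time:

    # Schema-on-read: apply a schema when querying, not when loading.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # One analyst projects only the fields she cares about...
    sentiment_schema = StructType([
        StructField("post_id", StringType()),
        StructField("sentiment_score", DoubleType()),
    ])
    sentiment = spark.read.schema(sentiment_schema).json("/data/lake/raw/social_posts")

    # ...while another lets Spark sample the same files and infer everything.
    full = spark.read.json("/data/lake/raw/social_posts")
    full.printSchema()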

Easier accessibility

Data lakes also solve the data integration and accessibility challenges that plague data warehouses. Using a scale-out infrastructure, you can bring together ever-larger data volumes for analytics, or simply store them for some as-yet-undetermined future use. Unlike a monolithic view of a single enterprise-wide data model, the data lake allows you to put off modeling until you actually use the data, which creates opportunities for better operational insights and data discovery. This advantage only grows as data volumes, variety, and metadata richness increase.

Scalability

Big data is typically defined as the intersection of volume, variety, and velocity. Data warehouses are notorious for being unable to scale beyond a certain volume because of architectural restrictions, and data processing takes so long that organizations can’t exploit all of their data to its fullest extent. Petabyte-scale data lakes, by contrast, are cost-efficient and relatively simple to build and maintain at whatever scale is desired.

Drawbacks of Data Lakes

Despite the myriad technological and business benefits, building a data lake is complicated and different for every organization. It involves integration of many different technologies and requires technical skills that aren’t always readily available on the market—let alone on your IT team. Following are three key challenges organizations should be aware of when working to put an enterprise-grade data lake into production.

Visibility

Unlike data warehouses, data lakes don’t come with governance built in, and in early data lake use cases, governance was an afterthought, or not a thought at all. In fact, organizations frequently loaded data without attempting to manage it in any way. Although there are still situations in which you might want to take this approach (particularly since it is both fast and cheap), in most cases this type of data dump isn’t optimal: it ultimately produces a data swamp with poor visibility into data type, lineage, and quality, one that can’t be used confidently for data discovery and analytics. In cases where the data is not standardized, errors are unacceptable, and accuracy is a high priority, a data dump will greatly impede your efforts to derive value from the data. This is especially true as your data lake transitions from an add-on feature to a truly central aspect of your data architecture.

Governance

Metadata is not automatically applied when data is ingested into the data lake. Without the technical, operational, and business metadata that gives you information about the data you have, it is impossible to organize your data lake and apply governance policies. Metadata is what allows you to track data lineage, monitor and understand data quality, enforce data privacy and role-based security, and manage data life cycle policies. This is particularly critical for organizations in tightly regulated industries.
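One lightweight way to capture this at ingest time is to write a metadata record alongside each file. The sketch below records technical, operational, and business metadata as a JSON sidecar; the field names are illustrative, not a standard:

    # Capture metadata at ingest as a JSON sidecar next to the data file.
    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def write_metadata(data_file: Path, source_system: str, steward: str) -> Path:
        record = {
            # Technical metadata
            "file": data_file.name,
            "size_bytes": data_file.stat().st_size,
            "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
            # Operational metadata
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "source_system": source_system,
            # Business metadata
            "data_steward": steward,
            "retention": "7y",
        }
        sidecar = data_file.parent / (data_file.name + ".meta.json")
        sidecar.write_text(json.dumps(record, indent=2))
        return sidecar

A production data lake would push these records into a catalog rather than sidecar files, but even this much makes lineage and quality questions answerable.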

Data lakes must be designed to use metadata and to integrate with the existing metadata tools in the overall ecosystem, so that they can track how data is used and transformed outside of the data lake. If this isn’t done correctly, it can prevent a data lake from going into production.

Complexity

Building a big data lake environment is complex and requires integrating many different technologies. Settling on a strategy and architecture is also complicated: organizations must determine how to integrate existing databases, systems, and applications to eliminate data silos; how to automate and operationalize certain processes; how to broaden access to data to increase the organization’s agility; and how to implement and enforce enterprise-wide governance policies that keep data private and secure.

In addition, most organizations don’t have all of the skills in-house that are needed to successfully implement an enterprise-grade data lake project, which can lead to costly mistakes and delays.

About This Book

The rest of this book focuses on how to build a successful production data lake that accelerates business insight and delivers true business value. At Zaloni, through numerous data lake implementations, we have constructed a data lake reference architecture that ensures production-grade readiness. This book addresses many of the challenges that companies face when building and managing data lakes.

We discuss why an integrated approach to data lake management and governance is essential, and we describe the sort of solution needed to effectively manage an enterprise-grade lake. The book also delves into best practices for consuming the data in a data lake. Finally, we take a look at what’s ahead for data lakes.

[1] IDC. “Worldwide Semiannual Big Data & Analytics Spending Guide.” March 2017.

[2] Gartner. “Market Guide for Hadoop Distributions.” February 1, 2017.
