Chapter 4. Key Principles of a DataOps Ecosystem

Having worked with dozens of Global 2000 customers on their data/analytics initiatives, I have seen a consistent pattern in the key principles of a DataOps ecosystem, one that stands in stark contrast to the traditional “single vendor,” “single platform” approaches advocated by vendors such as Palantir, Teradata, IBM, Oracle, and others. An open, best-of-breed approach is more difficult, but it is also much more effective in the medium and long term. It represents a winning strategy for the chief data officer, chief information officer, and CEO who believe in maximizing the reuse of quality data in the enterprise, and it avoids the oversimplified trap of writing a massive check to a single vendor in the belief that there will be “one throat to choke.”

There are certain key principles of a DataOps ecosystem that we see at work every day in a large enterprise. A modern DataOps infrastructure/ecosystem should be and do the following:

  • Highly automated

  • Open

  • Take advantage of best of breed tools

  • Use Table(s) In/Table(s) Out protocols

  • Have layered interfaces

  • Track data lineage

  • Feature deterministic, probabilistic, and humanistic data integration

  • Combine both aggregated and federated methods of storage and access

  • Process data in both batch and streaming modes

I’ve provided more detail on each of these principles in the rest of this chapter.

Highly Automated

The scale and scope of data in the enterprise have surpassed the ability of bespoke human effort to catalog, move, and organize it. Automating your data infrastructure and applying the principles of highly engineered systems—design for operations, repeatability, automated testing, and release of data—is critical to keeping up with the dramatic pace of change in enterprise data. The principles at work in automating the flow of data from sources to consumption are very similar to those that drove the automation of the software build, test, and release process in DevOps over the past 20 years. This is one of the key reasons we call the overall approach “DataOps.”
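To make this concrete, here is a minimal sketch, in Python with pandas, of the kind of automated quality gate that might sit in front of a data release, analogousous to a unit test gating a software build. The table path, column names, and thresholds are hypothetical, not prescriptive.

import pandas as pd

# Hypothetical quality gates for a "customers" table; names and thresholds
# are illustrative only.
def validate_customers(df: pd.DataFrame) -> list:
    """Return a list of human-readable failures; an empty list means the table passes."""
    failures = []
    if df.empty:
        failures.append("table is empty")
    if df["customer_id"].duplicated().any():
        failures.append("duplicate customer_id values")
    if df["email"].isna().mean() > 0.05:
        failures.append("more than 5% of email values are missing")
    return failures

# In an automated pipeline, any failure blocks promotion of the table to
# downstream consumers, just as a failing test blocks a software build.
if __name__ == "__main__":
    candidate = pd.read_parquet("staging/customers.parquet")  # hypothetical path
    problems = validate_customers(candidate)
    if problems:
        raise SystemExit("release blocked: " + "; ".join(problems))
    print("customers table passed automated checks")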

Open

The best way to describe this is to talk about what it is not. The primary characteristic of a modern DataOps ecosystem is that it is not a single proprietary software artifact or even a small collection of artifacts from a single vendor. For decades, B2B software companies have been in the business of trying to get their customers “hooked” on a software artifact by building it into their infrastructure—only for those customers to find that the artifact is inadequate in certain areas. And because the software was built with proprietary assumptions, it is impossible to augment or replace it with other software artifacts that represent “best of breed” for that function.

In the next phase of data management in the enterprise, it would be a waste of time for customers to “sell their data souls” to single vendors that promote proprietary platforms. The ecosystem in DataOps should resemble DevOps ecosystems in which there are many best of breed free and open source software (FOSS) and proprietary tools that are expected to interoperate via APIs. An open ecosystem results in better software being adopted broadly—and offers the flexibility to replace, with minimal disruption to your business, those software vendors that don’t produce better software.

Best of Breed

Closely related to having an open ecosystem is embracing technologies and tools that are best of breed—meaning that each key component of the system is built for purpose, providing a function that is the best available at a reasonable cost. As the tools and technology that the large internet companies built to manage their data go mainstream, the enterprise has been flooded with a set of tools that are powerful, liberating (from the traditional proprietary enterprise data tools), and intimidating.

Selecting the right tools for your workload is difficult because of the massive heterogeneity of data in the enterprise, and also because of the dysfunction of software sales and marketing organizations that all overpromote their own capabilities (call it extreme data software vendor hubris).

It all sounds the same on the surface, so the only way to really figure out what systems are capable of is to try them (or to take the word of a proxy: a real customer who has worked with the vendor to deliver value from a production system). This is why people such as Mark Ramsey, formerly of GSK, are a powerful example. Mark’s effort to build an ecosystem of more than 12 best-of-breed vendors and combine their solutions to manage data as an asset at scale is truly unique and a good reference for what works and what does not.

Table(s) In/Table(s) Out Protocol

The next logical questions to ask if you embrace best of breed are: “How will these various systems/tools communicate? And what is the protocol?” Over the past 20 years, I’ve come to believe that, when it comes to interfaces between core systems, it’s best to focus on the lowest common denominator. In the case of data, this means tables—both individual tables and collections of tables.

I believe that Table(s) In/Table(s) Out is the primary method that should be assumed when integrating these various best of breed tools and software artifacts. Tables can be shared or moved using many different methods described under Data Services. A great reference for these table-oriented methods is the popularity of Resilient Distributed Datasets (RDDs) and DataFrames in the Spark ecosystem. Using service-oriented methods for these interfaces is critical, and the thoughtful design of these services is a core component of a functional DataOps ecosystem. Overall, we see a pattern of three key types of interfaces that are required or desired inside of these systems.
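To illustrate the protocol, here is a minimal PySpark sketch of a processing step written in a Tables In/Tables Out style: it accepts DataFrames as inputs and returns a DataFrame as output, so it can be composed with any other step that honors the same contract. The storage locations, table names, and columns are hypothetical.

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tables-in-tables-out").getOrCreate()

def enrich_orders(orders: DataFrame, customers: DataFrame) -> DataFrame:
    """Tables in (orders, customers); table out (orders enriched with customer region)."""
    return (
        orders.join(customers.select("customer_id", "region"),
                    on="customer_id", how="left")
              .withColumn("order_month", F.date_trunc("month", F.col("order_date")))
    )

# Each step reads tables and returns tables; the tables themselves are the contract.
orders = spark.read.parquet("s3a://example-bucket/raw/orders/")        # hypothetical location
customers = spark.read.parquet("s3a://example-bucket/raw/customers/")  # hypothetical location
enriched = enrich_orders(orders, customers)
enriched.write.mode("overwrite").parquet("s3a://example-bucket/curated/orders_enriched/")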

Three Core Styles of Interfaces for Components

There are many personas who want to use data in a large enterprise. Some “power users” need to access data in its raw form, whereas others just want answers to well-formulated questions. A layered set of services and design patterns is required to satisfy all of these users over time.

Here are the three methods that we think are most useful (see also Figure 4-1):

  • Data access services that are “View” abstractions over the data and are essentially SQL or SQL-like interfaces. This is the power-user level that data scientists prefer.

  • Messaging services that provide the foundation for stateful data interchange, event processing, and data interchange orchestration.

  • REST services built on or wrapped around APIs providing the ultimate flexible direct access to and interchange of data.

Figure 4-1. The layered set of data access services, messaging services, and REST services
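As a small illustration of this layering, the sketch below uses SQLite and Flask purely for brevity: the same underlying data is exposed through a SQL view for power users and through a REST endpoint for everyone else. The database file, table, view, and route names are hypothetical, and a messaging layer such as Kafka would typically sit alongside these services in a production ecosystem.

import sqlite3
from flask import Flask, jsonify

DB_PATH = "analytics.db"  # hypothetical database file

# Data access layer: a SQL view that power users can query directly.
# Assumes a "customers" table with these columns already exists in the database.
with sqlite3.connect(DB_PATH) as conn:
    conn.execute("""
        CREATE VIEW IF NOT EXISTS v_active_customers AS
        SELECT customer_id, name, region
        FROM customers
        WHERE status = 'active'
    """)

# REST layer: a thin service wrapped around the same view for broader consumption.
app = Flask(__name__)

@app.route("/customers/active")
def active_customers():
    with sqlite3.connect(DB_PATH) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute("SELECT * FROM v_active_customers").fetchall()
    return jsonify([dict(row) for row in rows])

if __name__ == "__main__":
    app.run(port=8080)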

Tracking Data Lineage and Provenance

As data flows through a next-generation data ecosystem (see Figure 4-2), it is of paramount importance to properly manage this lineage metadata to ensure reproducible data production for analytics and machine learning. Having as much provenance/lineage for data as possible enables the reproducibility that is essential for any significant scale in data science practices or teams. Ideally, each version of a tabular input and output to a processing step is registered. In addition to tracking inputs and outputs to a data processing step, some metadata about what that processing step is doing is also essential. A focus on data lineage and processing tracking across the data ecosystem increases reproducibility and raises confidence in the data. It’s important to note that lineage/provenance is not absolute—there are many subtle levels of provenance and lineage, and it’s important to embrace the spectrum and the appropriate implementation (i.e., it’s more a style of your ecosystem than a component).

Figure 4-2. Data pipeline patterns
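A minimal sketch of what registering lineage for a single processing step might look like is shown below. A real ecosystem would record this in a metadata catalog or lineage service rather than printing it, and the step names, versions, and file paths here are hypothetical.

import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    step_name: str       # what the processing step does
    step_version: str    # version of the transformation code
    inputs: dict         # input table name -> content fingerprint
    outputs: dict        # output table name -> content fingerprint
    run_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def fingerprint(path: str) -> str:
    """Cheap content fingerprint of a file-backed table; a catalog would store richer versions."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:16]

# Register one step: inputs and outputs are recorded so the run is reproducible.
record = LineageRecord(
    step_name="enrich_orders",
    step_version="1.3.0",
    inputs={"orders": fingerprint("staging/orders.parquet"),
            "customers": fingerprint("staging/customers.parquet")},
    outputs={"orders_enriched": fingerprint("curated/orders_enriched.parquet")},
)
print(json.dumps(asdict(record), indent=2))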

Data Integration: Deterministic, Probabilistic, and Humanistic

When bringing data together from disparate silos, it’s tempting to rely on traditional deterministic approaches to engineer the alignment of data with rules and ETL. I believe that at scale—with many hundreds of sources—the only viable method of bringing data together is the use of machine-based models (probabilistic) + rules (deterministic) + human feedback (humanistic) to bind the schema and records together as appropriate in the context of both how the data is generated and (perhaps more importantly) how the data is being consumed.
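The sketch below illustrates the combination in miniature, using only the Python standard library: a deterministic rule decides exact matches, a probabilistic similarity score (standing in for a trained model) decides clear matches and non-matches, and ambiguous pairs are routed to a human review queue. The record fields and thresholds are hypothetical.

from difflib import SequenceMatcher

def match_records(a: dict, b: dict) -> str:
    """Classify a candidate pair as 'match', 'no_match', or 'needs_review'."""
    # Deterministic rule: identical normalized email is an automatic match.
    if a.get("email") and a["email"].lower() == b.get("email", "").lower():
        return "match"

    # Probabilistic stand-in: string similarity on name plus city.
    score = SequenceMatcher(None,
                            f"{a['name']} {a['city']}".lower(),
                            f"{b['name']} {b['city']}".lower()).ratio()
    if score >= 0.92:
        return "match"
    if score <= 0.60:
        return "no_match"

    # Humanistic layer: ambiguous pairs go to a review queue instead of being forced.
    return "needs_review"

pair = ({"name": "Acme Corp", "city": "Boston", "email": "ap@acme.com"},
        {"name": "ACME Corporation", "city": "Boston", "email": ""})
print(match_records(*pair))  # likely 'needs_review', which a data steward would resolve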

Combining Aggregated and Federated Storage

A healthy next-generation data ecosystem embraces data that is both aggregated and federated. Over the past 40-plus years, the industry has gone back and forth between federated and aggregated approaches to integrating data. It’s my strong belief that what is required in the modern enterprise is an overall architecture that embraces the idea that sources and intermediate storage of data will be a combination of both aggregated and federated data.

This adds a layer of complexity that was previously impractical to manage but is now achievable with modern design patterns. You always make trade-offs between performance and control when you aggregate versus federate. But over and over, I find that workloads across an enterprise (when considered broadly and holistically) require both aggregated and federated approaches. In your modern DataOps ecosystem, cloud storage methods can make this much easier. In fact, Amazon Simple Storage Service (Amazon S3) and Google Cloud Storage (GCS) specifically—when correctly configured as a primary storage mechanism—can give you the benefits of both aggregated and federated methods.
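As an illustrative sketch, assuming a Spark session already configured with the appropriate S3 and JDBC connectors, the same query layer can combine an aggregated table sitting in object storage with a federated table queried in place from an operational database. The bucket, connection details, and table names are hypothetical.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aggregated-plus-federated").getOrCreate()

# Aggregated: curated order history already copied into cloud object storage.
orders = spark.read.parquet("s3a://example-bucket/curated/orders/")

# Federated: customer master data queried in place from an operational database.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://crm-db.internal:5432/crm")  # hypothetical host
             .option("dbtable", "public.customers")
             .option("user", "readonly")
             .option("password", os.environ["CRM_PASSWORD"])  # supplied via secrets management
             .load())

# Consumers see one logical result regardless of where each table physically lives.
revenue_by_region = (orders.join(customers, "customer_id")
                           .groupBy("region")
                           .sum("order_total"))
revenue_by_region.show()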

Processing Data in Both Batch and Streaming Modes

The success of Kafka and similar design patterns has validated that a healthy next-generation data ecosystem includes the ability to simultaneously process data from source to consumption in both batch and streaming modes. With all the usual caveats about consistency, these design patterns can give you the best of both worlds—the ability to process batches of data as required and also to process streams of data that provide more real-time consumption.
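Here is a minimal sketch of that pattern using Spark Structured Streaming against a Kafka topic; the broker address, topic name, and storage paths are hypothetical, and the Spark-Kafka connector package must be available on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-and-streaming").getOrCreate()

KAFKA = {"kafka.bootstrap.servers": "broker-1:9092", "subscribe": "orders"}  # hypothetical

# Streaming mode: continuously consume new order events as they arrive.
stream = (spark.readStream.format("kafka").options(**KAFKA).load()
          .selectExpr("CAST(value AS STRING) AS payload"))
query = (stream.writeStream.format("parquet")
         .option("path", "s3a://example-bucket/landing/orders_stream/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/orders_stream/")
         .start())

# Batch mode: reprocess the full topic on demand, for example for backfills
# or end-of-day reconciliation against the streamed results.
batch = (spark.read.format("kafka").options(**KAFKA)
         .option("startingOffsets", "earliest")
         .option("endingOffsets", "latest")
         .load()
         .selectExpr("CAST(value AS STRING) AS payload"))
print(batch.count())

query.awaitTermination()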

Conclusion

Everything presented in this chapter is obviously high level and has an infinite number of technical caveats. After doing hundreds of implementations at large and small companies, I believe that it’s actually possible to do all of it within an enterprise, but not without embracing an open and best-of-breed approach. At Tamr, we’re in the middle of exercising all of these principles with our customers every day, across a diverse and compelling set of data engineering projects.
