Chapter 1. The Disruption of Data Management

Data management is being disrupted because datafication is everywhere. Existing architectures can no longer be scaled up. Enterprises need a new data strategy. A paradigm shift and change of culture are needed too, because the centralized solutions that work today will no longer work in the future.

Technological trends are fragmenting the data landscape. New software delivery methodologies are increasing the speed of delivery, at the cost of increased data complexity. The rapid growth of data and intensive data consumption put operational systems under pressure. Lastly, there are privacy, security, and regulatory concerns.

The impact these trends have on data management is tremendous and forces the whole industry to rethink how data management must be conducted in the future. In this book, I will lay out a distinctive theory on data management, one that contrasts with how many enterprises have designed and organized their existing data landscape today. Before we come to this in Chapter 2, we need to agree on what data management is and why it is important. Next, we need to set the scene by looking at different trends. Then, finally, we will examine how current enterprise data architectures and platforms are designed and organized today.

Before we start, let me lay my cards out on the table. I have strong beliefs about what should be done within data management centrally and what can be done on a federated level. The distributed nature of future architectures inspired me to work on a new vision. Although data warehouses and lakes are excellent approaches for utilizing data, they weren’t designed for the increasingly rapid pace of tomorrow’s data consumption requirements. Machine learning, for example, is far more data-intensive, while the need for responsiveness and immediate action requires the architecture to be quickly reactive.

Before we continue, I would like to ask you to take a deep breath and put your biases aside. The need for data harmonization, bringing large amounts of data into a particular context, always remains, but we have to consider the scale at which we want to apply this discipline. In a highly distributed ecosystem, is it really best to bring all data together centrally before any user or application can consume it?

Data Management

The processes and procedures required to manage data are what we call data management. DAMA International’s Guide to the Data Management Body of Knowledge (DAMA-DMBOK) has a more extensive explanation of data management and uses the following definition: “data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their life cycles.”1 It is crucial to embed these disciplines deeply into your organization. Otherwise, you will lack insight and become ineffective, and your data will get out of control. Becoming data-driven—getting as much value as possible out of your data—will become a challenge. Analytics, for example, is worth nothing if you have low-quality data.

The activities and disciplines of data management are wide-ranging and cover multiple areas, some closely related to software architecture:2 the design and high-level structure of software needed to develop a system that meets business requirements and exhibits characteristics such as flexibility, scalability, feasibility, reusability, and security. It is important to understand that I have selected only the aspects of data management that are most relevant for managing a modern data architecture at scale. The areas I’ll refer to several times are:

Data architecture

Data architecture is the data master plan. It provides insight into the bigger picture of your architecture, including the blueprints, reference architectures, future state vision, and dependencies. Managing these helps organizations make decisions. The entire book revolves around data architecture in general, but the discipline and its activities will be covered fully in Chapters 2 and 3.

Data governance

Data governance activities involve implementing and enforcing authority and control over the management of data, including all of the corresponding assets. This area is described in more detail in Chapter 7.

Data modeling and design

Data modeling and design is about structuring and representing data within a specific context and specific systems. Discovering, designing, and analyzing data requirements are all part of this discipline. Some of these aspects will be discussed in Chapter 6.

Database management, data storage, and operations

Database management, data storage, and operations refer to the management of the database design, correct implementation, and support in order to maximize the value of data. Database management also includes database operations management. Some of these aspects will be discussed in Chapter 8.

Data security management

Data security management includes all disciplines and activities that provide secure authentication, authorization, and access to the data. These activities include prevention, auditing, and escalation-mitigating actions. This area is described in more detail in Chapter 7.

Data integration and interoperability

Data integration and interoperability include all the disciplines and activities for moving, collecting, consolidating, combining, and transforming data in order to move it efficiently from one context into another. Data interoperability is the capability to communicate, invoke functions, or transfer data among various applications in a way that requires little or no knowledge of the application characteristics. Data integration, on the other hand, is about consolidating data from different (multiple) sources into a unified view. This process is often supported by additional tools, such as replication and ETL (extract, transform, and load) tools. I consider this the most important area; it is described extensively in Chapters 3, 4, and 5.

Reference and master data management

Reference and master data management is about managing the critical data to make sure the data is accessible, accurate, secure, transparent, and trustworthy. This area is described in more detail in Chapter 9.

Data life cycle management

Data life cycle management refers to the process of managing data through its life cycle, from creation and initial storage until the time when the data becomes obsolete and is deleted. These activities are required for efficient use of resources and to meet legal obligations and customer expectations. Some disciplines of this area are described in Chapter 3.

Metadata management

Metadata management involves managing all of the data that classifies and describes the data. Metadata can be used to make the data understandable, ready for integration, and secure. Metadata can also be used to ensure data quality. This area is described in more detail in Chapter 10.

Data quality management

Data quality management includes all activities for managing the quality of the data to ensure the data can be used. Some disciplines of this area are described in Chapters 2 and 3.

Data warehousing, business intelligence, and advanced analytics management

Data warehousing, business intelligence, and advanced analytics management include all the activities that provide business insights and support decision making. This area is described in more depth in Chapter 8.

The area of DAMA-DMBOK that needs more work, and that inspired me to write this book, is data integration and interoperability. My observation is that this area hasn’t been well enough connected to metadata management. Metadata is scattered across many tools, applications, platforms, and environments, and its shapes and forms are diverse. The interoperability of metadata, the ability of two or more systems or components to exchange descriptive data about data, is underexposed, even though building and managing a large-scale architecture is very much about metadata integration. The area also isn’t well connected to data architecture. If metadata is utilized in the right way, you can see what data passes through, how it can be integrated, distributed, and secured, and how it connects to applications, business capabilities, and so on. There is limited documentation about this aspect in the field.

A concern I have is the view DAMA and many organizations hold on semantic consistency. To this day, attempts are still being made to unify all semantics to provide enterprise-wide consistency, the so-called “single version of the truth.” However, applications are always unique, and so is data. Designing applications involves a lot of implicit thinking: you are framed by the context you’re in. This context is inherited by the design of the application and finds its way into the data. We pass through this context when we move from conceptual design into logical application design and physical application design.3 This is essential to understand because it frames any future architecture. When data is moved across applications, a transformation step is always necessary. Even when all data is unified and stored centrally, a context switch still has to be made when consuming downstream. There is no escape from this data transformation dilemma! In the next chapter, I’ll connect back to this.
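
To make the transformation dilemma concrete, here is a minimal Python sketch. The record layout, field names, and target context are hypothetical; the point is only that a context switch (different names, units, and types) is unavoidable whenever data crosses an application boundary:

    # Hypothetical sketch of a context switch between two applications.
    from datetime import date

    # Record as modeled inside a (made-up) order-handling application
    order_app_record = {
        "ord_id": "A-1001",
        "cust": "C-42",
        "amt_cents": 1999,        # amount stored as integer cents
        "created": "2023-06-01",  # date stored as an ISO string
    }

    def to_finance_context(rec: dict) -> dict:
        """Translate the order record into the context a finance application expects."""
        return {
            "invoice_reference": rec["ord_id"],
            "customer_id": rec["cust"],
            "amount_eur": rec["amt_cents"] / 100,                # different unit
            "booking_date": date.fromisoformat(rec["created"]),  # different type
        }

    print(to_finance_context(order_app_record))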

Another view I see in many organizations is that data management should be central and must be connected to the strategic goals of the enterprise. Many organizations believe that operational costs can be reduced by centralizing all data and management activities. There’s also a deep assumption that a centralized platform can take away the pain of data integration for its consumers. Companies have invested heavily in their enterprise data platforms, which include data warehouses, data lakes, and service buses. The activities of master data management are strongly connected to these platforms because consolidating allows us to simultaneously improve the accuracy of our most critical data.

The centralized platform, and the centralized model that comes with it, is bound to fail because of disruptive trends such as analytics, cloud computing, new software development methodologies, real-time decisioning, and data monetization. Although we are aware of these trends, many companies fail to comprehend the impact they have on data management. Let’s examine the most important trends and determine their magnitude.

Analytics Is Fragmenting the Data Landscape

The most impactful trend is advanced analytics because it exploits data to make companies more responsive, competitive, and innovative. Why does advanced analytics disrupt the existing data landscape? When more data is available, the number of options and opportunities increases. Advanced analytics is about performing what-if analyses, projecting future trends, outcomes, or events, detecting hidden relations and behaviors, and automating decision making. Because of the recognized value and strategic benefits of advanced analytics, many methodologies, frameworks, and tools have been developed to use it in divergent ways. We’ve only scratched the surface of what artificial intelligence (AI), machine learning (ML), and natural language processing (NLP) will be capable of in the future.

A trend that has accelerated advanced analytics is open source. High-value open source software projects are becoming the mainstream standard.4 Open source made advanced analytics more popular because it removed the expensive licensing aspect of commercial vendors and let everybody learn from each other.

Open source also opened up the realm of specialized databases. Cassandra, HBase, MongoDB, Hive, and Redis, to name a few, disrupted the traditional database market by making it possible to store and analyze massive volumes of data. The result of all these new database possibilities is that the efficiency of building and developing new solutions increased dramatically. Now complex problems can be solved easily with a highly specialized database instead of having to use a traditional relational database and complex application logic. Many of these new database products are open source, which increased their popularity and usage.

The diversity and growth of advanced analytics and databases has resulted in two problems: data proliferation and data-intensiveness.

With data proliferation, the same data gets distributed across many applications and databases. An enterprise data warehouse using a relational database management system (RDBMS), for example, is not capable of performing a complex social network analysis. These types of use cases are better implemented with a specialized graph database.5
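
The following toy Python sketch, with a made-up follow graph, illustrates why this workload fits a graph model: a “friends of friends” question is a traversal along edges, whereas a relational model needs an extra self-join for every additional hop:

    # Toy illustration (not a benchmark): a social-network question as a traversal.
    from collections import deque

    follows = {  # hypothetical adjacency list: who follows whom
        "alice": ["bob", "carol"],
        "bob": ["dave"],
        "carol": ["dave", "erin"],
        "dave": [],
        "erin": ["alice"],
    }

    def reachable_within(start: str, hops: int) -> set:
        """People reachable from `start` in at most `hops` follow-edges."""
        seen, frontier = {start}, deque([(start, 0)])
        while frontier:
            person, depth = frontier.popleft()
            if depth == hops:
                continue
            for friend in follows.get(person, []):
                if friend not in seen:
                    seen.add(friend)
                    frontier.append((friend, depth + 1))
        return seen - {start}

    print(reachable_within("alice", 2))  # {'bob', 'carol', 'dave', 'erin'}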

Using a relational database for the central platform, and the restrictions that come with it, forces you to always export the data. Data thus leaves the central platform and must be distributed to other database systems. This further distribution and proliferation of data also introduces another problem. If data is scattered throughout the organization, it will be more difficult to find and judge its origin and quality. It also makes controlling the data much more difficult because the data can be distributed further as soon as it leaves the central platform.

The growth of analytical techniques means an accelerated growth of data-intensiveness: the read-versus-write ratio is changing significantly. Analytical models that are constantly retrained, for example, constantly read large volumes of data. This read aspect impacts application and database designs because we need to optimize for data readability. It could consequently mean that we need to duplicate data to relieve systems from the pressure of constantly serving out data. It could also mean that we need to duplicate data in order to preprocess it, because of the wide variety of use cases and the read patterns that come with them. Facilitating this high variety of read patterns while duplicating data and staying in control isn’t easy. A solution for this will be provided in Chapter 3.
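
A minimal sketch of what “duplicating data to optimize for readability” can look like, using made-up transactions: a read-optimized aggregate is rebuilt from the write-optimized store so that heavy read patterns no longer hit the operational system directly:

    # Hypothetical read-optimized duplicate of a write-optimized store.
    from collections import defaultdict

    # Write-optimized store: one row per transaction (made-up data)
    transactions = [
        {"customer": "C-1", "amount": 20.0},
        {"customer": "C-2", "amount": 5.0},
        {"customer": "C-1", "amount": 12.5},
    ]

    def refresh_read_model(rows):
        """Rebuild a read-optimized aggregate, e.g. on a schedule or per batch of events."""
        totals = defaultdict(float)
        for row in rows:
            totals[row["customer"]] += row["amount"]
        return dict(totals)

    # Analytical consumers read the duplicate instead of scanning the source system.
    read_model = refresh_read_model(transactions)
    print(read_model["C-1"])  # 32.5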

Speed of Software Delivery Is Changing

In today’s world, software-based services are at the core of a business, which means that new features and functionality must be delivered quickly. In response to the demands of more agility, new ideologies have emerged at companies like Amazon, Netflix, Facebook, Google, and Uber. These companies advanced their software development practice based on two beliefs.

The first belief is that software development (Dev) and information-technology operations (Ops) must be combined to shorten the systems-development life cycle and provide continuous delivery with high software quality. This is called DevOps. This methodology requires a new culture that embraces more autonomy, open communication, trust, transparency, and cross-discipline teamwork.

The second belief is about the size at which applications must be developed. Flexibility is expected to increase when applications are decomposed into smaller services. This development approach comes with several buzzwords: microservices, containers, Kubernetes, domain-driven design, serverless computing, etc. I won’t go into detail about every concept yet, but this software development evolution involves increased complexity and an increased demand to better control data.

The transformation of a monolithic application into a distributed application creates many challenges for data management. When applications are broken up into smaller pieces, the data is spread across different smaller components. Development teams must also transition from their (single) unique data stores, where they fully understand their data model and have all the data objects together, to a design where data objects are spread all over the place. This introduces several challenges, including increased network communication, data read replicas that need to be synchronized, consistency and referential integrity issues, and so on.
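
As a sketch of how decomposed services commonly keep their local copies in sync, consider the following in-memory Python example. There is no real message broker here, and the event names and fields are hypothetical; the pattern is that the owning service publishes immutable events and a consuming service builds its own read replica by replaying them:

    # Hypothetical, in-memory sketch of event-based replication between services.
    events = []  # stand-in for a durable log, such as a message broker topic

    def publish(event: dict) -> None:
        events.append(event)  # events are append-only facts and are never mutated

    # The order service (owner of the data) emits facts about what happened
    publish({"type": "OrderPlaced", "order_id": 1, "customer_id": "C-7", "total": 40.0})
    publish({"type": "OrderCancelled", "order_id": 1})

    # The shipping service maintains its own read replica by replaying the log
    def build_shipping_view(log):
        open_orders = {}
        for e in log:
            if e["type"] == "OrderPlaced":
                open_orders[e["order_id"]] = e
            elif e["type"] == "OrderCancelled":
                open_orders.pop(e["order_id"], None)
        return open_orders

    print(build_shipping_view(events))  # {} because order 1 was cancelled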

A shift in software development trends requires an architecture that allows more fine-grained applications to distribute their data. It also requires a new DataOps culture and a different design philosophy, with more emphasis on data interoperability, the capture of immutable events, reproducibility, and loose coupling. We will discuss this in more detail in Chapter 2.

Networks Are Getting Faster

Networks are becoming faster, and bandwidth increases year after year. I attended the Gartner Data and Analytics Summit in 2018 where Google demonstrated that it’s possible to move hundreds of terabytes of data in their cloud in less than a minute.

This movement of terabytes of data allows for an interesting approach: instead of bringing the computational power to the data, which has been the common best practice because of network limitations, we can now turn it around and bring the data to the computational power by distributing it. The network is no longer the bottleneck, so we can move terabytes of data quickly between environments to allow applications to consume and use data. This model becomes especially interesting as the SaaS and Machine Learning as a Service (MLaaS) markets become more popular. Instead of doing all the complex work in-house, we can use networks to provide the data to other parties.

This distribution pattern of copying (duplicating) data and bringing it toward the computational power in a different facility, such as the cloud, will fragment the data landscape even more, which again makes having a clear data management strategy more important than ever.

Privacy and Security Concerns Are a Top Priority

Data is inarguably key for organizations to optimize, innovate, and differentiate, but data has also started to reveal a darker side with unfriendly undertones. The Cambridge Analytica files and the 500 million hacked accounts at Marriott are striking examples of data privacy scandals and data breaches.6 Governments are increasingly getting involved, as every aspect of our personal and professional lives is now connected to the internet. The COVID-19 pandemic is expected to connect even more people, since many of us are forced to work from home.

The trends of massive data, more powerful advanced analytics, and faster distribution of data have triggered a debate about the dangers of data, raising ethical questions and discussions. As companies make mistakes and cross ethical lines, I expect governments to sharpen regulation by demanding more security, control, and insight. We have only scratched the surface of the true data privacy and data ethics problems. Regulation will force big companies to be transparent about what data is collected, what data is purchased, what data is combined, how data is analyzed, and what data is distributed (sold). Big companies need to start thinking about transparency, privacy-first approaches, and how to deal with big regulatory topics.

Dealing with regulation is a complex subject. Imagine situations in which several cloud environments and different SaaS services are used and data is scattered across them. Satisfying GDPR and CCPA is difficult because companies are required to have insight into and control over all personal data, regardless of where it is stored. Data governance and dealing with personal data are at the top of the agenda for many large companies.7
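
To illustrate the kind of insight regulators expect, here is a minimal, hypothetical sketch of a classification registry that records where personal data lives, so a question such as “which systems hold personal data for a given purpose?” can actually be answered. The systems, datasets, and purposes are made up:

    # Hypothetical registry of where personal data is stored and why.
    from dataclasses import dataclass

    @dataclass
    class ColumnClassification:
        system: str        # e.g., a SaaS app, a cloud database, an on-premises system
        dataset: str
        column: str
        is_personal: bool
        purpose: str

    registry = [
        ColumnClassification("crm-saas", "contacts", "email", True, "marketing"),
        ColumnClassification("billing-db", "invoices", "iban", True, "payments"),
        ColumnClassification("billing-db", "invoices", "amount", False, "payments"),
    ]

    def personal_data_locations(purpose: str):
        """All places where personal data is processed for a given purpose."""
        return [(c.system, c.dataset, c.column)
                for c in registry if c.is_personal and c.purpose == purpose]

    print(personal_data_locations("payments"))  # [('billing-db', 'invoices', 'iban')]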

These stronger regulatory requirements and data ethics will result in further restrictions, additional processes, and enhanced controls. Insights into where data originated and how data is distributed are crucial. Stronger internal governance is required. This trend of stronger control runs contrary to the methodologies for fast software development, which involve less documentation and fewer internal controls. It requires a different, more defensive, viewpoint on how data management is done internally. A large part of these concerns will be addressed in Chapter 7.

Operational and Transactional Systems Need to Be Integrated

The need to react faster to business events introduces new challenges. Traditionally, there has been a clear split between transactional (operational) applications and analytical applications, because transactional systems are generally not sufficient for delivering large amounts of data or constantly pushing out data. The best practice has always been to split the data strategy into two parts: operational transactional processing, and analytical data warehousing and big data processing.

At the same time, this clear split is becoming more obscure. Operational analytics, which focuses on predicting and improving the existing operational processes, is expected to work closely with both the transactional and analytical systems. The analytical results need to be integrated back into the operational system’s core so that insights become relevant in the operational context.

This trend requires a different integration architecture, one that connects both the operational and analytical systems at the same time. It also requires data integration to work at different velocities: at the velocity of the operational systems and at the velocity of the analytical systems. In this book you’ll explore the options for preserving historical data in the original operational context while making it simultaneously available to both operational and analytical systems.
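
A minimal sketch of serving two velocities from one stream of events follows. The event fields are hypothetical: an operational consumer updates its state immediately so the process can react, while an analytical sink appends the same events, preserving the history in its original operational context for later batch processing:

    # Hypothetical fan-out of one event stream to operational and analytical consumers.
    operational_state = {}   # low-latency view used by the running process
    analytical_log = []      # append-only history for reporting and model training

    def handle_event(event: dict) -> None:
        # Operational velocity: update state right away so the process can react
        operational_state[event["account"]] = event["balance"]
        # Analytical velocity: keep the full, original event for later analysis
        analytical_log.append(event)

    handle_event({"account": "A-1", "balance": 100.0, "ts": "2023-06-01T10:00:00"})
    handle_event({"account": "A-1", "balance": 80.0, "ts": "2023-06-01T10:05:00"})

    print(operational_state["A-1"])   # 80.0 (latest state)
    print(len(analytical_log))        # 2 (history preserved in original context)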

Data Monetization Requires an Ecosystem-to-Ecosystem Architecture

Many people consider their enterprise a single business ecosystem with clear demarcation lines, but this belief is changing.8 Companies are increasingly integrating their core business functionalities and services with third parties and their platforms. They are monetizing their data, making their APIs publicly available, and using open data at large.9

The consequence of these advancements is that data is distributed more often between environments and thus is more decentralized. When data is shared with other companies, or when cloud or SaaS solutions are used, it ends up in different places, which makes integration and data management more difficult. In addition, network bandwidth, connectivity, and latency issues arise when data isn’t distributed properly to the platform or environment where it is used. Pursuing a single public cloud strategy won’t solve these challenges. This means that if you want APIs and SaaS systems to work well and want to leverage public cloud capabilities, you must master data integration, which this book will teach you how to do.

The trends I cover here are major and will affect the way people use data and the way companies should organize their architectures. Data growth is accelerating, computing power is increasing, and analytical techniques are advancing. Data consumption is increasing, which means that data needs to be distributed quickly. Stronger data governance is required. Data management also must be decentralized due to trends like cloud, SaaS, and microservices. All of these factors have to be balanced with a short time to market, driven by strong competition. This risky combination challenges us to manage data in a completely different way.

Enterprises Are Saddled with Outdated Data Architectures

One of the biggest problems many enterprises are dealing with is getting value out of their current enterprise data architectures.10 The majority of all data architectures use a monolithic design—either an enterprise data warehouse or data lake—and manage and distribute data centrally. In a highly distributed environment, these architectures won’t fulfill future needs. Let’s look at some characteristics.

Enterprise Data Warehouse and Business Intelligence

The first-generation data architectures are based on data warehousing and business intelligence. The philosophy is that there is one central integrated data repository, containing years of detailed and stable data, for the entire organization. This architecture comes with some downsides.

Enterprise data unification is an incredibly complex process and takes many years to complete. Chances are relatively high that the meaning of data differs across different domains,11 departments, and systems. Data attributes can have the same names, but their meaning and definitions differ, so we either end up creating many variations or just accepting the differences and inconsistencies. The more data we add, and the more conflicts and inconsistencies in definitions that arise, the more difficult it will be to harmonize. Chances are you end up with a unified context that is meaningless to everybody. For advanced analytics, such as machine learning, leaving context out can be a big problem because if the data is meaningless, it is impossible to correctly predict the future.

Enterprise data warehouses (EDWs) behave like integration databases, as illustrated in Figure 1-1. They act as data stores for multiple data-consuming applications. This means that they are a point of coupling between all the applications that want to access it. Changes need to be carried out carefully because of the many cross dependencies between different applications. Some changes can also trigger a ripple effect of other changes. When this happens, you’ve created a big ball of mud.

Figure 1-1. Enterprise data warehouses typically have many coupling points, steps of integration, and dependencies.

Because of the data warehouse’s high degree of complexity and the single central team managing it, the lack of agility often becomes a concern. The increased waiting time makes people creative. Engineers, for example, might bypass the integration layer and directly map data from the staging layer to their data mart. Another developer might create a view to quickly combine data. This technical debt (future rework) will cause problems later. The architecture becomes more complex, and people lose insight into all the creativity and shortcuts introduced to ensure timely delivery.

Data warehouses are tightly coupled with the underlying chosen solution or technology, meaning that consumers requiring different read patterns must export data to other environments. As the vendor landscape changes and new types of databases pop up, warehouses are becoming more scattered and are forced to export data. This trend undermines the grand vision of efficiently using a single central repository and utilizing the underlying (expensive) hardware.

Life cycle management of historical data is often an issue. Data warehouses are seen as archives of truth, allowing operational systems to clean up irrelevant data, knowing data will be retained in the warehouse. For operational advanced analytics—something that emerged after data warehouses made an appearance—this might be a problem. Data has been transformed and is no longer recognizable for the operational use case. Or making it quickly available is difficult, given that many warehouses typically process data for many hours.

Warehouses often lack insight into ad hoc consumption and further distribution, especially when data is carried out of the ecosystem. With new regulation, data ownership and insight into the consumption and distribution of data is important because you need to be able to explain what personal data has been consumed by whom and for what purpose.

Data quality is often a problem as well, because who owns the data in the data warehouse? Who is responsible if source systems deliver corrupted data? I have examined situations where engineers took care of data quality issues themselves. In one instance, the engineers fixed the data in the staging layer so that it would load properly into the data warehouse. These fixes became permanent, and over time hundreds of additional scripts had to be applied before data processing could start. These scripts aren’t part of trustworthy ETL processes and don’t provide data lineage that can be traced back.

Given the total amount of data in the data warehouse, the years it took to develop it, the knowledge people have, and intensive business usage, a replacement migration will be a risky and time-consuming activity. Therefore many enterprises continue to use this architecture and feed their business reports, dashboards, and data-hungry applications from the data warehouse.12

Data Lake

As data volumes and the need for faster insights grew, engineers started to work on other concepts. Data lakes emerged as an alternative for access to raw and higher volumes of data.13 By providing data as is, without having to structure it first, any consumer can decide how to use it and how to transform and integrate it.

Data lakes, just like data warehouses, are considered centralized (monolithic) data repositories, but they differ from warehouses because they store data before it has been transformed, cleansed, and structured. Schemas therefore are often determined when reading data. This differs from data warehouses, which use a predefined and fixed structure. Data lakes also provide a higher data variety by supporting multiple formats: structured, semi-structured, and unstructured.

A bigger difference between data warehouses and data lakes is the underlying technology. Data warehouses are usually engineered with RDBMSs, while data lakes are commonly engineered with distributed databases or NoSQL systems. Using public cloud services is also a popular choice. Recently, distributed and fully managed cloud-scale databases,14 running on top of container infrastructure, have simplified the task of managing centralized data repositories at scale while adding advantages in elasticity and cost.15

Many of the lakes, as pictured in Figure 1-3, collect pure, unmodified, raw data from the original source systems. Dumping in raw application structures—exact copies—is fast and allows data analysts and scientists quick access. However, the complexity with raw data is that use cases always require reworking the data. Data quality problems have to be sorted out, aggregations are required, and enrichments with other data are needed to bring the data into context. This introduces a lot of repeatable work and is another reason why data lakes are typically combined with data warehouses. Data warehouses, in this combination, act like high-quality repositories of cleansed and harmonized data, while data lakes act like (ad hoc) analytical environments, holding a large variety of raw data to facilitate analytics.
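
The repeatable rework looks roughly like the following sketch, here using pandas on a made-up raw extract. Every consumer of the raw data ends up writing some variation of the same quality fixes, type conversions, and aggregations:

    # Hypothetical cleanup that each consumer of raw lake data repeats.
    import pandas as pd

    raw = pd.DataFrame({
        "customer_id": ["C-1", "C-2", None, "C-1"],
        "amount": ["10.5", "7", "3.2", "oops"],  # raw extract: everything arrives as strings
    })

    cleaned = (
        raw.dropna(subset=["customer_id"])                                          # quality fix
           .assign(amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"))  # type fix
           .dropna(subset=["amount"])
    )

    per_customer = cleaned.groupby("customer_id")["amount"].sum()  # aggregation for the use case
    print(per_customer)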

Figure 1-3. Data lakes are typically giant object-storage environments that pull raw data together from a variety of sources, from which data can be consumed downstream. In many cases, they are just a pool of tables, without any logical domain boundaries defined.

Designing data lakes, just like data warehouses, is a challenge. Gartner analyst Nick Heudecker tweeted that he sees a data lake implementation failure rate of more than 60%.16 Data lake implementations typically fail, in part, because of their immense complexity, difficult maintenance, and shared dependencies:

  • Data that is pulled into a data lake is often raw and likely a complex representation of all the different applications. It can include tens of thousands of tables, incomprehensible data structures, and technical values that are understood only by the application itself. Additionally, there is tight coupling with the underlying source systems, since the inherited structure is an identical copy. In a scenario of pulling in raw data, there is a real risk that data pipelines will break when sources start changing.

  • Analytical models in data lakes are often trained on both raw and harmonized data. It is not unthinkable that data engineers and data scientists end up doing the technical data plumbing themselves, creating and operating these data pipelines and models by hand or within their data science projects. Data lakes therefore carry substantial (operational) risks.

  • Data lakes are often a single platform and are shared by many different use cases. Due to their tight coupling, compatibility challenges, shared libraries, and configurations, these platforms are very hard to maintain.

These challenges are just a few reasons why the failure rate of big data projects is so high. Other reasons include management resistance, internal politics, lack of expertise, and security and governance challenges.

Centralized View

Data warehouses and lakes can be scaled up using techniques like metadata-driven ELT, data virtualization, cloud, distributed processing, real-time ingestion, machine learning for enrichments, and so on. But there is a far bigger problem: the centralized thinking behind these architectures. This includes centralized management, centralized data ownership, clustered resources, and central models that dictate that everybody must use the same terms and definitions. This centralization comes with another expensive price tag: by removing data professionals from business domains, we take away creativity and business insights. Teams are forced into constant cross-communication. It’s for good reason that modern tech companies are advocating domain-driven design, a software development approach first proposed by Eric Evans that includes widely accepted best practices and strategic, philosophical, tactical, and technical elements.

Summary

Data warehouses are here to stay because the need to harmonize data from different sources within a particular context will always remain. Patterns of decoupling by staging data won’t disappear, nor will the steps of cleansing, fixing, and transforming schemas. Any architectural style, whether from Bill Inmon, Ralph Kimball, or Data Vault modeling, can be applied well, depending on the needs of the use case. The same applies to data lakes: the need to process vast amounts of data in a distributed fashion for analytics won’t disappear soon.

However, we must consider how we want to manage and distribute our data at large. Big silos like enterprise data warehouses will become extinct because they are unable to scale. Tightly coupled integration layers, loss of context, and increased data consumption will force companies to look for alternatives. Data lake architectures, which pull data in raw, are the other extreme. Raw, polluted data, which can change any time, will prevent experiments and use cases from ever making it into production. Raw data itself carries a lot of repeatable work with it.

Scaled Architecture

The solution to these siloed-data complexity problems is a Scaled Architecture: a reference and domain-based architecture with a set of blueprints, designs, principles, models, and best practices that simplifies and integrates data management across the entire organization in a distributed fashion. What I envision is an architecture that brings all the data management areas much closer together by providing a consistent view of how to uniformly apply security, governance, master data management, metadata, and data modeling. It is an architecture that can work with a combination of multiple cloud providers and on-premises platforms while still giving you the control and agility you need. It abstracts complexity for teams by providing domain-agnostic and reusable building blocks, yet provides flexibility through a combination of different data delivery styles using a mix of technologies. The Scaled Architecture enables teams to turn data into value themselves, without help from a central team.

The Scaled Architecture you will discover in this book comes with a large set of data management principles. It requires you, for example, to identify and classify genuine and unique data, fix data quality at the source, administer metadata precisely, and draw boundaries carefully. When enterprises follow these principles, they empower their teams to distribute and use data quickly while staying decoupled. This architecture also comes with a governance model: engineers need to learn how to make good abstractions and data pipelines, while business data owners need to take accountability for their data and its quality, ensuring that the context is clear to everyone.
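
As a purely illustrative sketch, not the book’s prescribed implementation, the kind of administration these principles imply could look like this in Python: each genuine, unique dataset gets an accountable owner, a classification, and explicit quality expectations that consumers can rely on (all names here are made up):

    # Hypothetical registration record for a genuine, unique dataset.
    from dataclasses import dataclass, field

    @dataclass
    class DatasetRegistration:
        name: str
        owning_domain: str             # accountable business owner
        classification: str            # e.g., "confidential" or "public"
        unique_source: bool            # genuine, unique data rather than a copy
        quality_checks: list = field(default_factory=list)

    customers = DatasetRegistration(
        name="customers",
        owning_domain="sales",
        classification="confidential",
        unique_source=True,
        quality_checks=["customer_id is unique", "email is well formed"],
    )

    print(f"{customers.name} is owned by {customers.owning_domain}")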

1 The Body of Knowledge is developed by the DAMA community. It has a slow update cycle: the first release was in 2008, the second one in 2014.

2 If you want to learn more about software architecture, I encourage you to read Fundamentals of Software Architecture by Mark Richards and Neal Ford (O’Reilly, 2020).

3 The conceptual model is sometimes also called domain model, domain object model, or analysis object model.

4 Microsoft acquired GitHub, a popular code-repository service used by many developers and large companies. IBM has recognized the value of open source as well with their purchase of RedHat, a leading provider of open source solutions.

5 This Neo4j use case shows that a graph database is a better option for performing a social network analysis.

6 The New York Times has described the impact of the Marriott account hack.

7 Personal data is any information relating to an identified or identifiable natural person.

8 James Moore, author of The Death of Competition (Harper, 1997), defined a business ecosystem as “a collection of companies that work cooperatively and competitively to satisfy customer needs.”

9 Open data is data that can be used freely and is made publicly available. McKinsey sees that data monetization is changing the way business is done.

10 Data architecture here refers to the infrastructure and data and the schemas, integration, transformations, storage, and workflow required to enable the analytical requirements of the information architecture.

11 A domain is a field of study that defines a set of common requirements, terminology, and functionality for any software program constructed to solve a problem in the area of computer programming.

12 Dashboards are more visual and use a variety of chart types. Reports tend to be mainly tabular, but they may contain additional charts or chart components.

13 James Dixon, then chief technology officer at Pentaho, coined the term data lake.

14 Docker, Inc.—the leading company behind tools built around Docker—explains containers nicely.

15 Elasticity is the degree to which systems are able to adapt to workload changes by automatically provisioning and deprovisioning resources.

16 TechRepublic even says that 85% of all big data projects fail.
