Iterative data modeling to avoid dreaded ETL
Gain agility by loading first and transforming later.
In today’s world of big data, every large enterprise faces a similar problem: how do I leverage my data more effectively when it’s spread across dozens or even hundreds of systems? Businesses build mission-critical applications on relational databases filled with structured data, and they also have unstructured data to worry about (think patient notes, photos, reports, etc.). They want a better grasp of all this data, and they want to build new applications that leverage it to innovate and better serve their customers.
The ETL problem
Integrating data from various silos into a relational database requires significant investment in the extract, transform, load (ETL) phase of any data project. Before building an application that leverages integrated data, data architects must first reconcile all of the data in their source systems, finalizing the schema before the data can be ingested. This data modeling effort may take years. And additional effort is required every time a source system’s schema or an application requirement changes.
This approach is not agile, and in today’s world, it means that a business constantly plays catch-up. Not to mention, ETL tools and the work that goes into using them can eat up 60% of a project’s budget while providing little additional value (see the TDWI report, Evaluating ETL and Data Integration Platforms). The meaningful work of building an application and delivering actual value only begins after all the ETL work is complete.
The ELT solution
No, “ELT” is not a typo. Instead of ETL, the flexibility of a document database makes it possible to extract, load…and then transform (hence, “ELT”). This process, known as “schema-on-read” (instead of the traditional schema-on-write), lets you apply your own lens to the data when you read it back out. So instead of requiring a schema before you can do anything with your data, you can use the latent schema already present in the data and refine that schema later as needed.
That means taking all of your data from all of your systems, structured and unstructured, however it comes, and ingesting it as is. Developers can start using it immediately to build applications. By loading it into a database that supports different schemas and data types (document, RDF, geospatial, binary, SQL, and text), data architects don’t have to define the schema, type, or format up front and can focus instead on how to use that data down the line.
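To make the idea concrete, here is a minimal sketch in plain Python (not MarkLogic code) of “load as is, read through a lens.” The record shapes, field names, and the in-memory “database” are all hypothetical, stand-ins for documents loaded from two different source systems.

```python
import json

# Hypothetical source records arriving in different shapes: no shared,
# predefined schema is required in order to load them as-is.
raw_records = [
    {"PatientName": "Jane Doe", "DOB": "1984-03-02"},                    # from system A
    {"name": {"first": "John", "last": "Smith"}, "born": "1979-11-30"},  # from system B
]

# "Load" step: persist each record exactly as it arrived.
database = {f"/patients/{i}.json": json.dumps(r) for i, r in enumerate(raw_records)}

# "Schema-on-read": apply a lens when reading the data back out,
# mapping whatever shape was stored onto the view the application needs.
def patient_name(doc: dict) -> str:
    if "PatientName" in doc:
        return doc["PatientName"]
    name = doc.get("name", {})
    return f"{name.get('first', '')} {name.get('last', '')}".strip()

for uri, raw in database.items():
    print(uri, "->", patient_name(json.loads(raw)))
```

The point is that the two systems never had to agree on a schema before loading; the lens in `patient_name` is applied only when the application reads the data.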
Once the data is loaded, you can iteratively adjust it to address current requirements. Now you can transform that data, harmonize it, and make it usable for your business needs, as you need it. Over time, as requirements and downstream systems change, so might your data transformations. In part three of the recently released MarkLogic Cookbook, Dave Cassel illustrates a variety of ways to transform and harmonize data in MarkLogic after it has been loaded. In fact, he shows how you can transform around a given field as you load.
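As a rough illustration of post-load harmonization, consider the sketch below, again in plain Python rather than MarkLogic’s own APIs. The stored document, the `DOB` field, and the date format are assumptions; the idea is that the harmonization rule lives in one function that can be re-run over already-loaded documents whenever requirements change.

```python
import json
from datetime import datetime

# One stored document, loaded as-is from a source system (hypothetical shape).
stored = {"uri": "/patients/0.json",
          "doc": {"PatientName": "Jane Doe", "DOB": "03/02/1984"}}

def harmonize(doc: dict) -> dict:
    """Current harmonization rule: normalize birth dates to ISO-8601.
    As requirements evolve, this function changes; the raw load does not."""
    out = dict(doc)
    if "DOB" in out:
        out["DOB"] = datetime.strptime(out["DOB"], "%m/%d/%Y").date().isoformat()
    return out

# Re-apply the transform over loaded documents whenever the rules change.
stored["doc"] = harmonize(stored["doc"])
print(json.dumps(stored["doc"], indent=2))
```

The same kind of function can just as easily be applied at load time, which is the “transform around a given field as you load” case mentioned above.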
Of course, from a governance standpoint, you don’t want to actually change the data. You can use the MarkLogic Envelope Pattern to wrap newly harmonized data around the original data to preserve its original form. You can also transform the data that gets stored in indexes without physically changing the data stored in documents. And, finally, you can use the platform to implement a data-as-a-service pattern, transforming data on export as it is accessed by downstream applications.
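Here is a rough sketch of the envelope idea, using a plain Python dictionary to stand in for a stored document. The section names (`headers`, `canonical`, `source`) and the export function are illustrative assumptions, not MarkLogic’s actual API; what matters is that the harmonized view is wrapped around the untouched original.

```python
import json

# Envelope pattern sketch: the harmonized ("canonical") view is wrapped
# around the original content, so the original form is preserved.
original = {"PatientName": "Jane Doe", "DOB": "03/02/1984"}

envelope = {
    "headers": {"source-system": "system-A"},                      # illustrative metadata
    "canonical": {"name": "Jane Doe", "birthDate": "1984-03-02"},  # harmonized view
    "source": original,                                            # original, unchanged
}

# Downstream applications read the canonical section; audits read the source.
print(json.dumps(envelope, indent=2))

# Data-as-a-service flavor: transform on export for a specific consumer,
# without altering what is stored.
def export_for_billing(env: dict) -> dict:
    return {"patient": env["canonical"]["name"], "dob": env["canonical"]["birthDate"]}

print(export_for_billing(envelope))
```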
Data modeling does not have to be an up-front activity; it can be an iterative one that evolves as business needs evolve. This iterative data modeling approach allows large enterprises to respond faster to their business needs, reap the benefits of their data, and cut significant costs.
This post is a collaboration between O’Reilly and MarkLogic. See our statement of editorial independence.