Chapter 31. Five Best Practices for Stable Data Processing

Christian Lauer

The following five best practices are the basics when it comes to implementing data processes such as ELT or ETL.

Prevent Errors

In case of failure, a rollback should be done, just as with a failed SQL transaction. If a job aborts with errors, all of its changes should be rolled back. Otherwise, only part of the data (say, X%) will have been transferred, and the rest will be missing. Finding out which data is missing will be very hard.
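As a minimal sketch of this all-or-nothing behavior, the following uses Python's built-in `sqlite3` module with an in-memory database. The table name `target`, its columns, and the helper `load_rows` are hypothetical and chosen only for illustration; a real job would apply the same idea to whatever target system it writes to.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert all rows in one transaction; undo everything on any error.

    `load_rows` and the `target` table are illustrative assumptions.
    """
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.executemany(
                "INSERT INTO target (id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
)

good_batch = [(1, 10), (2, 20)]
bad_batch = [(3, 30), (4, None)]  # the second row violates NOT NULL

ok_first = load_rows(conn, good_batch)
ok_second = load_rows(conn, bad_batch)

# Row 3 was inserted before the error but is rolled back with the batch.
row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

After the failed batch, the table still holds only the two rows from the first batch, so a rerun can start from a known state instead of a partially loaded one.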

Set Fair Processing Times

How long does a job take to process X data rows? The answer provides important insights about the process. How often and how long does a process have to run? How current can I guarantee the data will be for my department? What happens when data has to be reloaded?
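One way to start answering these questions is to time a job on a known number of rows and extrapolate. The sketch below is illustrative only: `measure_throughput`, `estimate_runtime_seconds`, and the stand-in job `fake_job` are hypothetical names, and a linear extrapolation assumes that processing time scales roughly with row count, which may not hold for every job.

```python
import time


def measure_throughput(process_batch, rows):
    """Time one call of `process_batch` and report rows per second."""
    start = time.perf_counter()
    process_batch(rows)
    elapsed = time.perf_counter() - start
    return {
        "rows": len(rows),
        "seconds": elapsed,
        # Guard against a zero reading on very fast runs.
        "rows_per_second": len(rows) / elapsed if elapsed > 0 else float("inf"),
    }


def estimate_runtime_seconds(total_rows, rows_per_second):
    """Extrapolate linearly: an assumption, not a guarantee."""
    return total_rows / rows_per_second


def fake_job(rows):
    """Stand-in for a real transformation step (hypothetical)."""
    return [row * 2 for row in rows]


stats = measure_throughput(fake_job, list(range(10_000)))

# With a measured rate of 5,000 rows per second, a full reload of
# 1,000,000 rows would take roughly 200 seconds.
reload_estimate = estimate_runtime_seconds(1_000_000, 5_000)
```

Recording these numbers on every run also makes it easier to notice when a job slows down as data volume grows.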

Use Data-Quality Measurement Jobs

Are my source and target systems in sync? How can I be sure that all data has been transferred? Here, I recommend building up a monitoring strategy. It’s always a good idea to measure data quality and to detect errors quickly; otherwise, consumers can lose trust in the data.
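A simple measurement job can compare row counts and a column sum between source and target, then list the keys that did not arrive. The following is a minimal sketch using two in-memory `sqlite3` databases; the `orders` table, its columns, and the helpers `table_fingerprint` and `missing_keys` are hypothetical. Table and column names are interpolated into the SQL, so this is only suitable for trusted, hard-coded identifiers.

```python
import sqlite3


def table_fingerprint(conn, table, value_column):
    """Return (row count, sum of value_column) as a cheap comparison."""
    query = f"SELECT COUNT(*), COALESCE(SUM({value_column}), 0) FROM {table}"
    return conn.execute(query).fetchone()


def missing_keys(source, target, table, key_column):
    """Return keys present in the source but absent from the target."""
    query = f"SELECT {key_column} FROM {table}"
    source_keys = {row[0] for row in source.execute(query)}
    target_keys = {row[0] for row in target.execute(query)}
    return sorted(source_keys - target_keys)


def make_db(rows):
    """Build an in-memory database with an illustrative `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, cents INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn


source = make_db([(1, 1000), (2, 2000), (3, 3000)])
target = make_db([(1, 1000), (2, 2000)])  # row 3 was never transferred

source_fp = table_fingerprint(source, "orders", "cents")
target_fp = table_fingerprint(target, "orders", "cents")
in_sync = source_fp == target_fp
lost = missing_keys(source, target, "orders", "id")
```

Counts and sums are cheap but coarse; they can miss offsetting errors, so a stricter check (for example, per-row hashes) may be worth the extra cost for critical tables.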

Ensure Transaction Security

When using database-replication software in your process (for example, AWS Database Migration Service, or DMS) instead of a direct connection between system A and system B, you can run into trouble. I once had a replication job that loaded data from table A and table B at the same time. Both were further ...
