Chapter 12. Change Data Capture
Raghotham Murthy
Change data capture (CDC) is a solution to a specific problem. You have your most valuable data in your production databases. You want to analyze that data, but you don’t want to add any more load to those production databases. Instead, you want to rely on a data warehouse or data lake. Once you decide that you want to analyze data from your production database in a different system, you need a reliable way to replicate that data from the production database to your data warehouse.
It turns out that, at scale, this is a hard problem to solve. You can’t simply copy all the data over from the production database to the warehouse: that would add a lot more load on the production database, especially if you want high fidelity. And if you fetched only the changed records, you would miss deletes, because a deleted row is no longer there to be fetched.
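To make the missed-delete problem concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table, its `updated_at` column, and the `incremental_sync` helper are hypothetical names chosen for illustration, and the integer timestamps stand in for real ones.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a small, hypothetical users table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    return db


def incremental_sync(source, replica, last_sync):
    """Copy rows changed after last_sync from source to replica.

    Only rows that still exist in the source can match the query,
    so deleted rows are never seen.
    """
    rows = source.execute(
        "SELECT id, name, updated_at FROM users WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()
    replica.executemany(
        "INSERT OR REPLACE INTO users (id, name, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    return len(rows)


source, replica = make_db(), make_db()
source.executemany(
    "INSERT INTO users VALUES (?, ?, ?)", [(1, "ada", 1), (2, "bob", 1)]
)
incremental_sync(source, replica, last_sync=0)

# At logical time 2, the source sees one update and one delete.
source.execute("UPDATE users SET name = 'ada l.', updated_at = 2 WHERE id = 1")
source.execute("DELETE FROM users WHERE id = 2")
copied = incremental_sync(source, replica, last_sync=1)

source_ids = [r[0] for r in source.execute("SELECT id FROM users ORDER BY id")]
replica_ids = [r[0] for r in replica.execute("SELECT id FROM users ORDER BY id")]
```

After the second sync, the update has reached the replica, but row 2 still exists there even though it is gone from the source. The two systems have silently diverged.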
Thankfully, all modern production databases write out a write-ahead log (WAL), or change log, as part of their normal transaction processing. This log captures every change to every row in every table in the database, and it can be used to create replicas of the production database. In CDC, a tool reads this write-ahead log and applies the changes to the data warehouse. This technique is a lot more robust than batch exports of the tables and has a low footprint on the production database.
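The read-and-apply loop can be sketched as follows. This is a simplified in-memory model, not how any particular CDC tool works: real tools decode the database's own log format, whereas here the `ChangeEvent` type, its fields (including `lsn` as a log position), and the `apply_changes` function are all assumptions made for illustration. The "warehouse" is just a nested dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a hypothetical, already-decoded change log."""
    lsn: int                    # position in the log (assumed to increase)
    op: str                     # "insert", "update", or "delete"
    table: str
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row contents; None for deletes


def apply_changes(replica, events, last_applied_lsn=0):
    """Apply events to the replica in log order.

    Events at or below last_applied_lsn are skipped, so replaying the
    same stretch of log twice leaves the replica unchanged.
    Returns the position of the last event applied.
    """
    for event in sorted(events, key=lambda e: e.lsn):
        if event.lsn <= last_applied_lsn:
            continue
        table = replica.setdefault(event.table, {})
        if event.op in ("insert", "update"):
            table[event.key] = dict(event.row)
        elif event.op == "delete":
            table.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
        last_applied_lsn = event.lsn
    return last_applied_lsn


log = [
    ChangeEvent(1, "insert", "users", 1, {"name": "ada"}),
    ChangeEvent(2, "insert", "users", 2, {"name": "bob"}),
    ChangeEvent(3, "update", "users", 1, {"name": "ada l."}),
    ChangeEvent(4, "delete", "users", 2),
]

replica = {}
checkpoint = apply_changes(replica, log)

# Replaying the same log from the saved checkpoint changes nothing.
checkpoint_after_replay = apply_changes(replica, log, checkpoint)
```

Because the delete is an explicit entry in the log, the replica removes row 2 instead of keeping a stale copy. Tracking the last applied position is one simple way to make replays safe after a restart.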
However, you have to treat CDC ...