Chapter 9. Data for Testing

In the preceding chapter, you saw how to replace one of the two dependencies in data pipeline testing: interfaces to external services. This gets you part of the way to cost-effective testing. This chapter covers how to replace the second dependency mentioned in Chapter 7: external data sources. Instead of using a live data source for testing, you’ll see how to replace it with synthetic data.

There are a lot of neat techniques in this chapter for creating synthetic data, but before you fire up your IDE, it’s important to assess whether replacing a data dependency with synthetic data is the right move. This chapter opens with guidance on choosing between live and synthetic data for testing, along with the benefits and challenges of each approach.

The remainder of the chapter focuses on different approaches to synthetic data generation. The approach I’ll cover first, manual data generation, is likely one you’ve used when creating a few rows of fake data for unit testing.
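As a minimal sketch of what manual data generation can look like, the following Python example hand-writes two rows and uses them in a unit test. The `normalize_record` function, the field names, and the sample values are all hypothetical, chosen only for illustration; they are not taken from this book's examples.

```python
# Hypothetical transformation under test (illustrative only).
def normalize_record(record):
    """Return a copy of the record with the email trimmed and lowercased."""
    return {
        "user_id": record["user_id"],
        "email": record["email"].strip().lower(),
    }


# Hand-written rows: one mixed-case email and one padded with whitespace.
MANUAL_ROWS = [
    {"user_id": 1, "email": "Ada@Example.com"},
    {"user_id": 2, "email": "  grace@example.com  "},
]


def test_normalize_record():
    results = [normalize_record(row) for row in MANUAL_ROWS]
    assert results[0]["email"] == "ada@example.com"
    assert results[1]["email"] == "grace@example.com"
    # The user_id field should pass through unchanged.
    assert [r["user_id"] for r in results] == [1, 2]
```

Each row is written by hand to target one specific behavior, which keeps the test easy to read but becomes tedious as the number of cases grows.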

What you learn from creating data manually will help you build accurate models for automated data generation, the approach I’ll cover next. You’ll also see how to use data generation libraries to customize data generators so that they provide the data characteristics needed for testing.
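To make the idea of an automated, customizable generator concrete, here is a small sketch that uses only Python's standard library. The `generate_rows` function, its parameters, and the row fields are assumptions made for illustration. A dedicated data generation library such as Faker could stand in for the random-string logic, but that is not shown here.

```python
import random
import string


def generate_rows(count, seed=0, missing_email_rate=0.0):
    """Generate `count` synthetic rows (hypothetical schema: user_id, email).

    `seed` makes the output reproducible across test runs, and
    `missing_email_rate` controls the fraction of rows with a missing email,
    so a test can request the data characteristics it needs.
    """
    rng = random.Random(seed)  # local generator; does not touch global state
    rows = []
    for user_id in range(1, count + 1):
        name = "".join(rng.choices(string.ascii_lowercase, k=8))
        if rng.random() < missing_email_rate:
            email = None
        else:
            email = f"{name}@example.com"
        rows.append({"user_id": user_id, "email": email})
    return rows


# Same seed, same data: failures can be reproduced exactly.
assert generate_rows(5, seed=42) == generate_rows(5, seed=42)

# Force every email to be missing to exercise null handling.
all_missing = generate_rows(3, missing_email_rate=1.0)
assert all(row["email"] is None for row in all_missing)
```

Seeding the generator is a deliberate choice in this sketch: random test data that cannot be regenerated makes failures hard to debug.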

Finally, you’ll see how to keep automated data generation models up to date with source data changes by linking data schemas with test generation code. This is a powerful approach that ...
