Chapter 9. Simulation and Bootstrapping

The application of different techniques in the data scientist's toolkit depends critically on the nature of the data you're working with. Observational data arises from the normal, day-to-day, business-as-usual interactions at any company. In contrast, experimental data arises under well-designed experimental conditions, such as when you set up an A/B test. This type of data is most commonly used to infer causality or estimate the incrementality of a lever (Chapter 15).

A third type, simulated or synthetic data, is less well-known and occurs when a person re-creates the data generating process (DGP). This can be done either by making strong assumptions about it or by training a generative model on a dataset. In this chapter I will deal only with the former approach, but I'll recommend references at the end if you're interested in the latter.
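To make the first approach concrete, here is a minimal sketch of re-creating a DGP from strong assumptions. The linear model, its coefficients (2 and 3), and the noise scale are all assumptions chosen for illustration, not anything dictated by a real dataset:

```python
import numpy as np

# Hypothetical DGP: we *assume* y = 2 + 3*x + noise and generate data from it.
rng = np.random.default_rng(seed=42)

n = 1_000
x = rng.normal(loc=0.0, scale=1.0, size=n)      # assumed covariate distribution
noise = rng.normal(loc=0.0, scale=0.5, size=n)  # assumed noise distribution
y = 2.0 + 3.0 * x + noise                       # assumed structural equation

# Because we wrote the DGP ourselves, the true parameters (2, 3) are known,
# so any estimator can be benchmarked against the ground truth.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"estimated intercept={intercept:.2f}, slope={slope:.2f}")
```

Knowing the ground truth is exactly what makes simulated data useful: you can check how close an estimator gets to parameters you chose yourself.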

Simulation is a great tool for data scientists for several reasons:

Understanding an algorithm

No algorithm works universally well across datasets. Simulation allows you to single out different aspects of a DGP and understand the sensitivity of the algorithm to changes. This is commonly done with Monte Carlo (MC) simulations.
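A small Monte Carlo sketch of this idea follows. The setup is hypothetical: the DGP is a standard normal contaminated by a small fraction of heavy-tailed outliers, and the "algorithms" compared are the sample mean and sample median as estimators of the true center. The contamination rate, sample sizes, and seeds are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def simulate_once(n=200, contamination=0.05):
    """Draw one sample from a contaminated-normal DGP (true center = 0)."""
    outlier = rng.random(n) < contamination
    # 95% of observations from N(0, 1), 5% from a heavy-tailed N(0, 10)
    x = np.where(outlier, rng.normal(0, 10, n), rng.normal(0, 1, n))
    return x.mean(), np.median(x)

# Repeat the experiment many times to see how each estimator behaves
n_sims = 2_000
draws = np.array([simulate_once() for _ in range(n_sims)])
mse_mean, mse_median = (draws ** 2).mean(axis=0)  # true center is 0

print(f"MSE(mean)={mse_mean:.4f}, MSE(median)={mse_median:.4f}")
```

By varying one aspect of the DGP at a time (here, the contamination rate), you can map out exactly when each estimator breaks down.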

Bootstrapping

Many times you need to estimate the precision of an estimate without making distributional assumptions that simplify the calculations. Bootstrapping is a type of simulation that can help you out in such cases.
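As a sketch of the idea, the following estimates the standard error of a sample median by resampling with replacement. The exponential data, sample size, and number of bootstrap replicates are illustrative assumptions, not part of any real analysis:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
sample = rng.exponential(scale=2.0, size=500)  # stand-in for observed data

n_boot = 5_000
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    # Resample with replacement, same size as the original sample
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[b] = np.median(resample)

# Spread of the bootstrap distribution approximates the estimator's precision
se_median = boot_medians.std(ddof=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"bootstrap SE={se_median:.3f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

No parametric assumption about the data was needed; the empirical sample itself plays the role of the population.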

Levers optimization

There are ...
