Chapter 9. Simulation and Bootstrapping

The application of different techniques in the data scientist's toolkit depends critically on the nature of the data you're working with. Observational data arises from the normal, day-to-day, business-as-usual interactions at any company. In contrast, experimental data arises under well-designed experimental conditions, such as when you set up an A/B test. This type of data is most commonly used to infer causality or estimate the incrementality of a lever (Chapter 15).

A third type, simulated or synthetic data, is less well-known and occurs when a person re-creates the data generating process (DGP). This can be done either by making strong assumptions about it or by training a generative model on a dataset. In this chapter I will deal only with the former approach, but I'll recommend references at the end if you're interested in the latter.
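To make the first approach concrete, here is a minimal sketch of re-creating a DGP from strong assumptions. The linear model, its coefficients (2 and 3), and the noise scale are all assumptions chosen for illustration, not anything dictated by a real dataset:

```python
import numpy as np

# Hypothetical DGP: we *assume* y = 2 + 3*x + noise and generate data from it.
rng = np.random.default_rng(seed=42)

n = 1_000
x = rng.normal(loc=0.0, scale=1.0, size=n)      # assumed covariate distribution
noise = rng.normal(loc=0.0, scale=0.5, size=n)  # assumed noise distribution
y = 2.0 + 3.0 * x + noise                       # assumed structural equation

# Because we wrote the DGP ourselves, the true parameters (2, 3) are known,
# so any estimator can be benchmarked against the ground truth.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"estimated intercept={intercept:.2f}, slope={slope:.2f}")
```

Knowing the ground truth is exactly what makes simulated data useful: you can check how close an estimator gets to parameters you chose yourself.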

Simulation is a great tool for data scientists for several reasons:

Understanding an algorithm

No algorithm works universally well across datasets. Simulation allows you to single out different aspects of a DGP and understand the sensitivity of the algorithm to changes. This is commonly done with Monte Carlo (MC) simulations.
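A small Monte Carlo sketch of this idea follows. The setup is hypothetical: the DGP is a standard normal contaminated by a small fraction of heavy-tailed outliers, and the "algorithms" compared are the sample mean and sample median as estimators of the true center. The contamination rate, sample sizes, and seeds are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def simulate_once(n=200, contamination=0.05):
    """Draw one sample from a contaminated-normal DGP (true center = 0)."""
    outlier = rng.random(n) < contamination
    # 95% of observations from N(0, 1), 5% from a heavy-tailed N(0, 10)
    x = np.where(outlier, rng.normal(0, 10, n), rng.normal(0, 1, n))
    return x.mean(), np.median(x)

# Repeat the experiment many times to see how each estimator behaves
n_sims = 2_000
draws = np.array([simulate_once() for _ in range(n_sims)])
mse_mean, mse_median = (draws ** 2).mean(axis=0)  # true center is 0

print(f"MSE(mean)={mse_mean:.4f}, MSE(median)={mse_median:.4f}")
```

By varying one aspect of the DGP at a time (here, the contamination rate), you can map out exactly when each estimator breaks down.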

Bootstrapping

Many times you need to estimate the precision of an estimate without making distributional assumptions that simplify the calculations. Bootstrapping is a type of simulation that can help you out in such cases.
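As a sketch of the idea, the following estimates the standard error of a sample median by resampling with replacement. The exponential data, sample size, and number of bootstrap replicates are illustrative assumptions, not part of any real analysis:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
sample = rng.exponential(scale=2.0, size=500)  # stand-in for observed data

n_boot = 5_000
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    # Resample with replacement, same size as the original sample
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[b] = np.median(resample)

# Spread of the bootstrap distribution approximates the estimator's precision
se_median = boot_medians.std(ddof=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"bootstrap SE={se_median:.3f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

No parametric assumption about the data was needed; the empirical sample itself plays the role of the population.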

Levers optimization

There are ...
