Chapter 16. Model Selection

So far when we fit models, we have used a few strategies to decide which features to include:

  • Assess model fit with residual plots.

  • Connect the statistical model to a physical model.

  • Keep the model simple.

  • Compare improvements in the standard deviation of the residuals and in the MSE between increasingly complex models.

For example, when we examined the one-variable model of upward mobility in Chapter 15, we found curvature in the residual plot. Adding a second variable greatly improved the fit in terms of average loss (MSE and, relatedly, multiple R²), but some curvature remained in the residuals. A seven-variable model made little improvement over the two-variable model in terms of a decrease in MSE, so although the two-variable model still showed some patterns in the residuals, we opted for this simpler model.
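The comparison described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the data are synthetic (not the upward mobility data), the response is constructed so that only the first two of seven features matter, and the helper name fit_mse is hypothetical. It fits nested least squares models with one, two, and seven features and reports the training MSE and R² for each.

```python
import numpy as np

def fit_mse(X, y):
    """Fit least squares with an intercept; return (MSE, R^2) on the same data.

    Hypothetical helper for illustration only.
    """
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    mse = np.mean(resid ** 2)
    r2 = 1 - mse / np.var(y)  # 1 - SSE/SST
    return mse, r2

# Synthetic data (an assumption): seven candidate features,
# but the response depends only on the first two.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 7))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Fit increasingly complex nested models: 1, 2, and 7 features.
results = {k: fit_mse(X[:, :k], y) for k in (1, 2, 7)}
for k, (mse, r2) in results.items():
    print(f"{k} feature(s): MSE = {mse:.3f}, R^2 = {r2:.3f}")
```

In a setup like this, the MSE should drop sharply from the one-feature to the two-feature model and only slightly from two features to seven. Because the models are nested, the training MSE can never increase as features are added, so a small decrease alone is weak evidence that the extra features are worth keeping.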

As another example, when we model the weight of a donkey in Chapter 18, we will take guidance from a physical model. We’ll ignore the donkey’s appendages and draw on the similarity between a barrel and a donkey’s body to begin fitting a model that explains the donkey’s weight by its length and girth (comparable to a barrel’s height and circumference). We’ll then continue to adjust that model by adding categorical features related to the donkey’s physical condition and age, collapsing categories, and excluding other possible features to keep the model simple.

The decisions we make in building these models are based on judgment calls, and in ...
