Chapter 8. Selecting and Debugging XGBoost Models

The ways that data scientists measure a model’s real-world performance are usually inadequate. According to “Underspecification Presents Challenges for Credibility in Modern Machine Learning”, penned by 40 researchers at Google and other leading machine learning research institutions, “ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains.” A fundamental issue is that we measure performance like we’re writing research papers, no matter how complex and high-risk the deployment scenario. Test data measurements like accuracy or area under the curve (AUC) don’t tell us much about fairness, privacy, security, or stability. These simple measurements of prediction quality or error on static test sets are not informative enough for risk management. They are only correlated with real-world performance, and don’t guarantee good performance in deployment. Put plainly, we should be more concerned with in vivo performance and risk management than in silico test data performance, because a primary thrust of the applied practice of ML is to make good decisions in the real world.

This chapter will introduce several methods that go beyond traditional model assessment to select models that generalize better, and that push models to their limits to find hidden problems and failure modes. The chapter starts with a concept refresher, puts forward an enhanced process for model selection, and then focuses on ...
