CHAPTER 12Creating Machine Learning Datasets Using SQL

In previous chapters, we introduced SQL concepts and walked through some analytical reporting examples, but we have not yet focused on the specifics of dataset design for predictive modeling applications. In this chapter, we'll discuss the development of datasets for two types of algorithms: classification and time series models.

A binary classification model predicts whether a record belongs to one category or another. For example, a heart disease classification model might analyze data from a patient's medical history to determine whether they're likely to develop heart disease, or not. A weather model could use past and current temperature, precipitation, pressure, and wind measurements, as well as those from surrounding geographic areas, to predict whether or not it will rain in the next 24 hours. In a retail scenario like a farmer's market, the seller may want to predict whether a customer will return to make another purchase within a certain time frame, or not.

In order to make predictions, the model needs to be trained. Binary classifiers are a type of supervised learning model, which means they are trained by passing example rows of data (also called instances, observations, or feature vectors) labeled with each of the possible outcomes into the algorithm, so it can detect patterns and identify characteristics that are more strongly associated with one result or the other. Some example instances are set aside to ...

Get SQL for Data Scientists now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.