Chapter 2. Data Representation Design Patterns

At the heart of any machine learning model is a mathematical function that is defined to operate on specific types of data only. At the same time, real-world machine learning models need to operate on data that may not be directly pluggable into that function. The mathematical core of a decision tree, for example, operates on boolean variables. This refers only to the mathematical core: decision tree machine learning software will typically also include functions to learn an optimal tree from data and ways to read in and process different types of numeric and categorical data. The function that underpins the tree itself (see Figure 2-1), however, takes boolean inputs and combines them with operations such as AND (&& in Figure 2-1) and OR (+ in Figure 2-1).

Figure 2-1. The heart of a decision tree machine learning model to predict whether or not a baby requires intensive care is a mathematical model that operates on boolean variables.

Suppose we have a decision tree to predict whether a baby will require intensive care (IC) or can be normally discharged (ND), and suppose that the decision tree takes as inputs two variables, x1 and x2. The trained model might look something like Figure 2-1.
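To make the idea of a boolean core concrete, here is a minimal Python sketch. It does not reproduce the tree in Figure 2-1: the split structure, the function names tree_predict and boolean_form, and the leaf assignments are all hypothetical and chosen only for illustration. The sketch shows that a small tree over two boolean inputs can be written either as nested branches or as an equivalent expression built from AND, OR, and NOT.

```python
from itertools import product


def tree_predict(x1: bool, x2: bool) -> str:
    """Hypothetical two-level tree (an assumption, not the tree in Figure 2-1)."""
    if x1:
        # Second split is only consulted on the x1 == True branch.
        return "IC" if x2 else "ND"
    return "ND"


def boolean_form(x1: bool, x2: bool) -> str:
    """The same hypothetical tree written as boolean expressions.

    Each leaf is an AND of the conditions along its path; leaves that
    share a label are combined with OR.
    """
    ic = x1 and x2
    nd = (x1 and not x2) or (not x1)
    # Exactly one of the two labels holds for any input.
    assert ic != nd
    return "IC" if ic else "ND"


# Check that both forms agree on every possible boolean input.
for x1, x2 in product([False, True], repeat=2):
    assert tree_predict(x1, x2) == boolean_form(x1, x2)
```

Both forms only make sense when x1 and x2 are booleans, which is why numeric or categorical inputs have to be turned into boolean conditions before they can reach this core.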

It is pretty clear that x1 and x2 need to be boolean variables in order for f(x1, x2) to ...
