Chapter 4. Making Predictions with Decision Trees and Decision Forests

Classification and regression are the oldest and most well-studied types of predictive analytics. Most algorithms you will likely encounter in analytics packages and libraries are classification or regression techniques, like support vector machines, logistic regression, neural networks, and deep learning. The common thread linking regression and classification is that both involve predicting one (or more) values given one (or more) other values. To do so, both require a body of inputs and outputs to learn from. They need to be fed both questions and known answers. For this reason, they are known as types of supervised learning.

PySpark MLlib offers implementations of a number of classification and regression algorithms. These include decision trees, naïve Bayes, logistic regression, and linear regression. The exciting thing about these algorithms is that they can help predict the future—or at least, predict the things we don’t yet know for sure, like the likelihood you will buy a car based on your online behavior, whether an email is spam given the words it contains, or which acres of land are likely to grow the most crops given their location and soil chemistry.

In this chapter, we will focus on a popular and flexible type of algorithm for both classification and regression (decision trees) and the algorithm’s extension (random decision forests). First, we will understand the basics of decision trees and forests and introduce the former’s PySpark implementation. The PySpark implementation of decision trees supports binary and multiclass classification, and regression. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances. This will be followed by preparation of our dataset and creating our first decision tree. Then we’ll tune our decision tree model. We’ll finish up by training a random forest model on our processed dataset and making predictions.

Although PySpark’s decision tree implementation is easy to get started with, it is helpful to understand the fundamentals of decision tree and random forest algorithms. That is what we’ll cover in the next section.

Decision Trees and Forests

Decision trees are a family of algorithms that can naturally handle both categorical and numeric features. Building a single tree can be done using parallel computing, and many trees can be built in parallel at once. They are robust to outliers in the data, meaning that a few extreme and possibly erroneous data points might not affect predictions at all. They can consume data of different types and on different scales without the need for preprocessing or normalization.

Decision tree–based algorithms have the advantage of being comparatively intuitive to understand and reason about. In fact, we all probably use the same reasoning embodied in decision trees, implicitly, in everyday life. For example, I sit down to have morning coffee with milk. Before I commit to that milk and add it to my brew, I want to predict: is the milk spoiled? I don’t know for sure. I might check if the use-by date has passed. If not, I predict no, it’s not spoiled. If the date has passed, but it was three or fewer days ago, I take my chances and predict no, it’s not spoiled. Otherwise, I sniff the milk. If it smells funny, I predict yes, and otherwise no.

This series of yes/no decisions leading to a prediction is what decision trees embody. Each decision leads to one of two results, which is either a prediction or another decision, as shown in Figure 4-1. In this sense, it is natural to think of the process as a tree of decisions, where each internal node in the tree is a decision, and each leaf node is a final answer.
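As a rough sketch, the milk example's reasoning could be written as nested conditions in plain Python (the function and argument names here are purely illustrative):

def is_spoiled(days_past_use_by, smells_funny):
    if days_past_use_by <= 0:      # use-by date hasn't passed: predict not spoiled
        return False
    if days_past_use_by <= 3:      # passed, but three or fewer days ago: take the chance
        return False
    return smells_funny            # otherwise, rely on the sniff test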

That is a simplistic decision tree and was not built with any rigor. To elaborate, consider another example. A robot has taken a job in an exotic pet store. It wants to learn, before the shop opens, which animals in the shop would make a good pet for a child. The owner lists nine pets that would and wouldn’t be suitable before hurrying off. The robot compiles the information found in Table 4-1 from examining the animals.

Figure 4-1. Decision tree: is milk spoiled?
Table 4-1. Exotic pet store “feature vectors”
Name         Weight (kg)   # Legs   Color   Good pet?
Fido         20.5          4        Brown   Yes
Mr. Slither  3.1           0        Green   No
Nemo         0.2           0        Tan     Yes
Dumbo        1390.8        4        Gray    No
Kitty        12.1          4        Gray    Yes
Jim          150.9         2        Tan     No
Millie       0.1           100      Brown   No
McPigeon     1.0           2        Gray    No
Spot         10.0          4        Brown   Yes

The robot can make a decision for the nine listed pets. There are many more pets available in the store. It still needs a methodology for deciding which animals among the rest will be suitable as pets for kids. We can assume that the characteristics of all animals are available. Using the decision data provided by the store owner and a decision tree, we can help the robot learn what a good pet for a kid looks like.

Although a name is given, it will not be included as a feature in our decision tree model. There is little reason to believe the name alone is predictive; “Felix” could name a cat or a poisonous tarantula, for all the robot knows. So, there are two numeric features (weight, number of legs) and one categorical feature (color) predicting a categorical target (is/is not a good pet for a child).

The way a decision tree works is by making one or more decisions in sequence based on provided features. To start off, the robot might try to fit a simple decision tree to this training data, consisting of a single decision based on weight, as shown in Figure 4-2.

Figure 4-2. Robot’s first decision tree

The logic of the decision tree is easy to read and makes some sense: animals weighing 500kg or more certainly sound unsuitable as pets. This rule predicts the correct value in five of nine cases. A quick glance suggests that we could improve the rule by lowering the weight threshold to 100kg. This gets six of nine examples correct. The heavy animals are now predicted correctly; the lighter animals are only partly correct.

So, a second decision can be constructed to further refine the prediction for examples with weights less than 100kg. It would be good to pick a feature that changes some of the incorrect Yes predictions to No. For example, there is one small green animal, sounding suspiciously like a snake, that will be classified by our current model as a suitable pet candidate. The robot could predict correctly by adding a decision based on color, as shown in Figure 4-3.

Figure 4-3. Robot’s next decision tree

Now, seven of nine examples are correct. Of course, decision rules could be added until all nine were correctly predicted. The logic embodied in the resulting decision tree would probably sound implausible when translated into common speech: “If the animal’s weight is less than 100kg, its color is brown instead of green, and it has fewer than 10 legs, then yes, it is a suitable pet.” While perfectly fitting the given examples, a decision tree like this would fail to predict that a small, brown, four-legged wolverine is not a suitable pet. Some balance is needed to avoid this phenomenon, known as overfitting.
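To make this concrete, here is a minimal sketch (plain Python, outside of MLlib) that applies the Figure 4-3 logic to Table 4-1 and counts the correct predictions:

# Table 4-1 as (name, weight_kg, legs, color, good_pet) tuples.
pets = [
    ("Fido", 20.5, 4, "Brown", True), ("Mr. Slither", 3.1, 0, "Green", False),
    ("Nemo", 0.2, 0, "Tan", True), ("Dumbo", 1390.8, 4, "Gray", False),
    ("Kitty", 12.1, 4, "Gray", True), ("Jim", 150.9, 2, "Tan", False),
    ("Millie", 0.1, 100, "Brown", False), ("McPigeon", 1.0, 2, "Gray", False),
    ("Spot", 10.0, 4, "Brown", True),
]

def predict_good_pet(weight_kg, color):
    if weight_kg >= 100:       # first decision: heavy animals are not good pets
        return False
    return color != "Green"    # second decision: small green animals are suspect

correct = sum(predict_good_pet(w, c) == label for _, w, _, c, label in pets)
print(f"{correct} of {len(pets)} correct")  # 7 of 9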

Decision trees generalize into a more powerful algorithm called random forests. A random forest combines many decision trees, each trained separately, to reduce the risk of overfitting. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions reduces the variance of the predictions, makes the resulting model more generalizable, and improves performance on test data.

This is enough of an introduction to decision trees and random forests for us to begin using them with PySpark. In the next section, we will introduce the dataset that we’ll work with and prepare it for use in PySpark.

Preparing the Data

The dataset used in this chapter is the well-known Covtype dataset, available online as a compressed CSV-format data file, covtype.data.gz, and accompanying info file, covtype.info.

The dataset records the types of forest-covered parcels of land in Colorado, USA. It’s only a coincidence that the dataset concerns real-world forests! Each data record contains several features describing each parcel of land—like its elevation, slope, distance to water, shade, and soil type—along with the known forest type covering the land. The forest cover type is to be predicted from the rest of the features, of which there are 54 in total.

This dataset has been used in research and even a Kaggle competition. It is an interesting dataset to explore in this chapter because it contains both categorical and numeric features. There are 581,012 examples in the dataset, which does not exactly qualify as big data but is large enough to be manageable as an example and still highlight some issues of scale.

Thankfully, the data is already in a simple CSV format and does not require much cleansing or other preparation to be used with PySpark MLlib. The covtype.data file should be extracted and copied into your local or cloud storage (such as AWS S3).

Start the pyspark shell. You may find it helpful to give the shell a healthy amount of memory to work with, as building decision forests can be resource intensive. If you have the memory, specify it when launching the shell, for example with pyspark --driver-memory 8g.

CSV files contain fundamentally tabular data, organized into rows of columns. Sometimes these columns are given names in a header line, although that’s not the case here. The column names are given in the companion file, covtype.info. Conceptually, each column of a CSV file has a type as well—a number, a string—but a CSV file doesn’t specify this.

It’s natural to parse this data as a dataframe because this is PySpark’s abstraction for tabular data, with a defined column schema, including column names and types. PySpark has built-in support for reading CSV data. Let’s read our dataset as a DataFrame using the built-in CSV reader:

data_without_header = spark.read.option("inferSchema", True)\
                      .option("header", False).csv("data/covtype.data")
data_without_header.printSchema()
...
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
 |-- _c3: integer (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: integer (nullable = true)
 ...

This code reads the input as CSV and does not attempt to parse the first line as a header of column names. It also requests that the type of each column be inferred by examining the data. It correctly infers that all of the columns are numbers, and, more specifically, integers. Unfortunately, it can name the columns only _c0 and so on.

We can look at the covtype.info file for the column names.

$ cat data/covtype.info

...
[...]
7.	Attribute information:

Given is the attribute name, attribute type, the measurement unit and
a brief description.  The forest cover type is the classification
problem.  The order of this listing corresponds to the order of
numerals along the rows of the database.

Name                                    Data Type
Elevation                               quantitative
Aspect                                  quantitative
Slope                                   quantitative
Horizontal_Distance_To_Hydrology        quantitative
Vertical_Distance_To_Hydrology          quantitative
Horizontal_Distance_To_Roadways         quantitative
Hillshade_9am                           quantitative
Hillshade_Noon                          quantitative
Hillshade_3pm                           quantitative
Horizontal_Distance_To_Fire_Points      quantitative
Wilderness_Area (4 binary columns)      qualitative
Soil_Type (40 binary columns)           qualitative
Cover_Type (7 types)                    integer

Measurement                  Description

meters                       Elevation in meters
azimuth                      Aspect in degrees azimuth
degrees                      Slope in degrees
meters                       Horz Dist to nearest surface water features
meters                       Vert Dist to nearest surface water features
meters                       Horz Dist to nearest roadway
0 to 255 index               Hillshade index at 9am, summer solstice
0 to 255 index               Hillshade index at noon, summer soltice
0 to 255 index               Hillshade index at 3pm, summer solstice
meters                       Horz Dist to nearest wildfire ignition point
0 (absence) or 1 (presence)  Wilderness area designation
0 (absence) or 1 (presence)  Soil Type designation
1 to 7                       Forest Cover Type designation
...

Looking at the column information, it’s clear that some features are indeed numeric. Elevation is an elevation in meters; Slope is measured in degrees. However, Wilderness_Area is something different, because it is said to span four columns, each of which is a 0 or 1. In reality, Wilderness_Area is a categorical value, not a numeric one.

These four columns are actually a one-hot or 1-of-N encoding. When this form of encoding is performed on a categorical feature, one categorical feature that takes on N distinct values becomes N numeric features, each taking on the value 0 or 1. Exactly one of the N values has value 1, and the others are 0. For example, a categorical feature for weather that can be cloudy, rainy, or clear would become three numeric features, where cloudy is represented by 1,0,0; rainy by 0,1,0; and so on. These three numeric features might be thought of as is_cloudy, is_rainy, and is_clear features. Likewise, 40 of the columns are really one Soil_Type categorical feature.
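As a quick illustration (plain Python, using the made-up weather feature from above), one-hot encoding turns each possible category into its own 0/1 column:

categories = ["cloudy", "rainy", "clear"]

def one_hot(value):
    # one 0/1 value per possible category; exactly one of them is 1
    return [1 if value == category else 0 for category in categories]

one_hot("cloudy")  # [1, 0, 0] -> is_cloudy, is_rainy, is_clear
one_hot("rainy")   # [0, 1, 0]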

This isn’t the only possible way to encode a categorical feature as a number. Another possible encoding simply assigns a distinct numeric value to each possible value of the categorical feature. For example, cloudy may become 1.0, rainy 2.0, and so on. The target itself, Cover_Type, is a categorical value encoded as a value 1 to 7.

Be careful when encoding a categorical feature as a single numeric feature. The original categorical values have no ordering, but when encoded as a number, they appear to. Treating the encoded feature as numeric leads to meaningless results because the algorithm is effectively pretending that rainy is somehow greater than, and two times larger than, cloudy. It’s OK as long as the encoding’s numeric value is not used as a number.

We have seen both types of encodings of categorical features. It would have, perhaps, been simpler and more straightforward to not encode such features (and in two ways, no less) and to instead simply include their values directly, like “Rawah Wilderness Area.” This may be an artifact of history; the dataset was released in 1998. For performance reasons or to match the format expected by libraries of the day, which were built more for regression problems, datasets often contain data encoded in these ways.

In any event, before proceeding, it is useful to add column names to this DataFrame to make it easier to work with:

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col

colnames = ["Elevation", "Aspect", "Slope", \
            "Horizontal_Distance_To_Hydrology", \
            "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", \
            "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", \
            "Horizontal_Distance_To_Fire_Points"] + \ 1
[f"Wilderness_Area_{i}" for i in range(4)] + \
[f"Soil_Type_{i}" for i in range(40)] + \
["Cover_Type"]

data = data_without_header.toDF(*colnames).\
                          withColumn("Cover_Type",
                                    col("Cover_Type").cast(DoubleType()))

data.head()
...
Row(Elevation=2596,Aspect=51,Slope=3,Horizontal_Distance_To_Hydrology=258,...)
1

+ concatenates collections.

The wilderness- and soil-related columns are named Wilderness_Area_0, Soil_Type_0, etc., and a bit of Python can generate these 44 names without having to type them all out. Finally, the target Cover_Type column is cast to a double value up front, because it will actually be necessary to consume it as a double rather than int in all PySpark MLlib APIs. This will become apparent later.

You can call data.show() to see some rows of the dataset, but the display is so wide that it will be difficult to read it all. data.head() displays it as a raw Row object, which will be more readable in this case.

Now that we’re familiar with our dataset and have processed it, we can train a decision tree model.

Our First Decision Tree

In Chapter 3, we built a recommender model right away on all of the available data. This created a recommender that could be sense-checked by anyone with some knowledge of music: looking at a user’s listening habits and recommendations, we got some sense that it was producing good results. Here, that is not possible. We would have no idea how to make up a 54-feature description of a new parcel of land in Colorado, or what kind of forest cover to expect from such a parcel.

Instead, we must jump straight to holding out some data for purposes of evaluating the resulting model. Before, the AUC metric was used to assess the agreement between held-out listening data and predictions from recommendations. AUC may be viewed as the probability that a randomly chosen good recommendation ranks above a randomly chosen bad recommendation. The principle is the same here, although the evaluation metric will be different: accuracy. The majority—90%—of the data will again be used for training, and, later, we’ll see that a subset of this training set will be held out for cross-validation (the CV set). The other 10% held out here is actually a third subset, a proper test set.

(train_data, test_data) = data.randomSplit([0.9, 0.1])
train_data.cache()
test_data.cache()

The data needs a little more preparation to be used with a classifier in MLlib. The input DataFrame contains many columns, each holding one feature that could be used to predict the target column. MLlib requires all of the inputs to be collected into one column, whose value is a vector. PySpark’s Vector class is an abstraction for vectors in the linear algebra sense and contains only numbers. For most intents and purposes, vectors work like a simple array of double values (floating-point numbers). Of course, some of the input features are conceptually categorical, even if they’re all represented with numbers in the input.

Fortunately, the VectorAssembler class can do this work:

from pyspark.ml.feature import VectorAssembler

input_cols = colnames[:-1] 1
vector_assembler = VectorAssembler(inputCols=input_cols,
                                    outputCol="featureVector")

assembled_train_data = vector_assembler.transform(train_data)

assembled_train_data.select("featureVector").show(truncate = False)
...
+------------------------------------------------------------------- ...
|featureVector                                                       ...
+------------------------------------------------------------------- ...
|(54,[0,1,2,5,6,7,8,9,13,18],[1874.0,18.0,14.0,90.0,208.0,209.0, ...
|(54,[0,1,2,3,4,5,6,7,8,9,13,18],[1879.0,28.0,19.0,30.0,12.0,95.0, ...
...
1

Excludes the label, Cover_Type

The key parameters of VectorAssembler are the columns to combine into the feature vector, and the name of the new column containing the feature vector. Here, all columns—except the target, of course—are included as input features. The resulting DataFrame has a new featureVector column, as shown.

The output doesn’t look exactly like a sequence of numbers, but that’s because this shows a raw representation of the vector, represented as a SparseVector instance to save storage. Because most of the 54 values are 0, it stores only nonzero values and their indices. This detail won’t matter in classification.
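If you want to see the dense form, you can pull one row back to the driver and expand it; this is just a sketch for inspection, not something you would do at scale:

first_vector = assembled_train_data.select("featureVector").first()["featureVector"]
first_vector.size        # 54
first_vector.toArray()   # a plain array of 54 doubles, mostly zeros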

VectorAssembler is an example of a Transformer within the current MLlib Pipelines API. It transforms an input DataFrame into another DataFrame based on some logic and is composable with other transformations into a pipeline. Later in this chapter, these transformations will be connected into an actual Pipeline. Here, the transformation is just invoked directly, which is sufficient to build a first decision tree classifier model:

from pyspark.ml.classification import DecisionTreeClassifier

classifier = DecisionTreeClassifier(seed = 1234, labelCol="Cover_Type",
                                    featuresCol="featureVector",
                                    predictionCol="prediction")

model = classifier.fit(assembled_train_data)
print(model.toDebugString)
...
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_da03f8ab5e28, ...
  If (feature 0 <= 3036.5)
   If (feature 0 <= 2546.5)
    If (feature 10 <= 0.5)
     If (feature 0 <= 2412.5)
      If (feature 3 <= 15.0)
       Predict: 4.0
      Else (feature 3 > 15.0)
       Predict: 3.0
     Else (feature 0 > 2412.5)
       ...

Again, the essential configuration for the classifier consists of column names: the column containing the input feature vectors and the column containing the target value to predict. Because the model will later be used to predict new values of the target, it is given the name of a column to store predictions.

Printing a representation of the model shows some of its tree structure. It consists of a series of nested decisions about features, comparing feature values to thresholds. (Here, for historical reasons, the features are referred to only by number, not name, unfortunately.)

Decision trees are able to assess the importance of input features as part of their building process. That is, they can estimate how much each input feature contributes to making correct predictions. This information is simple to access from the model:

import pandas as pd

pd.DataFrame(model.featureImportances.toArray(),
            index=input_cols, columns=['importance']).\
            sort_values(by="importance", ascending=False)
...
                                  importance
Elevation                         0.826854
Hillshade_Noon                    0.029087
Soil_Type_1                       0.028647
Soil_Type_3                       0.026447
Wilderness_Area_0                 0.024917
Horizontal_Distance_To_Hydrology  0.024862
Soil_Type_31                      0.018573
Wilderness_Area_2                 0.012458
Horizontal_Distance_To_Roadways   0.003608
Hillshade_9am                     0.002840
...

This pairs importance values (higher is better) with column names and prints them in order from most to least important. Elevation seems to dominate as the most important feature; most features are estimated to have virtually no importance when predicting the cover type!

The resulting DecisionTreeClassificationModel is itself a transformer because it can transform a dataframe containing feature vectors into a dataframe also containing predictions.

For example, it might be interesting to see what the model predicts on the training data and compare its prediction with the known correct cover type:

predictions = model.transform(assembled_train_data)
predictions.select("Cover_Type", "prediction", "probability").\
            show(10, truncate = False)

...
+----------+----------+------------------------------------------------ ...
|Cover_Type|prediction|probability                                      ...
+----------+----------+------------------------------------------------ ...
|6.0       |4.0       |[0.0,0.0,0.028372324539571926,0.2936784469885515, ...
|6.0       |3.0       |[0.0,0.0,0.024558587479935796,0.6454654895666132, ...
|6.0       |3.0       |[0.0,0.0,0.024558587479935796,0.6454654895666132, ...
|6.0       |3.0       |[0.0,0.0,0.024558587479935796,0.6454654895666132, ...
...

Interestingly, the output also contains a probability column that gives the model’s estimate of how likely each possible outcome is to be correct. It shows that in these instances the model is fairly sure the answer is 3 and quite sure the answer isn’t 1.

Eagle-eyed readers might note that the probability vectors actually have eight values even though there are only seven possible outcomes. The vector’s values at indices 1 to 7 contain the probabilities of outcomes 1 to 7. However, there is also a value at index 0, which always shows as probability 0.0. This can be ignored, since 0 isn’t even a valid outcome. It’s a quirk of representing this information as a vector that’s worth being aware of.

Based on the above snippet, it looks like the model could use some work. Its predictions look like they are often wrong. As with the ALS implementation in Chapter 3, the DecisionTreeClassifier implementation has several hyperparameters for which a value must be chosen, and they’ve all been left to defaults here. Here, the test set can be used to produce an unbiased evaluation of the expected accuracy of a model built with these default hyperparameters.

We will now use MulticlassClassificationEvaluator to compute accuracy and other metrics that evaluate the quality of the model’s predictions. It’s an example of an evaluator in MLlib, which is responsible for assessing the quality of an output DataFrame in some way:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="Cover_Type",
                                        predictionCol="prediction")

evaluator.setMetricName("accuracy").evaluate(predictions)
evaluator.setMetricName("f1").evaluate(predictions)

...
0.6989423087953562
0.6821216079701136

After being given the column containing the “label” (target, or known correct output value) and the name of the column containing the prediction, it finds that the two match about 70% of the time. This is the accuracy of this classifier. It can compute other related measures, like the F1 score. For our purposes here, accuracy will be used to evaluate classifiers.

This single number gives a good summary of the quality of the classifier’s output. Sometimes, however, it can be useful to look at the confusion matrix. This is a table with a row and a column for every possible value of the target. Because there are seven target category values, this is a 7×7 matrix, where each row corresponds to an actual correct value, and each column to a predicted value, in order. The entry at row i and column j counts the number of times an example with true category i was predicted as category j. So, the correct predictions are the counts along the diagonal, and the incorrect predictions are everything else.

It’s possible to calculate a confusion matrix directly with the DataFrame API, using its more general operators.

confusion_matrix = predictions.groupBy("Cover_Type").\
  pivot("prediction", range(1,8)).count().\
  na.fill(0.0).\ 1
  orderBy("Cover_Type")

confusion_matrix.show()

...

+----------+------+------+-----+---+---+---+-----+
|Cover_Type|     1|     2|    3|  4|  5|  6|    7|
+----------+------+------+-----+---+---+---+-----+
|       1.0|133792| 51547|  109|  0|  0|  0| 5223|
|       2.0| 57026|192260| 4888| 57|  0|  0|  750|
|       3.0|     0|  3368|28238|590|  0|  0|    0|
|       4.0|     0|     0| 1493|956|  0|  0|    0|
|       5.0|     0|  8282|  283|  0|  0|  0|    0|
|       6.0|     0|  3371|11872|406|  0|  0|    0|
|       7.0|  8122|    74|    0|  0|  0|  0|10319|
+----------+------+------+-----+---+---+---+-----+
1

Replace null with 0.

Spreadsheet users may have recognized the problem as just like that of computing a pivot table. A pivot table groups values by two dimensions, whose values become rows and columns of the output, and computes some aggregation within those groupings, like a count here. This is also available as a PIVOT function in several databases and is supported by Spark SQL. It’s arguably more elegant and powerful to compute it this way.

Although 70% accuracy sounds decent, it’s not immediately clear whether it is outstanding or poor. How well would a simplistic approach do to establish a baseline? Just as a broken clock is correct twice a day, randomly guessing a classification for each example would also occasionally produce the correct answer.

We could construct such a random “classifier” by picking a class at random in proportion to its prevalence in the training set. For example, if 30% of the training set were cover type 1, then the random classifier would guess “1” 30% of the time. Each classification would be correct in proportion to its prevalence in the test set. If 40% of the test set were cover type 1, then guessing “1” would be correct 40% of the time. Cover type 1 would then be guessed correctly 30% x 40% = 12% of the time and contribute 12% to overall accuracy. Therefore, we can evaluate the accuracy by summing these products of probabilities:

from pyspark.sql import DataFrame

def class_probabilities(data):
    total = data.count()
    return data.groupBy("Cover_Type").count().\ 1
    orderBy("Cover_Type").\ 2
    select(col("count").cast(DoubleType())).\
    withColumn("count_proportion", col("count")/total).\
    select("count_proportion").collect()


train_prior_probabilities = class_probabilities(train_data)
test_prior_probabilities = class_probabilities(test_data)

train_prior_probabilities
...

[Row(count_proportion=0.36455357859838705),
 Row(count_proportion=0.4875111371136425),
 Row(count_proportion=0.06155716924206445),
 Row(count_proportion=0.00468236760696409),
 Row(count_proportion=0.016375858943914835),
 Row(count_proportion=0.029920118693908142),
 Row(count_proportion=0.03539976980111887)]

...

train_prior_probabilities = [p[0] for p in train_prior_probabilities]
test_prior_probabilities = [p[0] for p in test_prior_probabilities]

sum([train_p * test_p for train_p, test_p in zip(train_prior_probabilities,
                                                 test_prior_probabilities)]) 3
...

0.37735294664034547
1

Count by category

2

Order counts by category

3

Sum products of pairs in training and test sets

Random guessing achieves 37% accuracy then, which makes 70% seem like a good result after all. But the latter result was achieved with default hyperparameters. We can do even better by exploring what the hyperparameters actually mean for the tree-building process. That is what we will do in the next section.

Decision Tree Hyperparameters

In Chapter 3, the ALS algorithm exposed several hyperparameters whose values we had to choose by building models with various combinations of values and then assessing the quality of each result using some metric. The process is the same here, although the metric is now multiclass accuracy instead of AUC. The hyperparameters controlling how the tree’s decisions are chosen will be quite different as well: maximum depth, maximum bins, impurity measure, and minimum information gain.

Maximum depth simply limits the number of levels in the decision tree. It is the maximum number of chained decisions that the classifier will make to classify an example. It is useful to limit this to avoid overfitting the training data, as illustrated previously in the pet store example.

The decision tree algorithm is responsible for coming up with potential decision rules to try at each level, like the weight >= 100 or weight >= 500 decisions in the pet store example. Decisions are always of the same form: for numeric features, decisions are of the form feature >= value; and for categorical features, they are of the form feature in (value1, value2, …). So, the set of decision rules to try is really a set of values to plug in to the decision rule. These are referred to as bins in the PySpark MLlib implementation. A larger number of bins requires more processing time but might lead to finding a more optimal decision rule.

What makes a decision rule good? Intuitively, a good rule would meaningfully distinguish examples by target category value. For example, a rule that divides the Covtype dataset into examples with only categories 1–3 on the one hand and 4–7 on the other would be excellent because it clearly separates some categories from others. A rule that resulted in about the same mix of all categories as are found in the whole dataset doesn’t seem helpful. Following either branch of such a decision leads to about the same distribution of possible target values and so doesn’t really make progress toward a confident classification.

Put another way, good rules divide the training data’s target values into relatively homogeneous, or “pure,” subsets. Picking a best rule means minimizing the impurity of the two subsets it induces. There are two commonly used measures of impurity: Gini impurity and entropy.

Gini impurity is directly related to the accuracy of the random guess classifier. Within a subset, it is the probability that a randomly chosen classification of a randomly chosen example (both according to the distribution of classes in the subset) is incorrect. To calculate this value, we square each class’s proportion of examples in the subset, sum those squares, and subtract the sum from 1. If a subset has N classes and $p_i$ is the proportion of examples of class i, then its Gini impurity is given in the Gini impurity equation:

$$I_G(p) = 1 - \sum_{i=1}^{N} p_i^2$$

If the subset contains only one class, this value is 0 because it is completely “pure.” When there are N classes in the subset, this value is larger than 0 and is largest when the classes occur the same number of times—maximally impure.

Entropy is another measure of impurity, borrowed from information theory. Its nature is more difficult to explain, but it captures how much uncertainty the collection of target values in the subset implies about predictions for data that falls in that subset. A subset containing one class suggests that the outcome for the subset is completely certain and has 0 entropy—no uncertainty. A subset containing one of each possible class, on the other hand, suggests a lot of uncertainty about predictions for that subset because data has been observed with all kinds of target values. This has high entropy. Hence, low entropy, like low Gini impurity, is a good thing. Entropy is defined by the entropy equation:

$$I_E(p) = \sum_{i=1}^{N} p_i \log\left(\frac{1}{p_i}\right) = -\sum_{i=1}^{N} p_i \log(p_i)$$

Interestingly, uncertainty has units. Because the logarithm is the natural log (base e), the units are nats, the base e counterpart to more familiar bits (which we can obtain by using log base 2 instead). It really is measuring information, so it’s also common to talk about the information gain of a decision rule when using entropy with decision trees.
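As a quick numeric check of both equations, here they are applied to the pet store labels from Table 4-1 (four good pets, five that are not); this is a standalone sketch, not MLlib code:

import math

def gini(proportions):
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    return -sum(p * math.log(p) for p in proportions if p > 0)  # natural log, so nats

proportions = [4 / 9, 5 / 9]   # good pet vs. not, from Table 4-1
gini(proportions)     # about 0.494
entropy(proportions)  # about 0.687 nats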

One or the other measure may be a better metric for picking decision rules in a given dataset. They are, in a way, similar. Both involve a weighted average: a sum over values weighted by $p_i$. The default in PySpark’s implementation is Gini impurity.

Finally, minimum information gain is a hyperparameter that imposes a minimum information gain, or decrease in impurity, for candidate decision rules. Rules that do not improve the subsets’ impurity enough are rejected. Like a lower maximum depth, this can help the model resist overfitting because decisions that barely help divide the training input may in fact not helpfully divide future data at all.
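All four of these hyperparameters can be set directly on DecisionTreeClassifier. The values below are arbitrary examples for illustration, not recommendations:

classifier = DecisionTreeClassifier(seed=1234, labelCol="Cover_Type",
                                    featuresCol="featureVector",
                                    predictionCol="prediction",
                                    maxDepth=10, maxBins=100,
                                    impurity="gini", minInfoGain=0.01)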

Now that we understand the relevant hyperparameters of a decision tree algorithm, we will tune our model in the next section to improve its performance.

Tuning Decision Trees

It’s not obvious from looking at the data which impurity measure leads to better accuracy or what maximum depth or number of bins is enough without being excessive. Fortunately, as in Chapter 3, it’s simple to let PySpark try a number of combinations of these values and report the results.

First, it’s necessary to set up a pipeline encapsulating the two steps we performed in previous sections—creating a feature vector and using it to create a decision tree model. Creating the VectorAssembler and DecisionTreeClassifier and chaining these two Transformers together results in a single Pipeline object that represents these two operations together as one operation:

from pyspark.ml import Pipeline

assembler = VectorAssembler(inputCols=input_cols, outputCol="featureVector")
classifier = DecisionTreeClassifier(seed=1234, labelCol="Cover_Type",
                                    featuresCol="featureVector",
                                    predictionCol="prediction")

pipeline = Pipeline(stages=[assembler, classifier])

Naturally, pipelines can be much longer and more complex. This is about as simple as it gets. Now we can also define the combinations of hyperparameters that should be tested using the PySpark ML API’s built-in support, ParamGridBuilder. It’s also time to define the evaluation metric that will be used to pick the “best” hyperparameters, and that is again MulticlassClassificationEvaluator:

from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder(). \
  addGrid(classifier.impurity, ["gini", "entropy"]). \
  addGrid(classifier.maxDepth, [1, 20]). \
  addGrid(classifier.maxBins, [40, 300]). \
  addGrid(classifier.minInfoGain, [0.0, 0.05]). \
  build()

multiclassEval = MulticlassClassificationEvaluator(). \
  setLabelCol("Cover_Type"). \
  setPredictionCol("prediction"). \
  setMetricName("accuracy")

This means that a model will be built and evaluated for two values of four hyperparameters. That’s 16 models. They’ll be evaluated by multiclass accuracy. Finally, TrainValidationSplit brings these components together—the pipeline that makes models, model evaluation metrics, and hyperparameters to try—and can run the evaluation on the training data. It’s worth noting that CrossValidator could be used here as well to perform full k-fold cross-validation, but it is k times more expensive and doesn’t add as much value in the presence of big data. So, TrainValidationSplit is used here:

from pyspark.ml.tuning import TrainValidationSplit

validator = TrainValidationSplit(seed=1234,
  estimator=pipeline,
  evaluator=multiclassEval,
  estimatorParamMaps=paramGrid,
  trainRatio=0.9)

validator_model = validator.fit(train_data)

This will take several minutes or more, depending on your hardware, because it’s building and evaluating many models. Note the train ratio parameter is set to 0.9. This means that the training data is actually further subdivided by TrainValidationSplit into 90%/10% subsets. The former is used for training each model. The remaining 10% of the input is held out as a cross-validation set to evaluate the model. If it’s already holding out some data for evaluation, then why did we hold out 10% of the original data as a test set?

If the purpose of the CV set was to evaluate parameters that fit to the training set, then the purpose of the test set is to evaluate hyperparameters that were “fit” to the CV set. That is, the test set ensures an unbiased estimate of the accuracy of the final, chosen model and its hyperparameters.

Say that the best model chosen by this process exhibits 90% accuracy on the CV set. It seems reasonable to expect it will exhibit 90% accuracy on future data. However, there’s an element of randomness in how these models are built. By chance, this model and evaluation could have turned out unusually well. The top model and evaluation result could have benefited from a bit of luck, so its accuracy estimate is likely to be slightly optimistic. Put another way, hyperparameters can overfit too.

To really assess how well this best model is likely to perform on future examples, we need to evaluate it on examples that were not used to train it. But we also need to avoid examples in the CV set that were used to evaluate it. That is why a third subset, the test set, was held out.

The result of the validator contains the best model it found. This itself is a representation of the best overall pipeline it found, because we provided an instance of a pipeline to run. To query the parameters chosen by DecisionTreeClassifier, it’s necessary to manually extract DecisionTreeClassificationModel from the resulting PipelineModel, which is the final stage in the pipeline:

from pprint import pprint

best_model = validator_model.bestModel
pprint(best_model.stages[1].extractParamMap())

...
{Param(...name='predictionCol', doc='prediction column name.'): 'prediction',
 Param(...name='probabilityCol', doc='...'): 'probability',
 [...]
 Param(...name='impurity', doc='...'): 'entropy',
 Param(...name='maxDepth', doc='...'): 20,
 Param(...name='minInfoGain', doc='...'): 0.0,
 [...]
 Param(...name='featuresCol', doc='features column name.'): 'featureVector',
 Param(...name='maxBins', doc='...'): 40,
 [...]
 Param(...name='labelCol', doc='label column name.'): 'Cover_Type'}
 ...

This output contains a lot of information about the fitted model, but it also tells us that entropy apparently worked best as the impurity measure and that a max depth of 20 was not surprisingly better than 1. It might be surprising that the best model was fit with just 40 bins, but this is probably a sign that 40 was “plenty” rather than “better” than 300. Lastly, no minimum information gain was better than a small minimum, which could imply that the model is more prone to underfit than overfit.

You may wonder if it is possible to see the accuracy that each of the models achieved for each combination of hyperparameters. The hyperparameters and the evaluations are exposed by getEstimatorParamMaps and validationMetrics, respectively. They can be combined to display all of the parameter combinations sorted by metric value:


metrics = validator_model.validationMetrics
params = validator_model.getEstimatorParamMaps()
metrics_and_params = list(zip(metrics, params))

metrics_and_params.sort(key=lambda x: x[0], reverse=True)
metrics_and_params

...
[(0.9130409881445563,
  {Param(...name='minInfoGain' ...): 0.0,
   Param(...name='maxDepth'...): 20,
   Param(...name='maxBins' ...): 40,
   Param(...name='impurity'...): 'entropy'}),
 (0.9112655352131498,
  {Param(...name='minInfoGain',...): 0.0,
   Param(...name='maxDepth' ...): 20,
   Param(...name='maxBins'...): 300,
   Param(...name='impurity'...: 'entropy'}),
...

What was the accuracy that this model achieved on the CV set? And, finally, what accuracy does the model achieve on the test set?

metrics.sort(reverse=True)
print(metrics[0])
...

0.9130409881445563
...

multiclassEval.evaluate(best_model.transform(test_data)) 1

...
0.9138921373048084
1

best_model is a complete pipeline.

The results are both about 91%. It happens that the accuracy estimate from the CV set was quite realistic to begin with. In fact, it is not usual for the test set to show a very different result.

This is an interesting point at which to revisit the issue of overfitting. As discussed previously, it’s possible to build a decision tree so deep and elaborate that it fits the given training examples very well or perfectly but fails to generalize to other examples because it has fit the idiosyncrasies and noise of the training data too closely. This is a problem common to most machine learning algorithms, not just decision trees.

When a decision tree has overfit, it will exhibit high accuracy when run on the same training data it was fit to, but low accuracy on other examples. Here, the final model’s accuracy was about 91% on other, new examples. Accuracy can just as easily be evaluated over the same data that the model was trained on, train_data. This gives an accuracy of about 95%. The difference is not large but suggests that the decision tree has overfit the training data to some extent. A lower maximum depth might be a better choice.
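For example, the training-set accuracy can be checked with one line (a sketch; best_model is the tuned pipeline from above, so it can transform the raw training DataFrame directly):

multiclassEval.evaluate(best_model.transform(train_data))
# roughly 0.95, versus about 0.91 on the test set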

So far, we’ve implicitly treated all input features, including categoricals, as if they’re numeric. Can we improve our model’s performance further by treating categorical features as exactly that? We will explore this next.

Categorical Features Revisited

The categorical features in our dataset are one-hot encoded as several binary 0/1 values. Treating these individual features as numeric turns out to be fine, because any decision rule on the “numeric” features will choose thresholds between 0 and 1, and all are equivalent since all values are 0 or 1.

Of course, this encoding forces the decision tree algorithm to consider the values of the underlying categorical features individually. Because features like soil type are broken down into many features and because decision trees treat features individually, it is harder to relate information about related soil types.

For example, nine different soil types are actually part of the Leighton family, and they may be related in ways that the decision tree can exploit. If soil type were encoded as a single categorical feature with 40 values, then the tree could express rules like “if the soil type is one of the nine Leighton family types” directly. However, when encoded as 40 features, the tree would have to learn a sequence of nine decisions on soil type to do the same. The single-feature encoding’s expressiveness may lead to better decisions and more efficient trees.

However, having 40 numeric features represent one 40-valued categorical feature increases memory usage and slows things down.

What about undoing the one-hot encoding? This would replace, for example, the four columns encoding wilderness type with one column that encodes the wilderness type as a number between 0 and 3, like Cover_Type:

from pyspark.sql.functions import udf

def unencode_one_hot(data):
    wilderness_cols = ['Wilderness_Area_' + str(i) for i in range(4)]
    wilderness_assembler = VectorAssembler().\
                            setInputCols(wilderness_cols).\
                            setOutputCol("wilderness")

    unhot_udf = udf(lambda v: v.toArray().tolist().index(1)) 1

    with_wilderness = wilderness_assembler.transform(data).\
      drop(*wilderness_cols).\ 2
      withColumn("wilderness", unhot_udf(col("wilderness")))

    soil_cols = ['Soil_Type_' + str(i) for i in range(40)]
    soil_assembler = VectorAssembler().\
                      setInputCols(soil_cols).\
                      setOutputCol("soil")
    with_soil = soil_assembler.\
                transform(with_wilderness).\
                drop(*soil_cols).\
                withColumn("soil", unhot_udf(col("soil")))

    return with_soil
1

Note UDF definition

2

Drop one-hot columns; no longer needed

Here VectorAssembler is deployed to combine the 4 and 40 wilderness and soil type columns into two Vector columns. The values in these Vectors are all 0, except for one location that has a 1. There’s no simple DataFrame function for this, so we have to define our own UDF that can be used to operate on columns. This turns these two new columns into numbers of just the type we need.

We can now transform our dataset by removing one-hot encoding using our function defined above:

unenc_train_data = unencode_one_hot(train_data)
unenc_train_data.printSchema()
...
root
 |-- Elevation: integer (nullable = true)
 |-- Aspect: integer (nullable = true)
 |-- Slope: integer (nullable = true)
 |-- Horizontal_Distance_To_Hydrology: integer (nullable = true)
 |-- Vertical_Distance_To_Hydrology: integer (nullable = true)
 |-- Horizontal_Distance_To_Roadways: integer (nullable = true)
 |-- Hillshade_9am: integer (nullable = true)
 |-- Hillshade_Noon: integer (nullable = true)
 |-- Hillshade_3pm: integer (nullable = true)
 |-- Horizontal_Distance_To_Fire_Points: integer (nullable = true)
 |-- Cover_Type: double (nullable = true)
 |-- wilderness: string (nullable = true)
 |-- soil: string (nullable = true)
...

unenc_train_data.groupBy('wilderness').count().show()
...

+----------+------+
|wilderness| count|
+----------+------+
|         3| 33271|
|         0|234532|
|         1| 26917|
|         2|228144|
+----------+------+

From here, nearly the same process as above can be used to tune the hyperparameters of a decision tree model built on this data and to choose and evaluate a best model. There’s one important difference, however: nothing about the two new numeric columns indicates that they are actually encodings of categorical values. Treating them as plain numbers is not correct, because their ordering is meaningless. The model will still be built, but because some of the information in these features is effectively unavailable to it, accuracy may suffer.

Internally MLlib can store additional metadata about each column. The details of this data are generally hidden from the caller but include information such as whether the column encodes a categorical value and how many distinct values it takes on. To add this metadata, it’s necessary to put the data through VectorIndexer. Its job is to turn input into properly labeled categorical feature columns. Although we did much of the work already to turn the categorical features into 0-indexed values, VectorIndexer will take care of the metadata.

We need to add this stage to the Pipeline:

from pyspark.ml.feature import VectorIndexer

cols = unenc_train_data.columns
inputCols = [c for c in cols if c!='Cover_Type']

assembler = VectorAssembler().setInputCols(inputCols).setOutputCol("featureVector")

indexer = VectorIndexer().\
  setMaxCategories(40).\ 1
  setInputCol("featureVector").setOutputCol("indexedVector")

classifier = DecisionTreeClassifier().setLabelCol("Cover_Type").\
                                      setFeaturesCol("indexedVector").\
                                      setPredictionCol("prediction")

pipeline = Pipeline().setStages([assembler, indexer, classifier])
1

>= 40 because soil has 40 values

The approach assumes that the training set contains all possible values of each of the categorical features at least once. That is, it works correctly only if all 4 wilderness values and all 40 soil values appear in the training set, so that all possible values get a mapping. Here, that happens to be true, but it may not be for small training sets in which some labels appear very infrequently. In those cases, it could be necessary to manually create and add a VectorIndexerModel with the complete value mapping supplied manually.

Aside from that, the process is the same as before. You should find that it chooses a similar best model but that accuracy on the test set is about 93%. By treating categorical features as actual categorical features rather than as plain numbers, the classifier improved its accuracy by almost 2% over the previous sections’ results.
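If you want to reproduce that comparison, a sketch of the rerun follows. Note that the parameter grid has to be rebuilt against the new classifier instance so that its Param objects refer to the right pipeline stage:

paramGrid = ParamGridBuilder(). \
  addGrid(classifier.impurity, ["gini", "entropy"]). \
  addGrid(classifier.maxDepth, [1, 20]). \
  addGrid(classifier.maxBins, [40, 300]). \
  addGrid(classifier.minInfoGain, [0.0, 0.05]). \
  build()

validator = TrainValidationSplit(seed=1234,
  estimator=pipeline,            # assembler + indexer + classifier
  evaluator=multiclassEval,
  estimatorParamMaps=paramGrid,
  trainRatio=0.9)

validator_model = validator.fit(unenc_train_data)
best_model = validator_model.bestModel

multiclassEval.evaluate(best_model.transform(unencode_one_hot(test_data)))
# about 0.93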

We have trained and tuned a decision tree. Now, we will move on to random forests, a more powerful algorithm. As we will see in the next section, implementing them using PySpark will be surprisingly straightforward at this point.

Random Forests

If you have been following along with the code examples, you may have noticed that your results differ slightly from those presented in the code listings in the book. That is because there is an element of randomness in building decision trees, and the randomness comes into play when you’re deciding what data to use and what decision rules to explore.

The algorithm does not consider every possible decision rule at every level. To do so would take an incredible amount of time. For a categorical feature over N values, there are 2^N − 2 possible decision rules (every subset except the empty set and the entire set). For even a moderately large N, this would create billions of candidate decision rules.

Instead, decision trees use several heuristics to determine which few rules to actually consider. The process of picking rules also involves some randomness; only a few features picked at random are looked at each time, and only values from a random subset of the training data. This trades a bit of accuracy for a lot of speed, but it also means that the decision tree algorithm won’t build the same tree every time. This is a good thing.

It’s good for the same reason that the “wisdom of the crowds” usually beats individual predictions. To illustrate, take this quick quiz: how many black taxis operate in London?

Don’t peek at the answer; guess first.

I guessed 10,000, which is well off the correct answer of about 19,000. Because I guessed low, you’re a bit more likely to have guessed higher than I did, and so the average of our answers will tend to be more accurate. There’s that regression to the mean again. The average guess from an informal poll of 13 people in the office was indeed closer: 11,170.

A key to this effect is that the guesses were independent and didn’t influence one another. (You didn’t peek, did you?) The exercise would be useless if we had all agreed on and used the same methodology to make a guess, because the guesses would have been the same answer—the same potentially quite wrong answer. It would even have been different and worse if I’d merely influenced you by stating my guess up front.

It would be great to have not one tree but many trees, each producing reasonable but different and independent estimations of the right target value. Their collective average prediction should fall close to the true answer, more than any individual tree’s does. It’s the randomness in the process of building that helps create this independence. This is the key to random forests.

Randomness is injected by building many trees, each of which sees a different random subset of data—and even of features. This makes the forest as a whole less prone to overfitting. If a particular feature contains noisy data or is deceptively predictive only in the training set, then most trees will not consider this problem feature most of the time. Most trees will not fit the noise and will tend to “outvote” the trees that have fit the noise in the forest.

The prediction of a random forest is simply a weighted average of the trees’ predictions. For a categorical target, this can be a majority vote or the most probable value based on the average of probabilities produced by the trees. Random forests, like decision trees, also support regression, and the forest’s prediction in this case is the average of the number predicted by each tree.

While random forests are a more powerful and complex classification technique, the good news is that they are virtually no different to use in the pipeline that has been developed in this chapter. Simply drop in a RandomForestClassifier in place of DecisionTreeClassifier and proceed as before. There’s really no more code or API to understand in order to use it:

from pyspark.ml.classification import RandomForestClassifier

classifier = RandomForestClassifier(seed=1234, labelCol="Cover_Type",
                                    featuresCol="indexedVector",
                                    predictionCol="prediction")

Note that this classifier has another hyperparameter: the number of trees to build. Like the max bins hyperparameter, higher values should give better results up to a point. The cost, however, is that building many trees of course takes many times longer than building one.
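For example, the number of trees could be added to the earlier parameter grid alongside the other hyperparameters (a sketch; the small values are chosen only to keep run time reasonable):

paramGrid = ParamGridBuilder(). \
  addGrid(classifier.impurity, ["gini", "entropy"]). \
  addGrid(classifier.maxDepth, [1, 20]). \
  addGrid(classifier.maxBins, [40, 300]). \
  addGrid(classifier.numTrees, [1, 10]). \
  build()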

The accuracy of the best random forest model produced from a similar tuning process is 95% off the bat—about 2% better already, although viewed another way, that’s a 28% reduction in the error rate over the best decision tree built previously, from 7% down to 5%. You may do better with further tuning.

Incidentally, at this point we have a more reliable picture of feature importance:

forest_model = best_model.stages[2]

feature_importance_list = list(zip(forest_model.featureImportances.toArray(),
                                   inputCols))
feature_importance_list.sort(key=lambda x: x[0], reverse=True)

pprint(feature_importance_list)
...
(0.28877055118903183,Elevation)
(0.17288279582959612,soil)
(0.12105056811661499,Horizontal_Distance_To_Roadways)
(0.1121550648692802,Horizontal_Distance_To_Fire_Points)
(0.08805270405239551,wilderness)
(0.04467393191338021,Vertical_Distance_To_Hydrology)
(0.04293099150373547,Horizontal_Distance_To_Hydrology)
(0.03149644050848614,Hillshade_Noon)
(0.028408483578137605,Hillshade_9am)
(0.027185325937200706,Aspect)
(0.027075578474331806,Hillshade_3pm)
(0.015317564027809389,Slope)

Random forests are appealing in the context of big data because trees are meant to be built independently, and big data technologies like Spark and MapReduce are naturally suited to data-parallel problems, where parts of the overall solution can be computed independently on parts of the data. The fact that trees can, and should, train on only a subset of features or input data makes it trivial to parallelize building the trees.

Making Predictions

Building a classifier, while an interesting and nuanced process, is not the end goal. The goal is to make predictions. This is the payoff, and it is comparatively quite easy.

The resulting “best model” is actually a whole pipeline of operations. It encapsulates how input is transformed for use with the model and includes the model itself, which can make predictions. It can operate on a dataframe of new input. The only difference from the data DataFrame we started with is that it lacks the Cover_Type column. When we’re making predictions—especially about the future, says Mr. Bohr—the output is of course not known.

To prove it, try dropping the Cover_Type from the test data input and obtaining a prediction:

unenc_test_data = unencode_one_hot(test_data)
best_model.transform(unenc_test_data.drop("Cover_Type")).\
           select("prediction").show()

...
+----------+
|prediction|
+----------+
|       6.0|
+----------+

The result should be 6.0, which corresponds to cover type 6 in the original Covtype dataset (recall that the model predicts the 1-indexed Cover_Type values directly, as discussed earlier). The predicted cover type for the land described in this example is Douglas-fir.

Where to Go from Here

This chapter introduced two related and important types of machine learning, classification and regression, along with some foundational concepts in building and tuning models: features, vectors, training, and cross-validation. It demonstrated how to predict a type of forest cover from things like location and soil type using the Covtype dataset, with decision trees and forests implemented in PySpark.

As with recommenders in Chapter 3, it could be useful to continue exploring the effect of hyperparameters on accuracy. Most decision tree hyperparameters trade time for accuracy: more bins and trees generally produce better accuracy but hit a point of diminishing returns.

The classifier here turned out to be very accurate. It’s unusual to achieve more than 95% accuracy. In general, you will achieve further improvements in accuracy by including more features or transforming existing features into a more predictive form. This is a common, repeated step in iteratively improving a classifier model. For example, for this dataset, the two features encoding horizontal and vertical distance to surface water could produce a third feature: straight-line distance to surface water. This might turn out to be more useful than either original feature. Or, if it were possible to collect more data, we might try adding new information like soil moisture to improve classification.
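As a sketch of that idea, the derived feature could be computed with a couple of DataFrame functions (the new column name here is made up for illustration):

from pyspark.sql.functions import col, hypot

data_with_distance = data.withColumn(
    "Straight_Distance_To_Hydrology",
    hypot(col("Horizontal_Distance_To_Hydrology"),
          col("Vertical_Distance_To_Hydrology")))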

Of course, not all prediction problems in the real world are exactly like the Covtype dataset. For example, some problems require predicting a continuous numeric value, not a categorical value. Much of the same analysis and code applies to this type of regression problem; the RandomForestRegressor class will be of use in this case.

Furthermore, decision trees and forests are not the only classification or regression algorithms, and not the only ones implemented in PySpark. Each algorithm operates quite differently from decision trees and forests. However, many elements are the same: they plug into a Pipeline and operate on columns in a dataframe, and have hyperparameters that you must select using training, cross-validation, and test subsets of the input data. The same general principles, with these other algorithms, can also be deployed to model classification and regression problems.

These have been examples of supervised learning. What happens when some, or all, of the target values are unknown? The following chapter will explore what can be done in this situation.
