Big data’s biggest secret: Hyperparameter tuning
The toughest part of machine learning with Spark isn't what you think it is.
Machine learning may seem magical to the uninitiated. Practitioners who apply machine learning to massive real-world data sets know there is indeed some magic involved—but not the good kind. A set of “magic numbers,” called hyperparameters, is often crucial to obtaining usable results, and a given machine learning algorithm may have several of them. Choosing the best hyperparameters for a given algorithm and a given set of data is one of the most challenging parts of machine learning, and yet it’s often glossed over while learning these techniques. Even courses I’ve produced myself are guilty of this.
Let’s imagine you’re tasked with a simple logistic regression on some big data set, using Apache Spark’s machine learning library MLlib. All you’re trying to do is classify some new data given the trends found in your training data. You’ll soon find that Spark’s LogisticRegression class wants you to initialize it with several parameters: the ElasticNet mixing parameter, the regularization parameter, threshold values, the convergence tolerance, and more. What’s worse, you’ll find that the documentation on what these values mean and what to set them to is non-existent. There is no documentation because there is no right answer—the values that produce the best results will depend on your particular data.
Here’s Spark’s Logistic Regression sample code, illustrating hard-coded hyperparameters:
// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// We can also use the multinomial family for binary classification
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")

val mlrModel = mlr.fit(training)

// Print the coefficients and intercepts for logistic regression with multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")
Cross-validation is one way to find a good set of hyperparameters. The idea is to randomly separate your data into training and testing data sets, where you train your model on your training set and then evaluate its performance in predicting classifications on the data you held out for testing—data that your model hasn’t seen before. Cross-validation splits your data into training and testing sets using multiple “folds” to produce a more robust evaluation of your model.
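To make that concrete, here is a minimal sketch of the single train/test split that cross-validation repeats across its folds. It assumes a DataFrame named data with the usual "features" and "label" columns; that name and those columns are my assumptions for illustration:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Hold out 25% of the data for testing (assumes a DataFrame "data" with features/label columns)
val Array(trainData, testData) = data.randomSplit(Array(0.75, 0.25), seed = 42)

// Train only on the training split
val holdoutModel = new LogisticRegression().fit(trainData)

// Evaluate on data the model has never seen; the default metric is area under the ROC curve
val auc = new BinaryClassificationEvaluator().evaluate(holdoutModel.transform(testData))
println(s"Held-out AUC: $auc")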
The trouble is, things blow up quickly when you’re dealing with a lot of potential hyperparameter values across a lot of cross-validation folds. Let’s say you just want to find the best of 10 possible values for each of our logistic regression’s ElasticNet and regularization parameters, using 3-fold cross-validation. You would need to train and evaluate (10 × 10) × 3 = 300 different models, just to tune these two values!
Given the power of cloud computing or your employer’s own data center, we shouldn’t get too hung up on the sheer computing power needed to do this—but that computing time still isn’t free. We want to use software tools to make this process run reliably the first time, and to do it as efficiently as possible. That’s why you need something like Apache Spark running on a cluster to tune even a simple model like logistic regression on a data set of moderate scale.
Fortunately, Spark’s MLlib contains a CrossValidator tool that makes tuning hyperparameters a little less painful. The CrossValidator can be used with any algorithm supported by MLlib. Here’s how it works: you pass in an Estimator, which is the specific algorithm or even a Pipeline of algorithms you’re trying to tune. You also give it an Evaluator used to measure the results of each model on your testing sets; for our logistic regression model, we might choose a BinaryClassificationEvaluator if we are dealing with just two classes. Finally, we need to pass in the set of parameters and values we wish to try out. This is defined in something called a parameter grid. A grid is a good way to think about it—you can imagine a 10×10 grid defining every possible combination of 10 ElasticNet and 10 regularization parameter values you want to try out, for example. If you were to add a third hyperparameter to tune, your grid would then extend into a cube—you can see how these combinations add up quickly. Fortunately, Spark provides a ParamGridBuilder utility to make constructing these parameter grids easy.
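Here is a minimal sketch of such a three-parameter grid; I’ve thrown in maxIter as the third hyperparameter purely for illustration, and kept only two candidate values per parameter so the output stays small:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()

// With 10 candidate values per parameter instead of 2, this cube would hold 10 x 10 x 10 = 1,000 combinations
val cubeGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .addGrid(lr.maxIter, Array(10, 100))
  .build()

println(s"Combinations to evaluate: ${cubeGrid.length}")  // 2 x 2 x 2 = 8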
I’ll refer you to Spark’s documentation for an example of using a CrossValidator to tune a set of hyperparameters spread across an entire machine learning Pipeline that consists of tokenizing, hashing, and applying logistic regression to some sample data. Our simpler example might look something like this, assuming your entire training data set exists in a DataFrame called “training”:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression().setMaxIter(10)

// Try 10 candidate values each for the regularization and ElasticNet mixing parameters
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.5, 0.3, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001))
  .addGrid(lr.elasticNetParam, Array(0.5, 0.3, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001))
  .build()

// Evaluate every combination of those values with 3-fold cross-validation
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val model = cv.fit(training)
At this point, you’ll have a model that uses the best choice of hyperparameters out of the ones you listed.
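If you’re curious which combination won, the fitted CrossValidatorModel exposes it through its bestModel member; here is a brief sketch, with the cast reflecting the fact that our estimator was a LogisticRegression:

import org.apache.spark.ml.classification.LogisticRegressionModel

// Pull out the winning model and inspect the hyperparameters it was trained with
val best = model.bestModel.asInstanceOf[LogisticRegressionModel]
println(s"Best regParam: ${best.getRegParam}")
println(s"Best elasticNetParam: ${best.getElasticNetParam}")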
While this is easy to do, there are still some problems with this approach. It’s still very expensive; there’s really no avoiding the need to evaluate every possible combination of the hyperparameter values you want to try out. K-fold cross-validation adds insult to injury—we need to evaluate every possible combination K times, once for each cross-validation “fold.” We can take a shortcut by using Spark’s TrainValidationSplit instead of CrossValidator. It evaluates each combination only once, against a single random train/validation split of your data. That can be a dangerous thing to do, but it can yield good results given a sufficiently massive training data set. If your training data is so massive that performing K-fold cross-validation on it is impractical, the mere fact that you’re looking for a shortcut may mean that a simple TrainValidationSplit could be good enough.
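Swapping it in is nearly a drop-in change. Here is a sketch that reuses the lr, paramGrid, and training values from the earlier example; the 75/25 split ratio is just an assumption for illustration:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.TrainValidationSplit

// Each hyperparameter combination is evaluated once, on a single random 75/25 train/validation split
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.75)

val tvsModel = tvs.fit(training)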
Both the CrossValidator and TrainValidationSplit also suffer from being “black boxes.” When applying them to machine learning pipelines, you lose visibility into the actual accuracy measured for each combination of hyperparameters. Visualizing and understanding this can be helpful, as you may see that the trends over the range of the values you provided imply that the optimal values lie in between two values you specified, or perhaps even outside the range of values you provided. Given the minimal information available for how these parameters are used, that’s an easy mistake to make. Often, practitioners will code up their own cross-validation instead, in order to gain more visibility into the individual accuracy of each set of parameters.
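A hand-rolled version might look something like the sketch below: loop over candidate values yourself, score each combination on a held-out split, and print every result so nothing stays hidden. The candidate values, split ratio, and AUC metric are my assumptions, not anything prescribed by Spark:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val Array(trainSet, testSet) = training.randomSplit(Array(0.75, 0.25), seed = 42)
val evaluator = new BinaryClassificationEvaluator  // area under the ROC curve by default

// Evaluate every combination ourselves so each individual score stays visible
val results = for {
  reg  <- Seq(0.3, 0.1, 0.01)
  enet <- Seq(0.0, 0.5, 1.0)
} yield {
  val candidate = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(reg)
    .setElasticNetParam(enet)
  val auc = evaluator.evaluate(candidate.fit(trainSet).transform(testSet))
  println(s"regParam=$reg elasticNetParam=$enet AUC=$auc")
  (reg, enet, auc)
}

val (bestReg, bestEnet, bestAuc) = results.maxBy(_._3)
println(s"Best so far: regParam=$bestReg, elasticNetParam=$bestEnet (AUC=$bestAuc)")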
A third problem is the lack of guidance on which hyperparameter values to test. Often, the documentation will only tell you a value falls between 0 and 1, or is simply greater than 0. Trying a range of exponential steps is often a decent approach, but why should you do the guessing? If you take the step of implementing your own cross-validation code, it’s not much more work to implement a Monte Carlo approach. A Monte Carlo search of parameter values is a really fancy way of saying “just try a bunch of different parameters at random within some given range.” It also adds an element of randomness to which model is chosen among models with comparable performance. The Salmon Run blog has a lot more depth and sample code for implementing Monte Carlo search on an MLlib pipeline. There is also some research into using techniques such as simulated annealing to converge on the best hyperparameter values, without having to guess them first.
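As a hedged illustration of the idea (not the Salmon Run code), a Monte Carlo search can be as simple as feeding randomly drawn ParamMaps to the same CrossValidator we built earlier; the sample count and ranges below are arbitrary choices:

import org.apache.spark.ml.param.ParamMap
import scala.util.Random

val rng = new Random(42)

// Draw 20 random combinations: regParam roughly log-uniform between 1e-5 and 1,
// elasticNetParam uniform between 0 and 1
val randomGrid: Array[ParamMap] = Array.fill(20) {
  new ParamMap()
    .put(lr.regParam, math.pow(10, -5.0 * rng.nextDouble()))
    .put(lr.elasticNetParam, rng.nextDouble())
}

val randomSearchModel = cv.setEstimatorParamMaps(randomGrid).fit(training)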
Performing machine learning on massive data sets is a resource-intensive task as it is, but the problem of hyperparameter tuning can increase those resource requirements by an order of magnitude. Although Spark provides tools for making this easy from a software standpoint, optimizing this task is an area of active research. Open source projects such as the Fregata library are tackling the problem of tuning and executing regression models at massive scale as efficiently as possible.