Train and test data

In practice, data usually will be split randomly 70-30 or 80-20 into train and test datasets respectively in statistical modeling, in which training data utilized for building the model and its effectiveness will be checked on test data:

In the following code, we split the original data into train and test data by 70 percent - 30 percent. An important point to consider here is that we set the seed values for random numbers in order to repeat the random sampling every time we create the same observations in training and testing data. Repeatability is very much needed in order to reproduce the results:

# Train & Test split ...

Get Statistics for Machine Learning now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.