Creating training and test sets

When a dataset is large enough, it's a good practice to split it into training and test sets, the former to be used for training the model and the latter to test its performances. In the following diagram, there's a schematic representation of this process:

Training/test set split process schema

There are two main rules in performing such an operation:

  • Both datasets must reflect the original distribution
  • The original dataset must be randomly shuffled before the split phase in order to avoid a correlation between consequent elements

With scikit-learn, this can be achieved by using the train_test_split() function: ...

Get Machine Learning Algorithms - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.