Chapter 3. Finding or Building a Dataset

Many introductory exercises on machine learning or artificial intelligence (AI) provide datasets for use, as is done in this book. When faced with your own implementation, this will likely not be the case. Sourcing and transforming input data that can be modeled to solve a specific problem is a challenge in and of itself.

A common colloquialism in data science is GIGO, short for “garbage in, garbage out”: a model can be only as good as the data on which it was trained. This chapter covers the available options for sourcing premade datasets or designing and building your own.

Be aware that every step in this process depends heavily on your individual data and solution requirements, so the discussion is a little high-level in places. Many practices in data wrangling are honed only with experience.

Planning and Identifying Data to Target

Some refer to the dataset planning phase as a design process, and this is no exaggeration. One of the easiest mistakes to make in the machine-learning process is selecting input data that is not appropriate for the job. This does not necessarily mean data of poor quality or provenance; it can simply be data that does not sufficiently correlate with the outcome the model is meant to predict.
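One rough way to check whether a candidate input relates to the outcome at all is to compute a correlation coefficient before committing to it. The following is a minimal sketch in Python, not a method prescribed by this chapter: the `pearson` helper and the toy columns (`humidity`, `sensor_id`, `rainfall_mm`) are hypothetical and invented purely for illustration. Pearson correlation captures only linear relationships, so a low score is a prompt to look closer rather than proof that a feature is useless.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear signal; report 0 by convention.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data, made up for this example only.
humidity = [30, 45, 50, 65, 80, 90]            # candidate input
sensor_id = [3, 1, 4, 1, 5, 2]                 # arbitrary identifier
rainfall_mm = [0.0, 1.0, 1.5, 3.0, 5.0, 6.5]   # outcome to predict

for name, column in [("humidity", humidity), ("sensor_id", sensor_id)]:
    print(f"{name}: r = {pearson(column, rainfall_mm):.2f}")
```

In this made-up data, humidity tracks rainfall closely while the identifier column does not, which is the kind of distinction worth surfacing during dataset planning.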

Not all machine-learning problems are solved outside of the technical aspects. Although it might seem intuitive that predicting the weather requires, at the very least, records of historical temperature, humidity, and outlook data, ...
