Chapter 7. How to Build a Dataset

The dataset is the foundation of any edge AI project. With a great dataset, every task in the workflow becomes both easier and less risky—from selecting the right algorithm to understanding your hardware requirements and evaluating real-world performance.

Datasets are indisputably critical for machine learning projects, where data is used directly for training models. However, data is vital even if your edge AI application doesn’t require machine learning. Datasets are necessary in order to select effective signal processing techniques, design heuristic algorithms, and test applications under realistic conditions.

Collecting a dataset is typically the most difficult, time-consuming, and expensive part of any edge AI project. It’s also the place where you are most likely to make terrible, hard-to-detect mistakes that can doom your project to failure. This chapter is designed to introduce today’s best practices for building an edge AI dataset. It’s probably the most important chapter of this book.

What Does a Dataset Look Like?

Every dataset is made up of individual items, known as records, each of which contains one or more pieces of information, known as features. Each feature may be a completely different data type: numbers, time series, images, and text are all common. This structure is shown in Figure 7-1.

A diagram showing a stack of records, each with features.
Figure 7-1. A dataset contains ...
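
To make the record-and-feature structure concrete, here is a minimal Python sketch. It is only an illustration, not a format prescribed by this book: the `Record` class, the `feature_types` helper, and the feature names (`temperature_c`, `accel_x`, `note`) are all hypothetical, chosen to show that one record can hold features of different data types.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set


@dataclass
class Record:
    """One item in the dataset: a mapping from feature name to value."""
    features: Dict[str, Any]


# A tiny, made-up dataset of two records. Each record mixes data types:
# a single number, a short time series, and a piece of text.
dataset: List[Record] = [
    Record(features={
        "temperature_c": 21.5,              # number
        "accel_x": [0.01, 0.02, -0.01],     # time series
        "note": "idle",                     # text
    }),
    Record(features={
        "temperature_c": 48.2,
        "accel_x": [0.90, -1.10, 0.75],
        "note": "running",
    }),
]


def feature_types(records: List[Record]) -> Dict[str, Set[str]]:
    """Return the Python type names observed for each feature name."""
    seen: Dict[str, Set[str]] = {}
    for record in records:
        for name, value in record.features.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return seen


if __name__ == "__main__":
    for name, types in feature_types(dataset).items():
        print(name, sorted(types))
```

A plain dictionary per record is used here because it allows each feature to carry its own type. In a real project the storage format would depend on the sensors and tooling involved.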
