5.1 OVERVIEW

The ability to generate summaries and make general statements about the data, and relationships within the data, is at the heart of exploratory data analysis and data mining methods. In almost every situation we will be making general statements about entire populations, yet we will be using a subset or sample of observations. The distinction between a population and a sample is important:

  • Population: A precise definition of all possible outcomes, measurements, or values about which inferences will be made.
  • Sample: A portion of the population that is representative of the entire population.

Parameters are numbers that characterize a population, whereas statistics are numbers that summarize the data collected from a sample of the population. For example, a market researcher asks a sample of consumers of a particular product about their preferences and uses this information to make general statements about all consumers. The entire population of interest must be defined (i.e., all consumers of the product). Care must be taken in selecting the sample, since it must be an unbiased, random sample from the entire population. Using this carefully selected sample, it is possible to make confident statements about the population in any exploratory data analysis or data mining project.
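The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the population is a made-up list of preference scores between 1 and 10, and the helper name draw_sample is hypothetical. It draws a simple random sample and compares the sample mean (a statistic) with the population mean (a parameter).

```python
import random
import statistics


def draw_sample(population, n, seed=0):
    """Draw a simple random sample of size n without replacement.

    Hypothetical helper: every member of the population has the same
    chance of being selected, which is what keeps the sample unbiased.
    """
    rng = random.Random(seed)
    return rng.sample(population, n)


# Made-up population: preference scores (1 to 10) for 10,000 consumers.
# In practice the full population is rarely available; it is generated
# here only so the parameter can be compared with the statistic.
population_rng = random.Random(42)
population = [population_rng.randint(1, 10) for _ in range(10_000)]

sample = draw_sample(population, 200)

population_mean = statistics.mean(population)  # parameter
sample_mean = statistics.mean(sample)          # statistic

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

With a random sample of this size the two means will usually be close but not identical. That gap is the reason statements about the population are made with a stated level of confidence rather than with certainty.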

Statistical methods can play an important role in several ways, including:

  • Summarizing the data: Statistics not only provide us with methods for summarizing sample data ...
