The first edition of Statistics in a Nutshell was a resounding success, but all books can be improved, and I'm grateful to have the opportunity to revise this one. My basic approach to the material hasn't changed: this is much more a book for people who want to think about and understand statistics than it is a book showing you how to use a particular computing package or delving into the mathematical theory behind statistics formulas. This book is also a little different from many titles in the O’Reilly Nutshell series—it’s really somewhere between a handbook for people who already know statistics and an introductory textbook for people learning statistics for the first time.
Despite the continued infiltration of statistics into many realms of life, one thing hasn't changed: telling people I work as a statistician is still the best way to derail a promising conversation at a party. For some reason, this seems to prompt people to tell me how much they hated the statistics class required for their college major or to quote that old chestnut popularized by Mark Twain that there are three kinds of lies: lies, damned lies, and statistics.
Personally, I find statistics fascinating, and I love working in this field. I like teaching statistics as well, and I like to believe that I communicate this enthusiasm to others. It’s often an uphill battle, however; many people seem to believe that statistics is no more than a set of tricks and manipulations whose purpose is to twist reality to mislead other people. Others take the opposite view, believing that statistics is a collection of magical procedures that will do their thinking for them.
Before you jump into the technical details of learning and using statistics, step back for a minute and consider what can be meant by the word “statistics.” Don’t worry if you don’t understand all the vocabulary immediately; it will become clear over the course of reading this book.
When people speak of statistics, they usually mean one or more of the following:
1. Numerical data, such as the unemployment rate, the number of persons who die annually from bee stings, or the population of New York City in 2006 as compared to 1906.
2. Numbers used to describe samples of data as opposed to parameters (numbers used to describe populations). For instance, an advertising firm might be interested in the average age of people who subscribe to Sports Illustrated. To answer this question, it could draw a random sample of subscribers, calculate the mean of that sample (a statistic), and use that as an estimate of the mean of the entire population of subscribers (a parameter).
3. Particular procedures used to analyze data, and the results of those procedures, such as the t statistic or the chi-square statistic.
4. A field of study that develops and uses mathematical procedures to describe data and make decisions regarding it.
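The distinction in the second definition between a statistic and a parameter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subscriber ages are simulated rather than real, and the population size, age range, and sample size are arbitrary choices made for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 100,000 made-up magazine subscribers.
population = [random.randint(18, 80) for _ in range(100_000)]

# The parameter: the mean age of the entire population.
# In practice this is usually unknown, which is why we sample.
population_mean = statistics.mean(population)

# The statistic: the mean age of a simple random sample of 500 subscribers.
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values should be close but not identical; the gap between them is sampling error, which is the subject of inferential statistics.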
The type of statistics referred to in definition number 1 is not the primary concern of this book. If you simply want to find the latest figures on unemployment, health, or any of the myriad other topics on which governments and other organizations regularly release statistical data, your best bet is to consult a reference librarian or subject matter expert. If, however, you want to know how to interpret those figures (to understand why the mean is often misleading as a statement of average value, for instance, or the difference between crude and standardized mortality rates), Statistics in a Nutshell can definitely help you.
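To illustrate why the mean can mislead as a statement of average value, here is a minimal Python sketch. The salary figures are hypothetical, chosen only to show how a single extreme value pulls the mean away from the typical case while the median barely moves.

```python
import statistics

# Hypothetical annual salaries at a small firm: five employees and one owner.
salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 1_000_000]

mean_salary = statistics.mean(salaries)      # sensitive to the one extreme value
median_salary = statistics.median(salaries)  # middle of the ordered values

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
```

Here the mean is several times larger than what any of the five employees earns, whereas the median sits in the middle of their salaries.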
The concepts included in definition number 2 will be discussed in Chapter 3, which introduces inferential statistics, but these concepts also permeate the entire book. The distinction is partly a question of vocabulary (statistics are numbers that describe samples, whereas parameters are numbers that describe populations), but it also underscores a fundamental point about the practice of statistics. The concept of using information gained from studying a sample to make statements about a population is the basis of inferential statistics, and inferential statistics is the primary focus of this book (as it is of most books about statistics).
Definition number 3 is also fundamental to most chapters of this book. The process of learning statistics is to some extent the process of learning particular statistical procedures, including how to calculate and interpret them, how to choose the appropriate statistic for a given situation, and so on. In fact, many new students of statistics subscribe primarily to this definition; learning statistics to them means learning to execute a set of statistical procedures. This is not so much an invalid approach to statistics as it is incomplete; learning to execute statistical procedures is a necessary part of the practice of statistics, but it is far from being the entire story. What’s more, since computer software has made it increasingly easy for anyone, regardless of mathematical background, to produce statistical analyses, the need to understand and interpret statistics has far outstripped the need to learn how to do the calculations themselves.
Definition number 4 is nearest to my heart because I chose statistics as my professional field. If you are a secondary or post-secondary student, you are probably aware of this definition of statistics because many universities and colleges today either have a separate department of statistics or include statistics as a field of specialization within the department of mathematics. Statistics is increasingly taught in high school as well, and in the United States, enrollment in advanced placement (AP) statistics classes is increasing rapidly.
Statistics is not only a specialist subject at the university level. Many university departments require students to take one or more statistics courses alongside subjects in their major. In addition, it’s worth knowing that many important techniques in modern statistics have been developed by people who learned and used statistics as part of their work in another field. Stephen Raudenbush, a pioneer in the development of hierarchical linear modeling, studied Policy Analysis and Evaluation Research at Harvard, and Edward Tufte, perhaps the world’s leading expert on statistical graphics, began his career as a political scientist: he wrote his PhD dissertation at Yale on the American Civil Rights movement.
Because the use of statistics in many professions and at all levels from management to line workers is increasing, acquiring a basic knowledge of statistics has become a necessity for many people who have been out of school for years. Such individuals are often ill served by textbooks aimed at introductory college courses, which are too specialized, too focused on calculation, and too expensive.
Finally, statistics cannot be left to the statisticians because a working knowledge of it is also necessary to take part in modern civic life, in particular to understand much of what you read in the newspaper and hear on the television and radio. A working knowledge of statistics is the best check against the proliferation of misleading or outright false numerical claims (whether by politicians, advertisers, or social reformers), which seem to occupy an ever-increasing portion of our daily news diet. There’s a reason that Darrell Huff’s 1954 classic How to Lie with Statistics remains in print: statistics are easy to misuse, the common techniques of statistical distortion have been around for decades, and the best defense against those who would lie with statistics is to educate yourself so you can spot the lies and stop the liars in their tracks.
There are so many statistics books already on the market that you might well wonder why I feel the need to add another to the pile. The primary reason is that I haven’t found any statistics books that answer the needs I have addressed in Statistics in a Nutshell. In fact, if I may wax poetic for a moment, the situation is, to paraphrase the plight of Coleridge’s Ancient Mariner, “books, books, everywhere, nor any with which to learn.” The issues I have tried to address with this book are the following:
The need for a book that focuses on using and understanding statistics in a research or applications context, not as a discrete set of mathematical techniques but as part of the process of reasoning with numbers.
The need to integrate discussion of issues such as measurement and data management into an introductory statistics text.
The need for a statistics book that isn’t focused on a particular subject area. Elementary statistics is largely the same across subjects (a t-test is pretty much the same whether the data comes from medicine, finance, or criminal justice), so there’s no need for a proliferation of texts presenting the same information with a slightly different spin.
The need for an introductory statistics book that is compact, inexpensive, and easy for beginners to understand without being condescending or overly simplistic.
So who is the intended audience of Statistics in a Nutshell? I see three groups whose needs it particularly addresses:
Students taking introductory statistics classes in high schools, colleges, and universities
Adults who need to learn statistics as part of their current jobs or to be eligible for a promotion
People who are interested in learning about statistics out of intellectual curiosity
My focus throughout Statistics in a Nutshell is not on particular techniques, although many are taught within this work, but on statistical reasoning. You might say that the focus in this book is less on doing statistics and more on thinking statistically. What does that mean? Above all, it means taking part in the process of thinking with numbers. More particularly, I focus on thinking about data and using statistics to aid in that process. Most chapters include some practice exercises, but these are meant to provide an opportunity to review the material presented and think about the important concepts covered in the chapter; they are not meant to be mindless calculation.
All the material in Statistics in a Nutshell has been revised, and most of the chapters beefed up with new examples and exercises. In particular, more examples working with proportions have been added, as have additional examples using real data sets, from sources such as the United Nations Human Development Project and the Behavioral Risk Factor Surveillance System; both data sets are available for free download from the Internet, so students can experiment with them as well as replicate the analyses in this book. One new chapter has been added to this edition: Chapter 19. I added this chapter because of my observation that, particularly for people learning statistics for vocational reasons, the ability to communicate statistical information is at least as important as the ability to perform statistical computations. Several new appendixes have also been added, mainly to make the book more self-sufficient and user-friendly. These include probability tables for the most common distributions, a bibliography of online sources of information, and a glossary and table of statistical notation.
It’s become fashionable to say that we’re living in the Age of Information, when so many facts are collected and disseminated that no one could possibly keep up with them. This is one cliché based in truth; as a society, we are drowning in data, and the problem seems likely to increase. There are both positive and negative sides to this circumstance. On the positive side, wide access to computing technology and electronic means of data storage and dissemination have made information easier to access, so researchers have less need to travel to a particular library or archive to peruse printed copies of records.
However, data has no meaning in and of itself. It has to be organized and interpreted by human beings before it becomes meaningful, so participating fully in the Information Age requires becoming fluent in understanding data, including the ways it is collected, analyzed, and interpreted. And because the same data can often be interpreted in many ways to support radically different conclusions, even people who don’t engage in statistical work themselves need to understand how statistics work and how to identify invalid claims and arguments based on the misuse of data.
Statistics in a Nutshell is organized in four parts: introductory material (Chapters 1–4) that lays the necessary foundation for the chapters that follow; inferential statistical techniques (Chapters 5–13); specialized techniques used in different professional fields (Chapters 14–16); and ancillary topics that are often part of the statistician’s job, even if they are not strictly statistical (Chapters 17–20).
Here’s a more detailed breakdown of the chapters:
- Chapter 1, Basic Concepts of Measurement
Discusses foundational issues for statistics, including levels of measurement, operationalization, proxy measurement, random and systematic error, reliability and validity, and types of bias.
- Chapter 2, Probability
Introduces the basics of probability, including trials, events, independence, mutual exclusivity, the addition and multiplication laws, combinations and permutations, conditional probability, and Bayes’ theorem.
- Chapter 3, Inferential Statistics
Introduces some basic concepts of inferential statistics, including probability distributions, independent and dependent variables, populations and samples, common types of sampling, the central limit theorem, hypothesis testing, Type I and Type II errors, confidence intervals and p-values, and data transformation.
- Chapter 4, Descriptive Statistics and Graphic Displays
Introduces common measures of central tendency and dispersion, including mean, median, mode, range, interquartile range, variance, and standard deviation, and discusses outliers. Some of the most commonly used graphical techniques for presenting statistical information are also covered in this chapter, including frequency tables, bar charts, pie charts, Pareto charts, stem and leaf plots, boxplots, histograms, scatterplots, and line graphs.
- Chapter 5, Categorical Data
Reviews the concepts of categorical and interval data and introduces the R×C table. Statistics covered in this chapter include the chi-square tests for independence, equality of proportions, and goodness of fit, Fisher’s exact test, McNemar’s test, large-sample tests for proportions, and measures of association for categorical and ordinal data.
- Chapter 6, The t-Test
Discusses the t-distribution and the theory and use of the one-sample t-test, the two independent samples t-test, the repeated measures t-test, and the unequal variance t-test.
- Chapter 7, The Pearson Correlation Coefficient
Introduces the concept of association with graphics displaying different strengths of association between two variables and discusses the Pearson Correlation Coefficient and the Coefficient of Determination.
- Chapter 8, Introduction to Regression and ANOVA
Relates linear regression and ANOVA to the concept of the General Linear Model and discusses assumptions made when using these designs. Simple (bivariate) regression, one-way ANOVA, and post hoc testing are discussed and demonstrated.
- Chapter 9, Factorial ANOVA and ANCOVA
Discusses more-complex ANOVA designs, including two-way and three-way ANOVA and ANCOVA, and presents the topic of interaction.
- Chapter 10, Multiple Linear Regression
Extends the simple regression model to include multiple predictors. Topics covered include relationships among predictor variables, standardized and unstandardized coefficients, dummy variables, methods of model building, and violations of assumptions of linear regression, including nonlinearity, autocorrelation, and heteroscedasticity.
- Chapter 11, Logistic, Multinomial, and Polynomial Regression
Expands the technique of regression to data with binary outcomes (logistic regression), categorical outcomes (multinomial regression), and nonlinear models (polynomial regression) and discusses the problem of overfitting a model.
- Chapter 12, Factor Analysis, Cluster Analysis, and Discriminant Function Analysis
Demonstrates three advanced statistical procedures (factor analysis, cluster analysis, and discriminant function analysis) and discusses the types of problems for which each technique might be useful.
- Chapter 13, Nonparametric Statistics
Discusses when to use nonparametric rather than parametric statistics and presents nonparametric statistics for between-subjects and within-subjects designs, including the Wilcoxon Rank Sum and Mann-Whitney U tests, the sign test, the median test, the Kruskal-Wallis H test, the Wilcoxon signed rank test, and the Friedman test.
- Chapter 14, Business and Quality Improvement Statistics
Demonstrates statistical procedures commonly used in business and quality improvement contexts. Analytical and statistical procedures covered include index numbers; time series; the minimax, maximax, and maximin decision criteria; decision making under risk; decision trees; and control charts.
- Chapter 15, Medical and Epidemiological Statistics
Introduces concepts and demonstrates statistical procedures particularly relevant to medicine and epidemiology. Concepts and statistics presented include the definition and use of ratios, proportions, and rates; measures of prevalence and incidence; crude and standardized rates; direct and indirect standardization; measures of risk; confounding; the simple and Mantel-Haenszel odds ratio; and precision, power, and sample-size calculations.
- Chapter 16, Educational and Psychological Statistics
Introduces concepts and statistical procedures commonly used in the fields of education and psychology. Subjects covered include percentiles; standardized scores; methods of test construction; classical test theory; the reliability of a composite test; measures of internal consistency, including coefficient alpha; and procedures for item analysis. An overview of item response theory is also provided.
- Chapter 17, Data Management
Discusses practical issues in data management, including codebooks, the unit of analysis, procedures to troubleshoot an existing file, methods for storing data electronically, string and numeric data, and missing data.
- Chapter 18, Research Design
Discusses observational and experimental studies, common elements of good research designs, the steps involved in data collection, types of validity, and methods to limit or eliminate the influence of bias.
- Chapter 19, Communicating with Statistics
Covers general issues about communicating statistical information to different audiences and then provides more detail about writing for a professional journal, for the general public, and for the workplace.
- Chapter 20, Critiquing Statistics Presented by Others
Offers guidelines for reviewing the use of statistics, including a checklist of questions to ask of any statistical presentation and examples of when legitimate statistical procedures may be manipulated to support questionable conclusions.
Six appendixes cover topics that are a necessary background to the material covered in the main text and provide references to supplemental reading:
- Appendix A
Provides a self-test and review of basic arithmetic and algebra for people whose memory of their last math course is fast receding on the distant horizon. Topics covered include the laws of arithmetic, exponents, roots and logs, methods to solve equations and systems of equations, fractions, factorials, permutations, and combinations.
- Appendix B
Provides an introduction to some of the most common computer programs used for statistical applications, demonstrates basic analyses in each program, and discusses their relative strengths and weaknesses. Programs covered include Minitab, SPSS, SAS, and R; the use of Microsoft Excel (not a statistical package) for statistical analysis is also discussed.
- Appendix C
An annotated bibliography organized by chapters that includes published works and websites cited in the text and others that are good starting points for people researching a particular topic.
- Appendix D
Includes tables for the most commonly used statistical distributions—normal, t, binomial, and chi-square—as well as directions for using the tables. Even in the age of the computer and the Internet, it’s worth knowing how to read a distribution table, and it’s convenient to have the tables available in printed form.
- Appendix E
A bibliography of some of the best sites on the Internet for people who are learning, using, or teaching statistics. This appendix is organized into general resources, glossaries, probability tables, online calculators, and online textbooks.
- Appendix F
Includes a table of the Greek alphabet (the bane of many a beginning statistician), a table of statistical notation, and a brief glossary of the major statistical terms used in this book.
This book is a tool that can be adapted according to the background and needs of individual readers. Some of the chapters cover subjects that are often skipped in introductory statistics books but that I think are important; these include data management, writing about statistics, and reading statistical articles written by others. These chapters also serve as useful references for people who suddenly find themselves placed in charge of managing the data for a project or who have been appointed, more or less out of the blue, to create a statistical presentation about their team’s work. Neither scenario, unfortunately, is particularly uncommon.
Classification of what is elementary and what is advanced depends on an individual’s background and purposes. I designed Statistics in a Nutshell to answer the needs of many types of users. For this reason, there’s no perfect way to organize the material to meet everyone’s needs, which brings us to an important point: there’s no reason you should feel the need to read the chapters in the order they are presented here. Statistics presents many chicken-and-egg dilemmas. For instance, you can’t design experiments without knowing what statistics are available to you, but you can’t understand how statistics are used without knowing something about research design. Similarly, it might seem logical that someone assigned to manage data should already have experience in statistical analysis, but I’ve advised many research assistants and project managers who are put in charge of large data sets before they’ve completed a single course in statistics. So use the chapters in the way that best facilitates your specific purposes, and don’t be shy about skipping around and focusing on whatever meets your particular needs.
Not all the material in this book will be relevant to everyone; this is most obviously the case with Chapters 14–16, which are written with particular subject areas in mind (business and quality improvement, medicine and epidemiology, and education and psychology, respectively). However, it’s wise to keep an open mind regarding what statistics you need to know. You might currently believe that you will never need to conduct a nonparametric test or a logistic regression analysis, but you never know what will come in handy in the future. It’s also a mistake to compartmentalize too much by subject field; because statistical techniques are ultimately about numbers rather than content, techniques developed in one field often prove to be useful in another. For instance, control charts (covered in Chapter 14) were developed in a manufacturing context but are now used in many fields from medicine to education, whereas the odds ratio (covered in Chapter 15) was developed in epidemiology but is now applied to all sorts of data.
The following typographical conventions are used in this book:
- Plaintext
Indicates menu titles, menu options, menu buttons, and keyboard accelerators (such as Alt and Ctrl).
- Italic
Indicates new terms, URLs, email addresses, filenames, file extensions, pathnames, directories, and Unix utilities.
Tip
This icon signifies a tip, suggestion, or general note.
Caution
This icon indicates a warning or caution.
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Statistics in a Nutshell by Sarah Boslaugh (O’Reilly). Copyright 2013 Sarah Boslaugh, 978-1-449-31682-2.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
Note
Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/stats-nutshell.
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Only one author is listed on the cover of this book, but the contributions of many people played a role in its creation.
I would like to thank my agent, Neil Salkind, for his continued guidance and support; the crew at O’Reilly, including Mary Treseler, Sarah Schneider, and Meghan Blanchette; and all the statisticians who assisted in the technical review process. I would also like to thank my nonstatistician friends who kept pestering me to explain statistical concepts to them and thus encouraged me to write this book, and my colleagues at the Center for Sustainable Journalism at Kennesaw State University for their forbearance and tolerance while I have been working on this revision. On a personal note, I would like to thank my former colleague Rand Ross at Washington University in St. Louis for helping me remain sane throughout the writing process for the first edition and my husband Dan Peck for being the very model of a modern supportive spouse.