The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Version |
Location |
Description |
Submitted By |
Date submitted |
Date corrected |
Other Digital Version |
Example notebook (3)
F-Statistic section |
There are two functions that are used. As far as I understand, they should return the same result. This is not the case with the code as it is written.
model = smf.ols('Time ~ Page', data=four_sessions).fit()
aov_table = sm.stats.anova_lm(model)
print(aov_table)
df sum_sq mean_sq F PR(>F)
Page 3.0 831.4 277.133333 2.739825 0.077586
Residual 16.0 1618.4 101.150000 NaN NaN
res = stats.f_oneway(four_sessions[four_sessions.Page == 'Page 1'].Time,
four_sessions[four_sessions.Page == 'Page 2'].Time,
four_sessions[four_sessions.Page == 'Page 3'].Time,
four_sessions[four_sessions.Page == 'Page 4'].Time)
print(f'F-Statistic: {res.statistic / 2:.4f}')
print(f'p-value: {res.pvalue / 2:.4f}')
F-Statistic: 1.3699
p-value: 0.0388
As we can see, the first F-statistic and p-value are two times bigger than the second ones. But there is no explanation at all to tell the reader why...
To get the same result, I had to pivot the data frame before the call to f_oneway:
four_sessions['index'] = four_sessions.reset_index().index // 4
p_sessions = four_sessions.pivot(index='index', columns='Page', values='Time')
r = stats.f_oneway(p_sessions['Page 1'], p_sessions['Page 2'], p_sessions['Page 3'], p_sessions['Page 4'])
print(r)
F_onewayResult(statistic=2.739825341901467, pvalue=0.0775862152580146)
Note from the Author or Editor: This only impacts the jupyter notebook.
The code with the error (division by two when printing the F-statistic and the p-value) is not included in the book. The mistake was due to copy/paste from the t_test example code.
The jupyter notebook contains the correct code now.
|
Fabrice Kinnar |
May 06, 2020 |
Jun 19, 2020 |
|
Page ch 1
text |
In Chapter 1, the external link "step-by-step guide to creating a boxplot" at location 684 does not work. Please update with a valid external URL.
Note from the Author or Editor: The link:
https://oreil.ly/wTpnE
should be replaced with:
https://web.archive.org/web/20190415201405/https://www.oswego.edu/~srp/stats/bp_con.htm
|
Anonymous |
Mar 17, 2022 |
|
|
Page Ch 2
text |
In Chapter 2, the external link "Fooled by Randomness Through Selection Bias" at location 1347 does not work. Please update with a valid external URL.
Note from the Author or Editor: This is referring to link https://oreil.ly/v_Q0u
The correct link is now:
https://www.priceactionlab.com/Blog/2012/06/fooled-by-randomness-through-selection-bias/
|
Anonymous |
Mar 17, 2022 |
|
|
Page Page 37
The second last code snippet |
The R code snippet will not generate a figure similar to Figure 1-8. But the Python code snippet at the bottom of the same page will.
Note from the Author or Editor: This is a temporary issue that was introduced in version 3.4.0 of ggplot. The ggplot developers are aware of the problem and have fixed it. An updated version has not been released yet.
https://github.com/tidyverse/ggplot2/pull/5045
https://github.com/tidyverse/ggplot2/issues/5037
|
Jiamin Wang |
Jan 05, 2023 |
|
Printed, PDF, ePub, Mobi, Other Digital Version |
Page Python code using the get_dummies function
NA |
The behavior of the get_dummies function has changed recently. Instead of creating integer columns containing 0 and 1, the function now creates boolean columns with True and False values. This causes statsmodels model building to fail with an exception.
To revert to the original behavior, add the keyword argument `dtype=int` to the get_dummies function calls.
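For illustration, a minimal sketch of the fix (the data frame and column here are made up for the example):
import pandas as pd

df = pd.DataFrame({'purpose': ['car', 'major_purchase', 'small_business']})
# dtype=int restores the 0/1 integer dummy columns that statsmodels expects;
# without it, recent pandas versions return boolean (True/False) columns.
dummies = pd.get_dummies(df, dtype=int)
print(dummies.dtypes)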
|
Peter Gedeck |
May 24, 2023 |
|
|
Page Example: Web Stickiness
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch03.html#:-:text=The%20question%20is%20whether,e.%2C%20is%20statistically%20significant. |
In the sub-section Example: Web Stickiness of Permutation Test of Resampling of Chapter 3, there is a conflict between two statements as below:
S1: Page B has times that are greater than those of page A by 35.67 seconds, on average. The question is whether this difference is within the range of what random chance might produce, i.e., is statistically significant.
--> The conclusion "i.e., is statistically significant" seems to be misleading when compared to the following statement:
S2: This suggests that the observed difference in time between page A and page B is well within the range of chance variation and thus is not statistically significant.
In brief, I think it should be "i.e., is not statistically significant." in S1.
Note from the Author or Editor: p. 99, center paragraph, second sentence should read: "The question is whether this difference is within the range of what random chance might produce, i.e. is not statistically significant." [the "not" had been left out]
|
Anonymous |
May 27, 2023 |
|
Printed |
Page 4
Further Reading |
The first bullet point in "Further Reading" is repeated in the second half of the second bullet point.
Delete first bullet point
Added as gitlab issue
|
Peter Gedeck |
Jun 06, 2020 |
Jun 19, 2020 |
PDF, ePub |
Page 4
First and second bullets in the "Further Reading" section. |
The link to the pandas documentation ( https://oreil.ly/UGX-4 ) results in a 404 error. The O'Reilly redirect appears to attempt to access https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes .
Note from the Author or Editor: We need to change the redirect
https://oreil.ly/UGX-4
to redirect to:
https://pandas.pydata.org/docs/user_guide/basics.html#dtypes
Ideally, this can be done without changing the short URL
--redirect all set (O'Reilly errata team)
|
Matt Slaven |
Mar 29, 2021 |
Mar 30, 2021 |
|
Page 19
3d bullet point of Key Ideas |
The bullet point suggests that the mean absolute deviation is robust, which contradicts the 2nd paragraph of page 16.
Note from the Author or Editor: We change the 2nd and 3rd paragraphs on page 16 to:
Neither the variance, the standard deviation, nor the mean absolute deviation is fully robust to outliers and extreme values
(see <<Median>> for a discussion of robust estimates for location).
The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations;
more robust is the _median absolute deviation from the median_ or MAD:
Gitlab code is updated 2021-01-04.
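To illustrate the sensitivity difference, a small sketch with hypothetical data containing one extreme value (assumes SciPy 1.5+ for stats.median_abs_deviation):
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 100])       # hypothetical sample with an outlier
print(np.std(x, ddof=1))                  # standard deviation: strongly inflated by the outlier
print(np.mean(np.abs(x - np.mean(x))))    # mean absolute deviation: less sensitive
print(stats.median_abs_deviation(x))      # median absolute deviation from the median (MAD): robust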
|
Anonymous |
Dec 29, 2021 |
|
Printed |
Page 27
1st paragraph |
Text currently states:
...flights by the cause of delay at Dallas/Fort Worth Airport since 2010.
should be:
...flights by the cause of delay at Dallas/Fort Worth Airport in 2010.
|
Peter Gedeck |
Sep 16, 2020 |
Oct 02, 2020 |
|
Page 44
Ordered Item 1 |
The writer's statement that ggplot has the functions facet_wrap and facet_grid is unclear. It is unclear because the writer instructs the reader to use the function facet_grid in R but does not provide the R syntax. The Python facet_grid syntax is provided on page 45.
Note from the Author or Editor: The example uses facet_wrap as there is only one conditioning variable. The R function facet_wrap will, by default, set the number of rows and columns in such a way that the resulting grid is close to square. In the example, this leads to a 2x2 grid. If there are two conditioning variables, you would need to use facet_grid.
In general, we recommend consulting the package documentation. The package ggplot comes with comprehensive documentation at https://ggplot2.tidyverse.org/index.html.
I'm going to add a sentence to the manuscript to highlight the fact that facet_grid would be used for two conditioning variables.
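For readers working in Python, a rough seaborn analogue (not the book's code; it uses seaborn's bundled tips example dataset as stand-in data) illustrates the same distinction:
import seaborn as sns

tips = sns.load_dataset('tips')  # stand-in example data, not the book's data

# One conditioning variable: panels are wrapped into a near-square grid (like facet_wrap)
sns.displot(tips, x='total_bill', col='day', col_wrap=2)

# Two conditioning variables: panels are laid out on a row-by-column grid (like facet_grid)
sns.displot(tips, x='total_bill', row='sex', col='time')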
|
Stephen Dawson |
Mar 10, 2022 |
|
Printed |
Page 66
end of page |
The mean of the sample of 20 datasets that was used to generate Figure 2-9 was $55,734.
Replace $62,231 with $55,734.
|
Peter Gedeck |
Jun 06, 2020 |
Jun 19, 2020 |
Printed |
Page 66
End of last paragraph |
This was already changed once to $55,836, but the actual value should be $55,734. I remember that I found this confusing too, so I suggest we add a clarification to this.
... for which the mean was $55,734. Note that this is the mean of the subset of 20 records and not the mean of the bootstrap analysis, $55,836.
Changed in repository
|
Peter Gedeck |
Sep 16, 2020 |
Oct 02, 2020 |
PDF |
Page 79
Second to last paragraph in key terms box |
In the key term box, under "Binomial distribution" the sentence reads as follows: "Distribution of number of successes in x trials."
However, I think it should read "n trials" for the sake of consistency with the first sentence following the key terms box, where it reads: "The binomial distribution is the frequency distribution of the number of successes (x) in a given number of trials (n) with specified probability (p) of success in each trial."
I find it confusing that in the sentence after the box the number of trials is abbreviated with n, while in the box it is abbreviated as x.
Best regards,
Michael
Note from the Author or Editor: Thank you for the feedback.
I checked other uses in the book and we consistently use _n_ trials. We will change this. (Done in Gitlab)
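As a small illustration of the notation (the numbers are arbitrary):
from scipy import stats

# Probability of exactly x = 2 successes in n = 5 trials with success probability p = 0.1
print(stats.binom.pmf(k=2, n=5, p=0.1))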
|
Michael Ustaszewski |
Nov 05, 2020 |
|
PDF |
Page 84
2nd line of code |
The mean of the random values generated using the rexp(n=100, rate=0.2) function in R is ~5, which makes sense given that the mean number of events per time period is 0.2. However, for the Python code given in the book as stats.expon.rvs(0.2, size=100) we have the mean of the random values generated ~1.2, where loc=0.2 is the starting location for the exponential distribution. To get the same range of random values as those obtained with R we need to use stats.expon.rvs(scale=5, size=100) instead.
Note from the Author or Editor: The errata is correct and requires a change in the book.
Suggested change:
The +scipy+ implementation in +Python+ specifies the exponential distribution using +scale+ instead of rate. With scale being the inverse of rate, the corresponding command in Python is:
.Python
[source,python]
----
stats.expon.rvs(scale=1/0.2, size=100)
stats.expon.rvs(scale=5, size=100)
----
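A quick sanity check of the corrected call (the sample mean should be close to 1/rate = 5, up to sampling noise):
import numpy as np
from scipy import stats

np.random.seed(1)
sample = stats.expon.rvs(scale=5, size=100)   # scale = 1/rate = 1/0.2
print(np.mean(sample))                        # roughly 5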
|
Joao Correia |
Sep 05, 2020 |
Oct 02, 2020 |
PDF |
Page 98
4th and 5th paragraphs |
In Google Analytics, the average session time does not measure the time spent on a given page (as stated in the book); the correct metric is average time on page. Furthermore, in the last paragraph we have "Also note that Google Analytics, which is how we measure session time, cannot measure session time for the last session a person visits." I think it would be more correct to say: Also note that Google Analytics, which is how we measure average time on page, cannot measure the time spent on the last page within a session. Finally, Google Analytics does indeed set the time spent on the last page in a session to zero, and a single-page session is also set to zero. Having said that, this is true only if there are no user interaction events triggered on that page, such as click events, scroll events, video events, etc.
Note from the Author or Editor: Thank you for the feedback. We change the text in the book to:
Also note that Google Analytics, which is how we measure average time on page, cannot measure the time spent
on the last page within a session. ((("Google Analytics")))
Google Analytics will set the time spent on the last page in a session to zero, unless the user interacts with the page, e.g. clicks or scrolls. This is also the case for single-page sessions. The data requires additional processing to take this into account.
|
Joao Correia |
Sep 06, 2020 |
Oct 02, 2020 |
|
Page 122
first paragraph |
For the grand average, sum of squares is the departure of the grand average from 0, squared, times 20 (the number of observations). The degrees of freedom for the grand average is 1, by definition.
- The degrees of freedom for the grand average is 19, not 1. Also, I think the whole page needs review, since the code results don't match the written text; for example, "For the residuals, degrees of freedom is 20 (all observations can vary)" while it is actually 16, not 20.
Note from the Author or Editor: The last sentence in this paragraph, "The degrees of freedom for the grand average is 1, by definition." should be eliminated, without a replacement.
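For reference, the degrees-of-freedom bookkeeping behind the correction (counts taken from the four_sessions example: 20 observations in 4 groups):
n_obs, n_groups = 20, 4
df_treatment = n_groups - 1     # 3, the "Page" row in the ANOVA table
df_residual = n_obs - n_groups  # 16, the "Residual" row (not 20)
print(df_treatment, df_residual)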
|
Mohammed Kamal Alsyd |
May 05, 2023 |
|
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 127
Python code end of page |
Issue reported on github repository:
The following code prints the chi2 value calculated using the permutation test (chi2observed) instead of the chi2 value computed using the scipy stats module (chisq).
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')
I believe the first print line should be:
print(f'Observed chi2: {chisq:.4f}'), since the purpose is to demonstrate using the chi2 module for statistical tests rather than the previous section's permutation test.
This is correct. Code and book text corrected.
|
Peter Gedeck |
Apr 09, 2021 |
|
Printed |
Page 170
Key terms box - first item |
Change definition of `Correlated variables` to
Variables that tend to move in the same direction - when one goes up, so does the other, and vice versa (with negative correlation, when one goes up the other goes down). When the predictor variables are highly correlated, it is difficult to interpret the
individual coefficients.
|
Peter Gedeck |
Sep 16, 2020 |
Oct 02, 2020 |
Printed, PDF, ePub |
Page 175
2nd paragraph |
Regarding the paragraph
"Location and house size appear to have a strong interaction.
For a home in the lowest +ZipGroup+,
the slope is the same as the slope for the main effect +SqFtTotLiving+,
which is $118 per square foot (this is because _R_ uses _reference_ coding for factor variables; see 'Factor Variables in Regression').
For a home in the highest +ZipGroup+,
the slope is the sum of the main effect plus +SqFtTotLiving:ZipGroup5+,
or $115 + $227 = $342 per square foot.
In other words, adding a square foot in the most expensive zip code group boosts the predicted sale price by a factor of almost three, compared to the average boost from adding a square foot."
I am thinking about two things:
1.) The coefficient for +SqFtTotLiving+ is 1.148e+02, but it is stated that "the main effect +SqFtTotLiving+ [...] is $118 per square foot". I think it should be adjusted to $115 as mentioned in the subsequent sentence.
2.) Since R uses reference coding (and not deviation coding), I wonder whether the last sentence is correct. Is it really the "average boost from adding a square foot" you compare to with the total effect of the most expensive zip code group? I mean, if you don't include any interaction effect, the coefficient of +SqFtTotLiving+ would be the "average boost" (as far as I think about it). But in the setting with an interaction effect and reference coding, I would have interpreted it as "compared to the average boost for the lowest zip code group". Or am I wrong, and the average boost is the same as the main effect, which in turn is equal for the first ZipGroup?
Best regards
Note from the Author or Editor: Thank you for your feedback. This corresponds to page 175 second paragraph in the print edition.
1) $118 should be replaced with $115
2) We are going to change the end of the second paragraph for clarification to:
... to the average boost from adding a square foot in the lowest zip code group.
Gitlab is changed.
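A small sketch of the arithmetic behind the quoted slopes (coefficients rounded from the book's output):
main_effect = 115         # SqFtTotLiving coefficient; reference level is the lowest ZipGroup
interaction_zip5 = 227    # SqFtTotLiving:ZipGroup5 interaction coefficient
slope_lowest_zip = main_effect                       # $115 per square foot
slope_highest_zip = main_effect + interaction_zip5   # $342 per square foot
print(slope_lowest_zip, slope_highest_zip)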
|
Marcus Fraaß |
Nov 10, 2020 |
|
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 191
Figure 4-12 |
I believe that Figure 4-12 on page 191 is in error because the code used to generate it (Chapter 4 - Regression and Prediction.R from the practical-statistics-for-data-scientists-master.zip file) appears to be in error.
The code states:
terms1 <- predict(lm_spline, type='terms')
partial_resid1 <- resid(lm_spline) + terms
but surely partial_resid1 should be:
partial_resid1 <- resid(lm_spline) + terms1
which would give rise to a slightly different plot?
Note from the Author or Editor: I can confirm the error in the R code. The R code is not printed in the book, but the image created is. As mentioned in the errata, the difference in the plot is only small.
I changed the code to create the correct plot.
New figure file images/psds_0412.png added to book repository. This file will need to be processed (cropping the whitespace) to replace the file psds2_0412.png.
|
Gabriel Simmonds |
Apr 25, 2021 |
|
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 200
predicted probabilities |
A reader reported different results for the predictions from the Naive Bayes model. The change was caused by the following. In version 4 of R, read.csv no longer converts string columns automatically into factors. The old behavior can be restored by setting stringsAsFactors=TRUE .
There is no change required in the book. The GitHub repository will be updated with the change.
|
Peter Gedeck |
Feb 27, 2021 |
|
Printed |
Page 213
7th (4th of paragraph "Interpreting the Coefficients and Odds Ratios") |
Regarding the paragraph
"An example will make this more explicit.
For the model fit in "Logistic Regression and the GLM" on page 210,
the regression coefficient for +purpose_small_business+ is 1.21526.
This means that a loan to a small business compared to a loan to pay off credit card debt reduces the odds of defaulting versus being paid off by exp(1.21526) ≈ 3.4.
Clearly, loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans."
Suggested change:
This means that a loan to a small business compared to a loan to pay off credit card debt *increases* the odds of defaulting versus being paid off by exp(1.21526) ≈ 3.4.
Best regards
Note from the Author or Editor: The errata is correct. Gitlab document changed accordingly - PG
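The odds multiplier quoted above can be reproduced directly:
import math

# exp of the purpose_small_business coefficient gives the multiplicative change in the odds of default
print(math.exp(1.21526))  # ≈ 3.37, i.e. roughly a 3.4-fold increase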
|
Marcus Fraaß |
Nov 17, 2020 |
|
|
Page 213
(3rd of paragraph "Interpreting the Coefficients and Odds Ratios") |
Why bother with an odds ratio rather than probabilities? We work with odds because
the coefficient βj in the logistic regression is the log of the odds ratio for Xj .
Anyway, we can't state that the coefficient βj is the log of the odds ratio for Xj, since that would mean we take the log of a log.
I think the correct statement should be: "We work with odds because the coefficient βj in the logistic regression is the *change in* the log of the odds ratio for Xj."
Note from the Author or Editor: In the cited location..
EXISTING
Why bother with an odds ratio rather than probabilities? We work with odds because the coefficient βj in the logistic regression is the log of the odds ratio for Xj .
CHANGE TO (end of second sentence)
Why bother with an odds ratio rather than probabilities? We work with odds because the coefficient βj in the logistic regression is the CHANGE IN the log(Odds(Y=1)) associated with a change in Xj.
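For reference, the corrected wording points back to the standard logistic regression relationship:
latexmath:[$\log(\text{Odds}(Y=1)) = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q$]
so a one-unit change in Xj, with the other predictors held fixed, changes log(Odds(Y=1)) by βj and multiplies the odds by exp(βj).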
|
Mohammed Kamal Alsyd |
Jun 20, 2023 |
|
Printed |
Page 217
R code block at end of page |
On page 217 of the printed book (2nd edition), the R code at the end of the page reads:
terms <- predict(logistic_gam, type='terms')
partial_resid <- resid(logistic_model) + terms
df <- data.frame(payment_inc_ratio = loan_data[, 'payment_inc_ratio'],
terms = terms[, 's(payment_inc_ratio)'],
partial_resid = partial_resid[, 's(payment_inc_ratio)'])
I believe that partial_resid here should be:
partial_resid <- resid(logistic_gam) + terms
I'm not sure if the graph produced on page 218 (Figure 5-4) using this data needs correction or not, as the difference using logistic_model and logistic_gam is quite minor, and it is hard to tell comparing a screenshot and the printed page.
Note from the Author or Editor: The line needs to be changed in the asciidoc code. It is already corrected in the book's Github repository; however, I overlooked changing the book text. That is now corrected too.
|
Gabriel Simmonds |
May 11, 2021 |
|
Printed |
Page 240
2nd paragraph |
Since the R code yields TRUE for the prediction knn_pred == 'paid off', the sentence
"The KNN prediction is for the loan to default."
seems to be wrong and "default" should be replaced with "be paid off".
Note from the Author or Editor: This is correct. The sentence should read:
The KNN prediction is for the loan to be paid off.
|
Marcus Fraaß |
Dec 06, 2020 |
|
Printed |
Page 257
Section 'Controlling tree complexity in _Python_' |
scikit-learn implements tree-complexity pruning as in R
In version 0.22, scikit-learn implemented tree complexity pruning for decision trees.
https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py
https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
Replace with:
===== Controlling tree complexity in _Python_
In +scikit-learn+'s decision tree implementation, the complexity parameter is called +ccp_alpha+. The default value is 0, which means that the tree is not pruned; increasing the value leads to smaller trees. You can use GridSearchCV to find an optimal value.
There are a number of other model parameters that allow controlling the tree size. For example, we can vary +max_depth+ in the range 5 to 30 and +min_samples_split+ between 20 and 100. The +GridSearchCV+ method in +scikit-learn+ is a convenient way to combine the exhaustive search through all combinations with cross-validation. An optimal parameter set is then selected using the cross-validated model performance.
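A minimal sketch of the approach described above (the parameter grid and the make_classification stand-in data are illustrative, not the book's loan data):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)   # stand-in data for illustration

# Larger ccp_alpha values prune more aggressively and give smaller trees
param_grid = {'ccp_alpha': [0.0, 0.001, 0.01, 0.1]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)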
|
Peter Gedeck |
Sep 16, 2020 |
Oct 02, 2020 |
Printed |
Page 279
1st paragraph inside box |
The sentence starting with "The xgboost parameters..." is duplicated in the second paragraph.
Delete first paragraph.
|
Peter Gedeck |
Jun 06, 2020 |
Jun 19, 2020 |
Printed |
Page 302
last paragraph |
"Figure 7-7 shows the cumulative percent of variance explained for the default data for the number of clusters ranging from 2 to 15."
Just a few minor things here:
- "2 to 15" should be replaced by "2 to 14"
- "default data" should be replaced by "stock data"
- Due to harmonization, the python code on the following page might be adjusted, so that range(2, 15) is used instead of range(2, 14).
Note from the Author or Editor: All suggestions confirmed.
Book text changed.
|
Marcus Fraaß |
Dec 06, 2020 |
|
Printed |
Page 306
Python code middle |
Due to a change in one of the Python packages, the code causes an error. The following code is working:
fig, ax = plt.subplots(figsize=(5, 5))
dendrogram(Z, labels=list(df.index), color_threshold=0)
plt.xticks(rotation=90)
ax.set_ylabel('distance')
Book text changed
|
Peter Gedeck |
Dec 07, 2020 |
|
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 323
First code snippet |
The output of the first line of x is incorrect and should be major_purchase instead of car.
> x
dti payment_inc_ratio home_ purpose_
1 1.00 2.39320 RENT major_purchase
2 5.55 4.57170 OWN small_business
3 18.08 9.71600 RENT other
4 10.08 12.21520 RENT debt_consolidation
5 7.06 3.90888 RENT other
gitlab code corrected
|
Peter Gedeck |
Feb 22, 2021 |
|
|
Page 441
The Boosting Algorithm section, step 3 |
The equation for alpha_m is surely wrong as in my kindle app it is shown as
alpha_m = (log 1 - e_m)/e_m
This can't be right as it would simplify to -1
According to wikipedia section on adaboost example, I suppose the formula should be alpha_m = 1/2 * ln ((1 - e_m)/e_m)
Which would make more sense
Note from the Author or Editor: I can confirm the issue; it needs to be corrected as suggested.
- Gitlab updated to latexmath:[$\alpha_m = \frac12 \log\frac{1 - e_m}{e_m}$]
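As a quick sanity check of the corrected formula:
import numpy as np

def alpha_m(e_m):
    # Weight given to a weak learner with weighted error e_m (corrected formula above)
    return 0.5 * np.log((1 - e_m) / e_m)

print(alpha_m(0.3))  # positive: a learner better than chance gets positive weight
print(alpha_m(0.5))  # 0.0: a learner no better than chance gets no weight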
|
Tapani Raunio |
Dec 06, 2021 |
|
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 2735/9783
5th paragraph |
In the F-Statistics section :
"For the residuals, degrees of freedom is 20 (all observations can vary), and SS is the sum of squared difference between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom."
Bruce, Peter; Bruce, Andrew; Gedeck, Peter. Practical Statistics for Data Scientists (Kindle Locations 2760-2762). O'Reilly Media. Kindle Edition.
When you run the ANOVA in R or Python, you get 16 for df in Residuals, not 20!!
Note from the Author or Editor: The text should read:
For the residuals, degrees of freedom is 16 (20 observations, 16 of which can vary after the grand mean and the treatment means are set), and SS is the sum of squared difference between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom.
|
Fabrice Kinnar |
May 06, 2020 |
Jun 19, 2020 |