
Practical Statistics for Data Scientists

Errata for Practical Statistics for Data Scientists, Second Edition


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version Location Description Submitted by Date submitted
PDF Page Statistical Significance and p-Values
Code example

This section uses the function perm_fun(), which was defined in the "Resampling" section. The original function calculates the difference in means between samples, but the example presented here needs the difference between proportions. A proposed solution is to create a new function that changes only this line of code:

return x.loc[list(idx_B)].mean() - x.loc[list(idx_A)].mean()

with this:

return x.loc[list(idx_B)].sum()/ nB - x.loc[list(idx_A)].sum()/ nA

(The code is in Python)

Anonymous  May 02, 2023 
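The proposed change to perm_fun() can be tried out on toy data. The following is a minimal sketch, not the book's code: it avoids pandas and assumes the outcomes are a plain list of 0/1 values. The function name perm_diff_proportions and the toy counts are hypothetical.

```python
import random


def perm_diff_proportions(outcomes, n_a, n_b):
    """One permutation draw: proportion in a random group B minus proportion in group A.

    Assumes outcomes is a sequence of 0/1 values of length n_a + n_b.
    """
    n = n_a + n_b
    idx_b = set(random.sample(range(n), n_b))   # n_b randomly chosen indices form group B
    idx_a = set(range(n)) - idx_b               # the remaining n_a indices form group A
    prop_b = sum(outcomes[i] for i in idx_b) / n_b
    prop_a = sum(outcomes[i] for i in idx_a) / n_a
    return prop_b - prop_a


# Toy data (not from the book): 6 conversions among 40 visitors.
outcomes = [1] * 6 + [0] * 34
random.seed(0)
perm_diffs = [perm_diff_proportions(outcomes, 25, 15) for _ in range(1000)]
```

For 0/1 data, the mean of a group is the same number as its sum divided by the group size, so on such data the mean-based line and the sum-based line return the same value.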
Printed Page AB testing

I had a question about A/B testing which I hope you can help with.

My situation is that I am building a propensity model to identify which of our company's customers are most likely to sign up to a service after being sent an email. The model produces a likelihood score between 0 and 1 for each member and ranks them from 1 to X, where X is the size of our customer base. In practice we would then select the top N from this list to email. So, if our customer base is 1 million members, we would then select say the 100,000 ranked most highly (i.e. that have the highest likelihood score) by the model.

I want to compare how good the model is at identifying likely signups compared to the current business rules. However, I do not know how to conduct an A/B test in this scenario. Or indeed whether an A/B test is the most appropriate test here.

I understand the usual principle of randomly splitting the population into two groups and applying a different treatment to each, such as a webpage layout or a drug treatment. However, in my case, the thing whose efficacy we are testing is the selection method itself (the email we send to each group would be the same). If we randomly split the population into two groups, there will likely be customers whom the model has ranked very highly but who fall in the business-rules group. That seems to me like an unfair test of the model, because we are not giving it the chance to prove itself: we are not letting it have all of its 'top picks'.

Do you have any advice on this?

Anonymous  Mar 19, 2024 
PDF Page 9
2nd paragraph

Hi, I would like to know the exact theme used in RStudio in the book Practical Statistics for Data Scientists.

I searched for the theme, but I couldn't find the exact one in RStudio as it appears in the book, nor the typography. It would be excellent to include that in a footnote on the same page where you give the GitHub link.

It may seem unimportant, but I am trying to practice the code, especially in R, and I feel that if I used the same theme it would be easier to find my own mistakes. It would be somehow more pedagogical for me.


P.S.: Congratulations on the amazing book. I'm a native Spanish speaker and I had never read a statistics book this well written; the explanations are accurate for those like me who are starting out in the data science universe.


Albar Ugalde   Oct 16, 2023 
Printed Page 13
python example on the top

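There was no clear way to install the package required for wquantiles. I searched online and could not find anyone with a solution. Even after "pip install wquantiles", I was not able to simply "import wquantiles." I had to find an alternative way to calculate the weighted median:

# Sort states by murder rate (without mutating the original DataFrame)
state_sorted = state.sort_values('Murder.Rate')
# Running total of population in murder-rate order
cumsum = state_sorted.Population.cumsum()
# Half of the total population is the weighted-median cutoff
cutoff = state_sorted.Population.sum() / 2
# First murder rate at which the running population reaches the cutoff
median = state_sorted['Murder.Rate'][cumsum >= cutoff].iloc[0]

With this I get the same result, 4.4.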

Katie Jones  Apr 22, 2024 
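The cumulative-sum approach above can also be written without pandas. This is a minimal sketch under one stated convention: the weighted median is taken as the smallest value whose cumulative weight reaches half of the total weight. The function name weighted_median and the toy numbers are hypothetical, and libraries that interpolate between neighboring values may return slightly different results.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))   # sort by value, carrying each weight along
    cutoff = sum(weights) / 2              # half of the total weight
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= cutoff:
            return value
    raise ValueError("values and weights must be non-empty")


# Toy data (not the book's state data): rates with population-like weights.
rates = [2.0, 5.0, 3.0, 9.0]
pops = [10, 1, 4, 2]
print(weighted_median(rates, pops))  # 2.0
```

Here the value 2.0 alone carries 10 of the 17 total weight units, so it already passes the cutoff of 8.5.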
Printed Page 53
Sample Mean Versus Population mean

The symbol used to represent the mean of population is missing.

Anonymous  Feb 07, 2022 
Printed Page 53
3rd paragraph

The symbol for the population mean is left out in "…whereas is used to represent the mean of a population."

John Taylor  Jul 30, 2022 
Printed Page 90
Third code chunk. The perm_fun function definition.

The book tells us to specify n2 = 15 and n1 = 21, corresponding to group B and group A, respectively. The function defines idx_b as sample(1:n, n1). This does not make sense to me, because it draws 21 indices, which is actually the number of observations in group A.

Should the function instead be like below?

perm_fun <- function(x, n1, n2) { # n1 = n in group A, n2 = n in group B
  n <- n1 + n2
  idx_b <- sample(1:n, n2)        # draw n2 indices for group B
  idx_a <- setdiff(1:n, idx_b)    # the remaining n1 indices form group A
  mean_diff <- mean(x[idx_b]) - mean(x[idx_a])
  return(mean_diff)
}

Anonymous  Apr 27, 2024 
Printed, PDF Page 99
4th paragraph, which starts with "Page B has session times that are greater than those of page A by 35.67 seconds"

I think that in "The question is whether this difference is within the range of what random chance might produce, i.e., is statistically significant", the final clause "i.e., is statistically significant" should be removed.

Mohammed Kamal Alsyd  Jul 21, 2023 
Printed Page 137
3rd paragraph

The earlier draws presented on the previous page and on this page (137) suggest that whatever number of ones Box A takes, Box B gets as many zeros as the remainder of the whole, which is 10,000.
That the whole is 10,000 and that Box B gets the remainder of what Box A got may not have been meant as a strict rule. If it was, then for consistency, in the 3rd paragraph of page 137, with the boosted value of 165 ones (1.65%) for Box A, Box B should also equal the remainder of 10,000, which would be 9,835 (not 9,868).

Emir Bilim  Dec 24, 2021 
Printed Page 278
Inset on Ridge regression and the Lasso

The indices of X should be X_{p,i} (cf. p. 151), but as it currently stands we have X_i and X_p. Shouldn't `i` refer to the example index and `p` refer to the dimension index?

Amine Laghaout  Feb 15, 2023
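For reference, the two-index convention the submitter proposes can be written out for the standard ridge and lasso objectives. This is a sketch in generic notation, not a quotation of the book's inset: i indexes the n observations, j indexes the p predictors, and \lambda is the penalty weight.

```latex
\begin{align*}
\text{ridge:}\quad & \sum_{i=1}^{n}\Bigl(Y_i - \hat{b}_0 - \sum_{j=1}^{p}\hat{b}_j X_{j,i}\Bigr)^{2} + \lambda \sum_{j=1}^{p}\hat{b}_j^{2} \\
\text{lasso:}\quad & \sum_{i=1}^{n}\Bigl(Y_i - \hat{b}_0 - \sum_{j=1}^{p}\hat{b}_j X_{j,i}\Bigr)^{2} + \lambda \sum_{j=1}^{p}\bigl|\hat{b}_j\bigr|
\end{align*}
```

Written this way, the first subscript of X is the predictor (dimension) index and the second is the observation index, matching the reading suggested in the question.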