Errata for Data Science from Scratch

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version Location Description Submitted by Date Submitted
Printed Page 168
iris_data = [parse_iris_row(row) for row in reader]

iris_data = [parse_iris_row(row) for row in reader]
... should be...
iris_data = [parse_iris_row(row) for row in reader if row]

Adding "if row" skips the empty rows that exist at the end of the iris data file.
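A minimal sketch of the proposed guard; parse_iris_row here is a hypothetical stand-in for the book's parser, and the two data rows are illustrative:

```python
import csv
import io

def parse_iris_row(row):
    """Hypothetical stand-in for the book's parser: measurements plus a label."""
    return [float(x) for x in row[:-1]], row[-1].split("-")[-1]

# Simulate a data file with a trailing blank line, as in iris.data.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n\n"
reader = csv.reader(io.StringIO(raw))

iris_data = [parse_iris_row(row) for row in reader if row]   # "if row" skips blanks

print(len(iris_data))   # 2: the blank final line is ignored
```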

Karl Wilson  Apr 20, 2024 
Printed Page p. 107
both at top and bottom

I submitted a "numerical overflow" issue and said I fixed it by unitizing the gradient average. Never mind: much later I found that I had a typo in vector_mean. Fixing that resolves everything. Sorry.

David Barton Cooke  Mar 05, 2024 
Printed Page p. 107
both at top and bottom

When running the code at the bottom of the page, trying to find the slope and intercept of the line, the code does as asked for [-14, 14] input range, but I get numeric overflow for [-15, 15] (and larger).

I can fix that by unitizing grad in the linear gradient function (top of page), so that I have a direction, but not a magnitude. I'm not sure that's quite kosher.
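The "unitizing" workaround the submitter describes can be sketched like this. linear_gradient follows the book's setup for fitting y = 20x + 5; the unit() helper, the data range, and the hyperparameters are illustrative assumptions:

```python
import math
import random

def linear_gradient(x: float, y: float, theta):
    """Gradient of the squared error for one data point."""
    slope, intercept = theta
    predicted = slope * x + intercept
    error = predicted - y
    return [2 * error * x, 2 * error]

def unit(v):
    """Scale v to magnitude 1 (a direction, not a magnitude), guarding against zero."""
    magnitude = math.sqrt(sum(v_i ** 2 for v_i in v)) or 1.0
    return [v_i / magnitude for v_i in v]

random.seed(0)
inputs = [(x, 20 * x + 5) for x in range(-15, 16)]   # the range that overflowed
theta = [random.uniform(-1, 1), random.uniform(-1, 1)]
learning_rate = 0.1

for _ in range(5000):
    # Unitize each point's gradient before averaging, so step sizes stay bounded.
    grads = [unit(linear_gradient(x, y, theta)) for x, y in inputs]
    mean_grad = [sum(g[i] for g in grads) / len(grads) for i in range(2)]
    theta = [theta[i] - learning_rate * mean_grad[i] for i in range(2)]

# theta ends near the true slope 20 and intercept 5, with no overflow.
```

Because each per-point step is capped at the learning rate, nothing can blow up; the trade-off is that the final answer oscillates within roughly one learning-rate of the optimum.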

Dave Cooke  Mar 01, 2024 
Printed Page 86
For Further Information second bullet point

Link to Introduction to Probability is broken. New link is ~prob/prob/prob.pdf

Jamie Mellway  Aug 19, 2023 
Other Digital Version Section 11
Code for normal_pdf function

While the code for the return statement is accurate, it is slightly difficult to follow because it is structured differently from the equation given above it. It could be made easier to comprehend by utilizing parentheses like
math.exp((-(x-mu)**2/2)/sigma**2)/(SQRT_TWO_PI*sigma)
or
math.exp(-(x-mu)**2/(2*sigma**2))/(SQRT_TWO_PI*sigma)

This would make it easier to understand the order of operations.
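For reference, a self-contained sketch of normal_pdf using the second parenthesization. Note the denominator must be SQRT_TWO_PI * sigma (that is, sigma times the square root of 2 pi), matching the book's return statement:

```python
import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    # Parenthesized to mirror the written formula:
    #   f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)

print(round(normal_pdf(0), 4))   # standard normal density at 0: 0.3989
```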

Neelakantan  Nov 24, 2022 
PDF Page Chapter 7. Hypothesis and Inference - Example: Flipping a Coin
the first paragraph before the next section; p-Values

"Imagine instead that our null hypothesis was that the coin is not biased
toward heads, or that p ≤ 0. 5. In that case we want a one-sided test that
rejects the null hypothesis when X is much larger than 500 but not when X
is smaller than 500. So, a 5% significance test involves using
normal_probability_below to find the cutoff below which 95% of the
probability lies: ...."

In the first line should not "our null hypothesis" be replaced with "our alternative hypothesis"? Because the null hypothesis is always the default one and the alternative hypothesis is the scenario we may change and test to evaluate its compliance with the real world.

Milad N Rahbar  Nov 18, 2022 
Printed Page 95
4th paragraph

For the first example of the A/B testing code ["tastes great" 200 clicks, "less bias" 180 clicks] the books says: "The probability of seeing such a large difference if the means were actually equal..."
Isn't "large" a misleading quantifier in this case, as the difference is not significant?

Steffen  Jun 04, 2022 
Printed Page 144
1st

The standard deviation calculated is the sample standard deviation, not the population standard deviation. In this example, you never mention that the vectors used in the calculation are part of a sample and not an entire population. In the text, you also don't specify which you intend to calculate.
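The distinction can be illustrated with the standard library; the data here is made up:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample

pop_sd = statistics.pstdev(data)    # population: divides by n
sample_sd = statistics.stdev(data)  # sample: divides by n - 1 (Bessel's correction)

print(pop_sd)               # 2.0
print(round(sample_sd, 3))  # 2.138
```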

Mateusz Rakowski  May 22, 2022 
PDF Page 157
End of 9th paragraph, last sentence

I think that in this sentence, "(Of course the model that performed best on the test set is going to perform well on the test set)", the second "test set" should be "training set". It makes no sense with two test sets.

Anonymous  May 02, 2022 
Printed Page 89
3

In your normal_two_sided_bounds function, defining the tail_probability as (1-probability)/2 makes the upper_bound < lower_bound in your return result which then feeds into an incorrect answer on page 90 regarding the result of power = 1 - type_2_probability. To produce the correct answer, you should subtract the tail_probability from 1, and use this value instead of tail_probability inside the calls to normal_lower_bound and normal_upper_bound. The use of an assert statement would have been perfect to validate your answer on page 90 which would have caught the bug on page 89.

Mateusz Rakowski  May 01, 2022 
Printed Page 85
3

When using a line chart to show the normal approximation, you create the heights by taking the difference of two CDF calls. In the first CDF call, you add 0.5 and in the second CDF call you subtract 0.5. I suspect this is because you assume the integer value of x covers the range from x + 0.5 to x - 0.5 and you map this full probability to x. It would be ideal if you clarified the reason behind this. Thank you.
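That is indeed the usual continuity correction: the integer k is treated as covering the interval [k - 0.5, k + 0.5]. A sketch comparing the exact binomial probability with the corrected normal approximation (the parameters here are illustrative):

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def binomial_pmf(k: int, n: int, p: float) -> float:
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p, k = 100, 0.5, 50
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

exact = binomial_pmf(k, n, p)
# Continuity correction: the integer k "covers" the interval [k - 0.5, k + 0.5].
approx = normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)

print(round(exact, 4), round(approx, 4))   # ~0.0796 vs ~0.0797
```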

Mateusz Rakowski  Apr 30, 2022 
PDF Page 33
4, 5, 6

Personal opinion about optimizing part of the Data Science from Scratch, 2nd edition book

Hello,
Thank you very much to the author of the Data Science from Scratch, 2nd edition book for its very useful content.
I have a suggestion for optimizing part of the Randomness section in Chapter 2 (A Crash Course in Python):

For the Randomness section, you could use the NumPy library, since it is widely used in data science. It also makes the topic (choosing one or more elements, with or without replacement) easier to understand, because there is no need for random.sample; numpy.random.choice handles every case. My code is:

# importing the module
from numpy import random

# for choosing 1 element
my_best_friend = random.choice(["Alice", "Bob", "Charlie"])
# for choosing 2 elements with replacement
my_best_friend = random.choice(["Alice", "Bob", "Charlie"], size=2)
# for choosing 2 elements without replacement
my_best_friend = random.choice(["Alice", "Bob", "Charlie"], size=2, replace=False)

RZM  Apr 04, 2022 
Printed Page 41
first 2 code samples

Both of the sample functions are returning sum(total) when they should return sum(xs).

Dylan Kaufman  Jan 25, 2022 
Printed Page 17
2nd and 3rd paragraphs

"source activate" should be replaced with "conda activate". (on 2 lines)

"source deactivate" should be replaced with "conda deactivate"

Jonathan  Oct 31, 2021 
ePub Page 36
middle of page

the code

sorted(num_friends_by_id,
key=lambda(user_id, num_friends): num_friends, reverse = True)

returns a syntax error: "invalid syntax"
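Python 3 removed tuple-parameter unpacking in lambdas (PEP 3113), which is why this Python 2 idiom now raises a SyntaxError. A sketch of a working equivalent, with hypothetical sample data:

```python
# Python 3 dropped tuple unpacking in lambda parameters (PEP 3113),
# so the pair must be indexed instead of unpacked.
num_friends_by_id = [(0, 3), (1, 5), (2, 1)]   # hypothetical (user_id, num_friends) pairs

result = sorted(num_friends_by_id,
                key=lambda id_and_friends: id_and_friends[1],   # sort by num_friends
                reverse=True)

print(result)   # [(1, 5), (0, 3), (2, 1)]
```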


Anonymous  Oct 04, 2021 
Printed Page 63
The whole chapter

I'm working through the examples in the statistics chapter in "Data Science from Scratch, 2nd edition", by Joel Grus, and I am getting the following error:

ModuleNotFoundError: No module named 'scratch'

Where do I get the module “scratch”? I’ve tried updating Anaconda and that didn’t help.

Anonymous  May 10, 2021 
Printed Page 328
Penultimate para.

Link to file download should be:

https://files.grouplens.org/datasets/movielens/ml-100k.zip

not

https://files.group-lens.org/datasets/movielens/ml-100k.zip

Michael Shearer  Mar 12, 2021 
Printed Page 305
Code block

The tags have changed and the page currently lists 137 companies.

I found the following worked:

companies = list({a.text
for a in soup("a")
if "company-name" in a.get("class", ())})
assert len(companies) == 137

Michael Shearer  Mar 07, 2021 
Printed Page 299
Penultimate code block

Should the model description be 'as a word_id' rather than 'as a vector of word_ids' in this particular example?

Michael Shearer  Mar 06, 2021 
Printed Page 287
3rd code block, else comment

‘If the total is 8 or more’ not 7. 7 is dealt with in <=7 case.

Michael Shearer  Mar 06, 2021 
Printed Page 282
1st code block

content = soup.find('div', 'post-radar-content')

See github issue #77

Michael Shearer  Mar 06, 2021 
Printed Page 319
code block at bottom of page

the first line in the `page_rank` function says
```
# Compute how many people each person endorses
outgoing_counts = Counter(target for source, target in endorsements)
```

but this actually counts the number of endorsements that each person receives (exactly like `endorsement_counts` earlier in this section).

the correct counter for # of outgoing endorsements should be `Counter(source for source, target in endorsements)`.
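A tiny demonstration of the difference, using made-up endorsements:

```python
from collections import Counter

# Made-up endorsements as (source, target) pairs.
endorsements = [(0, 1), (0, 2), (1, 2)]

outgoing_counts = Counter(source for source, target in endorsements)     # who endorses
endorsement_counts = Counter(target for source, target in endorsements)  # who is endorsed

print(outgoing_counts)     # Counter({0: 2, 1: 1}): user 0 makes two endorsements
print(endorsement_counts)  # Counter({2: 2, 1: 1}): user 2 receives two
```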

Anji Z  Mar 05, 2021 
Printed Page 245
Linear code

In the comments below is it the o-th neuron or the o-th layer of neurons?


# self.w[o] is the weights for the o-th neuron
self.w = random_tensor(output_dim, input_dim, init=init)

# self.b[o] is the bias term for the o-th neuron
self.b = random_tensor(output_dim, init=init)

Michael Shearer  Feb 28, 2021 
Other Digital Version 5. Statistics

For instance, if you don't mind being angrily accused of experimenting on your users (https://www.nytimes.com/2014/06/30/technology/facebook-tinkers-with-users-emotions-in-news-feed-experiment-stirring-outcry.html?r=0), you could randomly choose a subset of your users and show them content from only a fraction of their friends. If this subset subsequently spent less time on the site, this would give you some confidence that having more friends causes more time to be spent on the site.

I think this is a typesetting mistake. It was not a huge problem for me, just something I found and was curious about.

P.S. I have the Oreilly account so I'm reading the book online.

Anonymous  Feb 18, 2021 
Printed Page 16
Middle of the page

The text says: "Whitespace is ignored inside parentheses and brackets...", which is true, but what is meant is that line breaks are ignored inside parentheses.

Markus Gottwald  Feb 17, 2021 
Printed Page 105
Code Sample

Instead of using the specific sum_of_squares_gradient we could have used the generic estimate_gradient method as

grad = estimate_gradient(sum_of_squares, v, 0.0001)

Michael Shearer  Feb 07, 2021 
Printed Page 198
last paragraph

The beginning of this paragraph talks about testing the null hypothesis "beta_i = 0". However, the subsequent formula and example code all uses "beta_j" / "beta_hat_j". Is this difference in subscript letter deliberate? Do beta_i and beta_j actually mean slightly different things, or is this just a typo? Thanks very much!

Anji Z  Feb 04, 2021 
Printed Page 319, 320
code

for iter in tqdm.trange(num_iters):
    next_pr = {user.id: base_pr for user in users}   # start with base_pr

    for source, target in endorsements:
        # Add damped fraction of source pr to target
        next_pr[target] += damping * pr[source] / outgoing_counts[source]

    pr = next_pr

return pr
=============================================================


The looping does not differ from iteration to iteration.
The value of pr, as determined the first time through, never changes.


Gregory Sherman  Dec 08, 2020 
Printed Page 247
middle

On looking more closely at the github code, there is another similarly named variable.
There is no problem with the text.

Gregory Sherman  Dec 08, 2020 
ePub Page 35
sort function at bottom of page

The lambda expression in this function should be

...key=lambda num_friends_by_id: num_friends_by_id[1]

num_friends_by_id.sort(                              # Sort the list
    key=lambda id_and_friends: id_and_friends[1],    # by num_friends
    reverse=True)



John Kilbourne  Dec 06, 2020 
Printed Page 247
assignment in middle

xor_net = Sequential([
    Linear(input_dim=2, output_dim=2),
    Sigmoid(),
    Linear(input_dim=2, output_dim=1),
    Sigmoid()
])

In the github code, the assignment to "net" is the same except the last call to Sigmoid() is omitted. Which is the correct representation of a neural network for xor?

Gregory Sherman  Dec 04, 2020 
Printed Page 168
parse_iris_row()

The previously reported crash of the program can be avoided by deleting the blank lines at the end of the downloaded data file.

Gregory Sherman  Nov 29, 2020 
Printed Page 156
both code blocks

data = [n for n in range(1000)]

xs = [x for x in range(1000)]


are much faster written as list(range(1000))

Gregory Sherman  Nov 28, 2020 
Printed Page 91
simulation code

import random

extreme_value_count = 0
for _ in range(1000):
    num_heads = sum(1 if random.random() < 0.5 else 0    # Count # of heads
                    for _ in range(1000))                # in 1000 flips,
    if num_heads >= 530 or num_heads <= 470:             # and count how often
        extreme_value_count += 1                         # the # is 'extreme'

# p-value was 0.062 => ~62 extreme values out of 1000
assert 59 < extreme_value_count < 65, f"{extreme_value_count}"



When run under Python 3.8.1, the assertion fails much more often than it succeeds
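That behavior is expected even with correct code: the count of extreme outcomes is itself a binomial variable, and its spread is wider than the asserted window. A quick sketch of the arithmetic:

```python
import math

# Each simulation run is "extreme" with probability ~0.062 (the p-value),
# so the count of extreme runs out of 1000 is itself Binomial(1000, 0.062).
p = 0.062
sigma = math.sqrt(1000 * p * (1 - p))

# The asserted window 60..64 is well under one standard deviation wide,
# so the assertion fails more often than not.
print(round(sigma, 1))   # ~7.6
```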

Gregory Sherman  Nov 25, 2020 
Printed Page 70
first 3 Python statements

The 3 statements can be written more succinctly as:

num_friends_good, daily_minutes_good = num_friends[:], daily_minutes[:]
num_friends_good.remove(100)
daily_minutes_good.pop(num_friends.index(100))

Gregory Sherman  Nov 24, 2020 
Printed Page 63
list

This ambiguity makes Figure 5-1 and the code that produced it difficult to understand.

num_friends = [100, 49, 41, 40, 25,
               # ... and lots more
              ]

There is no description of the list here, and no such list appears in Ch 1. In that small subset of the list, no number is repeated, but such repetition has to be assumed for the histogram and code to make sense, if I have the right meaning of the list.


Gregory Sherman  Nov 24, 2020 
Printed Page 35
last lines

To choose elements with replacement (i.e., allowing duplicates), you can just make multiple calls to random.choice:

four_with_replacement = [random.choice(range(10)) for _ in range(4)]
---------------------------------------------------------------
The code works, but isn't optimal - better as:
four_with_replacement = [random.randrange(10) for _ in range(4)]

An example using random.choice is
[random.choice(list('abcdefghij')) for _ in range(4)]


Gregory Sherman  Nov 22, 2020 
PDF Page 68
2nd paragraph

The paragraph states that East Coast data scientists skew more toward PhD types, but that makes no sense given the explanation, and the table above shows the opposite.

Anonymous  Oct 09, 2020 
Printed Page 22
4th paragraph

"If you leave off the start of the slice, you'll slice from the beginning of the list, and if you leave of the end of the slice,..."

It should be "leave off the end of the slice,..."

Bill Ward  Sep 22, 2020 
Printed Page 26
2.9" from top

# {"Joel": {"City": Seattle"}}
needs another quotation mark in front of Seattle

ColinGT  Apr 16, 2020 
Printed Page 142
2nd line of the 3rd code snippet

The news.cnet.com site is not available.
Is there an alternative site?

Ryoko  Mar 09, 2020 
Printed Page 96
beta_pdf definition code-block

The text defines the beta_pdf like so:

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    if x <= 0 or x >= 1:   # no weight outside [0, 1]
        return 0
    return (x ** (alpha - 1)) * ((1 - x) ** (beta - 1)) / __b(alpha, beta)

The first condition should use '<' and '>' rather than their or-equal-to counterparts: that properly excludes only values outside [0, 1], per the in-line comment. As written, beta_pdf(0, 1, 1) returns 0 when it should return 1, which makes the graph on the same page impossible to reproduce without modifying the pdf definition.
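A sketch of the proposed fix. The B helper here is an assumption, included only so the snippet runs standalone; the book defines its own normalizing function:

```python
import math

def B(alpha: float, beta: float) -> float:
    """Normalizing constant; stands in for the book's helper."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    if x < 0 or x > 1:    # strict inequalities keep the endpoints
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)

print(beta_pdf(0, 1, 1))   # 1.0 for the uniform Beta(1, 1), as the graph requires
```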

Brendan King  Mar 06, 2020 
Printed Page 104
First paragraph, second to last sentence

Book states "...we can estimate derivatives by evaluating the difference quotient for a very small e". I believe the text should be "for a very small h".

Andrew Mathena  Jan 21, 2020 
Printed Page 7
penultimate paragraph

While the phrase "...they share interests in Java and big data" is correct, it would be more complete to also include Hadoop in this summary of shared interests.

Matt S  Jan 15, 2020 
PDF Page 202
code comment of _negative_log_partial_j function

This comment is not necessary.

"Here i is the index of the data point."

Anonymous  Jan 10, 2020 
Printed Page 41
1st paragraph

It's the same error already listed for the ePub, return sum(xs) instead of sum(total); just noting that it's present in the print version too, at a different page number (41 vs. 87).

def total(xs: list) -> float:
    return sum(total)

def total(xs: List[float]) -> float:
    return sum(total)

For anyone looking for first-edition errata: it is neither here nor on Grus's GitHub. Not that this correction would apply to the pre-typing Python 2.7 first edition anyway.

Charles Shoopak  Dec 23, 2019 
Printed Page 182
next-to-last paragraph

This could be due to the SpamAssassin files on the site changing.
When I run the program (as copied from github), I get different numbers
for the "confusion_matrix" counter:

text: "This gives 84 true positives ..., 25 false positives ..., 703 true negatives ...
and 44 false negatives"
My run results in 86 true positives, 40 false positives, 670 true negatives, and 29 false negatives

Gregory Sherman  Dec 21, 2019 
Printed Page 167
code at bottom of page

"with open ('iris.dat', 'w') as f:"
does not match "iris.data" in requests.get() call above and in open() call on next page

Gregory Sherman  Dec 21, 2019 
Printed Page 168
parse_iris_row()

Upon running the code on iris.data (both downloaded from GitHub), the program fails:

Traceback (most recent call last):
  File "k_nearest_neighbors.py", line 150, in <module>
    if __name__ == "__main__": main()
  File "k_nearest_neighbors.py", line 75, in main
    iris_data = [parse_iris_row(row) for row in reader]
  File "k_nearest_neighbors.py", line 75, in <listcomp>
    iris_data = [parse_iris_row(row) for row in reader]
  File "k_nearest_neighbors.py", line 69, in parse_iris_row
    label = row[-1].split("-")[-1]
IndexError: list index out of range

Gregory Sherman  Dec 21, 2019 
Printed Page 137
Figure 10-5

"if stock_price.clo"

To be consistent with previous StockPrice assignment on page 136, it would be better as "if price.clo"

Gregory Sherman  Dec 16, 2019 
Printed Page 127
code and following paragraph

if len(tweets) >= 100:
    self.disconnect()
...
"... disconnects the streamer after it's collected 1000 tweets."

The text at the top of pg 128 also mentions "100 tweets", so "1000" seems to be a typo.

Gregory Sherman  Dec 16, 2019 
Printed Page 112
last sentence of note

"... use chmod x egrep.py++ to make the file executable"
should be the following (or a+x, etc.)
chmod u+x egrep.py

Gregory Sherman   Dec 13, 2019 
Printed Page 107
first line

squared_error = error * 2

The value "squared_error" is not returned from nor used in linear_gradient()

Gregory Sherman   Dec 13, 2019 
Printed Page 86
first sentence

"...if you want to know the probability that (say) a fair coin
turns up more than 60 heads in 100 flips, you can estimate
it as that the probability that a Normal(50, 5) ..."

How is it known that the coin flipping follows (at least approximately) the normal distribution?

Gregory Sherman  Dec 12, 2019 
Printed Page 84
last paragraph

"A Binomial(n,p) random variable is simply the sum of
n independent Bernoulli(p) variables ..."

This is not clear without a definition of a Bernoulli variable

Gregory Sherman  Dec 12, 2019 
Printed Page 64
code at top of page

The comment below should be "# height is just # of people":

ys = [friend_counts[x] for x in xs]   # height is just # of friends
...
plt.ylabel("# of people")

Gregory Sherman  Dec 12, 2019 
Printed Page 62
assertions

assert friend_matrix[0][2] == 1, "0 and 2 are friends"
assert friend_matrix[0][8] == 0, "0 and 8 are not friends"

The strings are incorrect, as 1 indicates friendship,
so the AssertionError messages should be "are not" and "are", respectively.

Gregory Sherman  Dec 12, 2019 
Printed Page 50
plt.annotate() call

plt.annotate() call results in
"ValueError: offset_points is not a recognized coordinate"
There is no underscore in "textcoords='offset points'"
Even if the assignment is removed, the same error occurs.

Gregory Sherman  Dec 12, 2019 
Printed Page 85
last sentence

"make_hist" should be "binomial_histogram"

Gregory Sherman  Dec 12, 2019 
Printed Page 84
paragraph at top

"The mean of a Bernoulli variable is p, and its standard deviation
sqrt(p(1 - p)). The central limit theorem says that as n gets large,
a Binomial(n,p) variable is approximately a normal random variable
with mean mu = np and standard deviation sigma = sqrt(np(1 - p))."

However, at the bottom of pg 92, with n presumably 1000 (the # of flips):

"... the central limit theorem tells us that the average of those
Bernoulli variables should be approximately normal, with mean p
and standard deviation math.sqrt(p * (1 - p) / 1000)"

pg 92, 93 assignments:
"sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)"

------------------------------------------------------------
I can't reconcile the text's initial description of using the
central limit theorem to determine the mean & standard deviation
with the later wording and assignment statements
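The two descriptions are in fact consistent: pages 92-93 talk about the *average* of the n Bernoulli variables rather than their *sum*, which divides both the mean and the standard deviation by n. A quick check of the arithmetic:

```python
import math

n, p = 1000, 0.5

# The sum of n Bernoulli(p) variables (the Binomial count of heads):
sum_mu, sum_sigma = n * p, math.sqrt(n * p * (1 - p))

# The average of the same variables: both mean and sigma divide by n.
avg_mu, avg_sigma = sum_mu / n, sum_sigma / n

# avg_sigma equals sqrt(p * (1 - p) / n), matching the pg 92-93 assignments.
assert abs(avg_sigma - math.sqrt(p * (1 - p) / n)) < 1e-12

print(sum_mu, round(sum_sigma, 2))    # 500.0 15.81
print(avg_mu, round(avg_sigma, 4))    # 0.5 0.0158
```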



Gregory Sherman  Dec 12, 2019