Errata for Data Science from Scratch

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version Location Description Submitted by Date Submitted
Printed Page 168
iris_data = [parse_iris_row(row) for row in reader]

iris_data = [parse_iris_row(row) for row in reader]
... should be...
iris_data = [parse_iris_row(row) for row in reader if row]

Adding "if row" skips the empty rows that exist at the end of the iris data file.
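A minimal sketch of the proposed guard; parse_iris_row here is a hypothetical stand-in for the book's parser, and the two data rows are illustrative:

```python
import csv
import io

def parse_iris_row(row):
    """Hypothetical stand-in for the book's parser: measurements plus a label."""
    return [float(x) for x in row[:-1]], row[-1].split("-")[-1]

# Simulate a data file with a trailing blank line, as in iris.data.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n\n"
reader = csv.reader(io.StringIO(raw))

iris_data = [parse_iris_row(row) for row in reader if row]   # "if row" skips blanks

print(len(iris_data))   # 2: the blank final line is ignored
```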

Karl Wilson  Apr 20, 2024 
Printed Page p. 107
both at top and bottom

I submitted a "numerical overflow" issue and said I fixed it by unitizing the gradient average. Never mind: much later I found that I had a typo in vector_mean. Fixing that resolves everything. Sorry.

David Barton Cooke  Mar 05, 2024 
Printed Page p. 107
both at top and bottom

When running the code at the bottom of the page, trying to find the slope and intercept of the line, the code does as asked for [-14, 14] input range, but I get numeric overflow for [-15, 15] (and larger).

I can fix that by unitizing grad in the linear gradient function (top of page), so that I have a direction, but not a magnitude. I'm not sure that's quite kosher.
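The "unitizing" workaround the submitter describes can be sketched like this. linear_gradient follows the book's setup for fitting y = 20x + 5; the unit() helper, the data range, and the hyperparameters are illustrative assumptions:

```python
import math
import random

def linear_gradient(x: float, y: float, theta):
    """Gradient of the squared error for one data point."""
    slope, intercept = theta
    predicted = slope * x + intercept
    error = predicted - y
    return [2 * error * x, 2 * error]

def unit(v):
    """Scale v to magnitude 1 (a direction, not a magnitude), guarding against zero."""
    magnitude = math.sqrt(sum(v_i ** 2 for v_i in v)) or 1.0
    return [v_i / magnitude for v_i in v]

random.seed(0)
inputs = [(x, 20 * x + 5) for x in range(-15, 16)]   # the range that overflowed
theta = [random.uniform(-1, 1), random.uniform(-1, 1)]
learning_rate = 0.1

for _ in range(5000):
    # Unitize each point's gradient before averaging, so step sizes stay bounded.
    grads = [unit(linear_gradient(x, y, theta)) for x, y in inputs]
    mean_grad = [sum(g[i] for g in grads) / len(grads) for i in range(2)]
    theta = [theta[i] - learning_rate * mean_grad[i] for i in range(2)]

# theta ends near the true slope 20 and intercept 5, with no overflow.
```

Because each per-point step is capped at the learning rate, nothing can blow up; the trade-off is that the final answer oscillates within roughly one learning-rate of the optimum.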

Dave Cooke  Mar 01, 2024 
Printed Page 86
For Further Information second bullet point

Link to Introduction to Probability is broken. New link is ~prob/prob/prob.pdf

Jamie Mellway  Aug 19, 2023 
Other Digital Version Section 11
Code for normal_pdf function

While the code for the return statement is accurate, it is slightly difficult to follow because it is structured differently from the equation given above it. It could be made easier to comprehend by utilizing parentheses like
math.exp((-(x-mu)**2/2)/sigma**2)/(SQRT_TWO_PI*sigma)
or
math.exp(-(x-mu)**2/(2*sigma**2))/(SQRT_TWO_PI*sigma)

This would make it easier to understand the order of operations.
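For reference, a self-contained sketch of normal_pdf using the second parenthesization. Note the denominator must be SQRT_TWO_PI * sigma (that is, sigma times the square root of 2 pi), matching the book's return statement:

```python
import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    # Parenthesized to mirror the written formula:
    #   f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)

print(round(normal_pdf(0), 4))   # standard normal density at 0: 0.3989
```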

Neelakantan  Nov 24, 2022 
PDF Page Chapter 7. Hypothesis and Inference - Example: Flipping a Coin
the first paragraph before the next section; p-Values

"Imagine instead that our null hypothesis was that the coin is not biased
toward heads, or that p ≤ 0. 5. In that case we want a one-sided test that
rejects the null hypothesis when X is much larger than 500 but not when X
is smaller than 500. So, a 5% significance test involves using
normal_probability_below to find the cutoff below which 95% of the
probability lies: ...."

In the first line should not "our null hypothesis" be replaced with "our alternative hypothesis"? Because the null hypothesis is always the default one and the alternative hypothesis is the scenario we may change and test to evaluate its compliance with the real world.

Milad N Rahbar  Nov 18, 2022 
Printed Page 95
4th paragraph

For the first example of the A/B testing code ["tastes great" 200 clicks, "less bias" 180 clicks] the books says: "The probability of seeing such a large difference if the means were actually equal..."
Isn't "large" a misleading quantifier in this case, as the difference is not significant?

Steffen  Jun 04, 2022 
Printed Page 144
1st

The standard deviation calculated is the sample standard deviation, not the population standard deviation. In this example, you never mention that the vectors used in the calculation are part of a sample and not an entire population. In the text, you also don't specify which you intend to calculate.
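The distinction can be illustrated with the standard library; the data here is made up:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample

pop_sd = statistics.pstdev(data)    # population: divides by n
sample_sd = statistics.stdev(data)  # sample: divides by n - 1 (Bessel's correction)

print(pop_sd)               # 2.0
print(round(sample_sd, 3))  # 2.138
```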

Mateusz Rakowski  May 22, 2022 
PDF Page 157
End of 9th paragraph, last sentence

I think that in this sentence, "(Of course the model that performed best on the test set is going to perform well on the test set)", the second "test set" should be "training set". It makes no sense with two test sets.

Anonymous  May 02, 2022 
Printed Page 89
3

In your normal_two_sided_bounds function, defining the tail_probability as (1-probability)/2 makes the upper_bound < lower_bound in your return result which then feeds into an incorrect answer on page 90 regarding the result of power = 1 - type_2_probability. To produce the correct answer, you should subtract the tail_probability from 1, and use this value instead of tail_probability inside the calls to normal_lower_bound and normal_upper_bound. The use of an assert statement would have been perfect to validate your answer on page 90 which would have caught the bug on page 89.

Mateusz Rakowski  May 01, 2022 
Printed Page 85
3

When using a line chart to show the normal approximation, you create the heights by taking the difference of two CDF calls. In the first CDF call, you add 0.5 and in the second CDF call you subtract 0.5. I suspect this is because you assume the integer value of x covers the range from x + 0.5 to x - 0.5 and you map this full probability to x. It would be ideal if you clarified the reason behind this. Thank you.
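That is indeed the usual continuity correction: the integer k is treated as covering the interval [k - 0.5, k + 0.5]. A sketch comparing the exact binomial probability with the corrected normal approximation (the parameters here are illustrative):

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def binomial_pmf(k: int, n: int, p: float) -> float:
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p, k = 100, 0.5, 50
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

exact = binomial_pmf(k, n, p)
# Continuity correction: the integer k "covers" the interval [k - 0.5, k + 0.5].
approx = normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)

print(round(exact, 4), round(approx, 4))   # ~0.0796 vs ~0.0797
```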

Mateusz Rakowski  Apr 30, 2022 
PDF Page 33
4, 5, 6

Personal opinion about optimizing part of the Data Science from Scratch, 2nd edition book

Hello,
Thank you very much to the author of the Data Science from Scratch, 2nd edition book for its very useful content.
I have a suggestion for optimizing part of the Randomness section in Chapter 2 (A Crash Course in Python):

For the Randomness section, you could use the NumPy library, since it is widely used in data science. It also makes the topic (choosing one or more elements, with or without replacement) easier to understand, because there is no need for random.sample; numpy.random.choice handles every case. My code is:

# importing the module
from numpy import random

# for choosing 1 element
my_best_friend = random.choice(["Alice", "Bob", "Charlie"])
# for choosing 2 elements with replacement
my_best_friend = random.choice(["Alice", "Bob", "Charlie"], size=2)
# for choosing 2 elements without replacement
my_best_friend = random.choice(["Alice", "Bob", "Charlie"], size=2, replace=False)

RZM  Apr 04, 2022 
Printed Page 41
first 2 code samples

Both of the sample functions are returning sum(total) when they should return sum(xs).

Dylan Kaufman  Jan 25, 2022 
Printed Page 17
2nd and 3rd paragraphs

"source activate" should be replaced with "conda activate". (on 2 lines)

"source deactivate" should be replaced with "conda deactivate"

Jonathan  Oct 31, 2021 
ePub Page 36
middle of page

the code

sorted(num_friends_by_id,
key=lambda(user_id, num_friends): num_friends, reverse = True)

returns a syntax error: "invalid syntax"
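Python 3 removed tuple-parameter unpacking in lambdas (PEP 3113), which is why this Python 2 idiom now raises a SyntaxError. A sketch of a working equivalent, with hypothetical sample data:

```python
# Python 3 dropped tuple unpacking in lambda parameters (PEP 3113),
# so the pair must be indexed instead of unpacked.
num_friends_by_id = [(0, 3), (1, 5), (2, 1)]   # hypothetical (user_id, num_friends) pairs

result = sorted(num_friends_by_id,
                key=lambda id_and_friends: id_and_friends[1],   # sort by num_friends
                reverse=True)

print(result)   # [(1, 5), (0, 3), (2, 1)]
```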


Anonymous  Oct 04, 2021 
Printed Page 63
The whole chapter

I'm working through the examples in the statistics chapter in "Data Science from Scratch, 2nd edition", by Joel Grus, and I am getting the following error:

ModuleNotFoundError: No module named 'scratch'

Where do I get the module “scratch”? I’ve tried updating Anaconda and that didn’t help.

Anonymous  May 10, 2021 
Printed Page 328
Penultimate para.

Link to file download should be:

https://files.grouplens.org/datasets/movielens/ml-100k.zip

not

https://files.group-lens.org/datasets/movielens/ml-100k.zip

Michael Shearer  Mar 12, 2021 
Printed Page 305
Code block

The tags have changed and the page currently lists 137 companies.

I found the following worked:

companies = list({a.text
for a in soup("a")
if "company-name" in a.get("class", ())})
assert len(companies) == 137

Michael Shearer  Mar 07, 2021 
Printed Page 299
Penultimate code block

Should the model description be 'as a word_id' rather than 'as a vector of word_ids' in this particular example?

Michael Shearer  Mar 06, 2021 
Printed Page 287
3rd code block, else comment

‘If the total is 8 or more’ not 7. 7 is dealt with in <=7 case.

Michael Shearer  Mar 06, 2021 
Printed Page 282
1st code block

content = soup.find('div', 'post-radar-content')

See github issue #77

Michael Shearer  Mar 06, 2021 
Printed Page 319
code block at bottom of page

the first line in the `page_rank` function says
```
# Compute how many people each person endorses
outgoing_counts = Counter(target for source, target in endorsements)
```

but this actually counts the number of endorsements that each person receives (exactly like `endorsement_counts` earlier in this section).

the correct counter for # of outgoing endorsements should be `Counter(source for source, target in endorsements)`.
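A tiny demonstration of the difference, using made-up endorsements:

```python
from collections import Counter

# Made-up endorsements as (source, target) pairs.
endorsements = [(0, 1), (0, 2), (1, 2)]

outgoing_counts = Counter(source for source, target in endorsements)     # who endorses
endorsement_counts = Counter(target for source, target in endorsements)  # who is endorsed

print(outgoing_counts)     # Counter({0: 2, 1: 1}): user 0 makes two endorsements
print(endorsement_counts)  # Counter({2: 2, 1: 1}): user 2 receives two
```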

Anji Z  Mar 05, 2021 
Printed Page 245
Linear code

In the comments below is it the o-th neuron or the o-th layer of neurons?


# self.w[o] is the weights for the o-th neuron
self.w = random_tensor(output_dim, input_dim, init=init)

# self.b[o] is the bias term for the o-th neuron
self.b = random_tensor(output_dim, init=init)

Michael Shearer  Feb 28, 2021 
Other Digital Version 5. Statistics

For instance, if you don't mind being angrily accused of experimenting on your users (https://www.nytimes.com/2014/06/30/technology/facebook-tinkers-with-users-emotions-in-news-feed-experiment-stirring-outcry.html?r=0), you could randomly choose a subset of your users and show them content from only a fraction of their friends. If this subset subsequently spent less time on the site, this would give you some confidence that having more friends causes more time to be spent on the site.

I think this is a typesetting mistake. It was not a huge problem for me, just something I found and was curious about.

P.S. I have the Oreilly account so I'm reading the book online.

Anonymous  Feb 18, 2021 
Printed Page 16
Middle of the page

The text says: "Whitespace is ignored inside parentheses and brackets...", which is true, but what is meant is that line breaks are ignored inside parentheses.

Markus Gottwald  Feb 17, 2021 
Printed Page 105
Code Sample

Instead of using the specific sum_of_squares_gradient we could have used the generic estimate_gradient method as

grad = estimate_gradient(sum_of_squares, v, 0.0001)

Michael Shearer  Feb 07, 2021 
Printed Page 198
last paragraph

The beginning of this paragraph talks about testing the null hypothesis "beta_i = 0". However, the subsequent formula and example code all uses "beta_j" / "beta_hat_j". Is this difference in subscript letter deliberate? Do beta_i and beta_j actually mean slightly different things, or is this just a typo? Thanks very much!

Anji Z  Feb 04, 2021 
Printed Page 319, 320
code

for iter in tqdm.trange(num_iters):
    next_pr = {user.id: base_pr for user in users}   # start with base_pr

    for source, target in endorsements:
        # Add damped fraction of source pr to target
        next_pr[target] += damping * pr[source] / outgoing_counts[source]

    pr = next_pr

return pr
=============================================================


The looping does not differ from iteration to iteration.
The value of pr, as determined the first time through, never changes.


Gregory Sherman  Dec 08, 2020 
Printed Page 247
middle

On looking more closely at the github code, there is another similarly named variable.
There is no problem with the text.

Gregory Sherman  Dec 08, 2020 
ePub Page 35
sort function at bottom of page

The lambda expression in this function should be

...key=lambda num_friends_by_id: num_friends_by_id[1]

num_friends_by_id.sort(                              # Sort the list
    key=lambda id_and_friends: id_and_friends[1],    # by num_friends
    reverse=True)



John Kilbourne  Dec 06, 2020 
Printed Page 247
assignment in middle

xor_net = Sequential([
    Linear(input_dim=2, output_dim=2),
    Sigmoid(),
    Linear(input_dim=2, output_dim=1),
    Sigmoid()
])

In the github code, the assignment to "net" is the same except the last call to Sigmoid() is omitted. Which is the correct representation of a neural network for xor?

Gregory Sherman  Dec 04, 2020 
Printed Page 168
parse_iris_row()

The previously reported crash of the program can be avoided by deleting the blank lines at the end of the downloaded data file.

Gregory Sherman  Nov 29, 2020 
Printed Page 156
both code blocks

data = [n for n in range(1000)]

xs = [x for x in range(1000)]


are much faster written as list(range(1000))

Gregory Sherman  Nov 28, 2020 
Printed Page 91
simulation code

import random

extreme_value_count = 0
for _ in range(1000):
    num_heads = sum(1 if random.random() < 0.5 else 0    # Count # of heads
                    for _ in range(1000))                # in 1000 flips,
    if num_heads >= 530 or num_heads <= 470:             # and count how often
        extreme_value_count += 1                         # the # is 'extreme'

# p-value was 0.062 => ~62 extreme values out of 1000
assert 59 < extreme_value_count < 65, f"{extreme_value_count}"



When run under Python 3.8.1, the assertion fails much more often than it succeeds
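That behavior is expected even with correct code: the count of extreme outcomes is itself a binomial variable, and its spread is wider than the asserted window. A quick sketch of the arithmetic:

```python
import math

# Each simulation run is "extreme" with probability ~0.062 (the p-value),
# so the count of extreme runs out of 1000 is itself Binomial(1000, 0.062).
p = 0.062
sigma = math.sqrt(1000 * p * (1 - p))

# The asserted window 60..64 is well under one standard deviation wide,
# so the assertion fails more often than not.
print(round(sigma, 1))   # ~7.6
```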

Gregory Sherman  Nov 25, 2020 
Printed Page 70
first 3 Python statements

The 3 statements can be written more succinctly as:

num_friends_good, daily_minutes_good = num_friends[:], daily_minutes[:]
num_friends_good.remove(100)
daily_minutes_good.pop(num_friends.index(100))

Gregory Sherman  Nov 24, 2020 
Printed Page 63
list

This ambiguity makes Figure 5-1 and the code that produced it difficult to understand.

num_friends = [100, 49, 41, 40, 25,
               # ... and lots more
              ]

There is no description of the list here, and no such list appears in Ch 1. In that small subset of the list, no number is repeated, but such repetition has to be assumed for the histogram and code to make sense, if I have the right meaning of the list.


Gregory Sherman  Nov 24, 2020 
Printed Page 35
last lines

To choose elements with replacement (i.e., allowing duplicates), you can just make multiple calls to random.choice:

four_with_replacement = [random.choice(range(10)) for _ in range(4)]
---------------------------------------------------------------
The code works, but isn't optimal - better as:
four_with_replacement = [random.randrange(10) for _ in range(4)]

An example using random.choice is
[random.choice(list('abcdefghij')) for _ in range(4)]


Gregory Sherman  Nov 22, 2020 
PDF Page 68
2nd paragraph

The paragraph states that East Coast data scientists skew more toward PhD types, but that makes no sense given the explanation, and the table above shows the opposite.

Anonymous  Oct 09, 2020 
Printed Page 22
4th paragraph

"If you leave off the start of the slice, you'll slice from the beginning of the list, and if you leave of the end of the slice,..."

It should be "leave off the end of the slice,..."

Bill Ward  Sep 22, 2020 
Printed Page 26
2.9" from top

# {"Joel": {"City": Seattle"}}
needs another quotation mark in front of Seattle

ColinGT  Apr 16, 2020 
Printed Page 142
2nd line of the 3rd code snippet

The news.cnet.com site is not available.
Is there an alternative site?

Ryoko  Mar 09, 2020 
Printed Page 96
beta_pdf definition code-block

The text defines the beta_pdf like so:

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    if x <= 0 or x >= 1:   # no weight outside [0, 1]
        return 0
    return (x ** (alpha - 1)) * ((1 - x) ** (beta - 1)) / __b(alpha, beta)

The first condition should use '<' and '>' rather than their or-equal-to counterparts: that properly excludes only values outside [0, 1], per the in-line comment. As written, beta_pdf(0, 1, 1) returns 0 when it should return 1, which makes the graph on the same page impossible to reproduce without modifying the pdf definition.
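A sketch of the proposed fix. The B helper here is an assumption, included only so the snippet runs standalone; the book defines its own normalizing function:

```python
import math

def B(alpha: float, beta: float) -> float:
    """Normalizing constant; stands in for the book's helper."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    if x < 0 or x > 1:    # strict inequalities keep the endpoints
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)

print(beta_pdf(0, 1, 1))   # 1.0 for the uniform Beta(1, 1), as the graph requires
```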

Brendan King  Mar 06, 2020 
Printed Page 104
First paragraph, second to last sentence

Book states "...we can estimate derivatives by evaluating the difference quotient for a very small e". I believe the text should be "for a very small h".

Andrew Mathena  Jan 21, 2020 
Printed Page 7
penultimate paragraph

While the phrase "...they share interests in Java and big data" is correct, it would be more complete to also include Hadoop in this summary of shared interests.

Matt S  Jan 15, 2020 
PDF Page 202
code comment of _negative_log_partial_j function

This comment is not necessary.

"Here i is the index of the data point."

Anonymous  Jan 10, 2020 
Printed Page 41
1st paragraph

It's the same error already listed for the ePub, return sum(xs) instead of sum(total); just noting that it's present in the print version too, at a different page number (41 vs. 87).

def total(xs: list) -> float:
    return sum(total)

def total(xs: List[float]) -> float:
    return sum(total)

For anyone looking for first-edition errata: it is neither here nor on Grus's GitHub. Not that this correction would apply to the pre-typing Python 2.7 first edition anyway.

Charles Shoopak  Dec 23, 2019 
Printed Page 182
next-to-last paragraph

This could be due to the SpamAssassin files on the site changing.
When I run the program (as copied from github), I get different numbers
for the "confusion_matrix" counter:

text: "This gives 84 true positives ..., 25 false positives ..., 703 true negatives ...
and 44 false negatives"
My run results in 86 true positives, 40 false positives, 670 true negatives, and 29 false negatives

Gregory Sherman  Dec 21, 2019 
Printed Page 167
code at bottom of page

"with open ('iris.dat', 'w') as f:"
does not match "iris.data" in requests.get() call above and in open() call on next page

Gregory Sherman  Dec 21, 2019 
Printed Page 168
parse_iris_row()

Upon running the code on iris.data (both downloaded from GitHub), the program fails:

Traceback (most recent call last):
  File "k_nearest_neighbors.py", line 150, in <module>
    if __name__ == "__main__": main()
  File "k_nearest_neighbors.py", line 75, in main
    iris_data = [parse_iris_row(row) for row in reader]
  File "k_nearest_neighbors.py", line 75, in <listcomp>
    iris_data = [parse_iris_row(row) for row in reader]
  File "k_nearest_neighbors.py", line 69, in parse_iris_row
    label = row[-1].split("-")[-1]
IndexError: list index out of range

Gregory Sherman  Dec 21, 2019 
Printed Page 137
Figure 10-5

"if stock_price.clo"

To be consistent with previous StockPrice assignment on page 136, it would be better as "if price.clo"

Gregory Sherman  Dec 16, 2019 
Printed Page 127
code and following paragraph

if len(tweets) >= 100:
    self.disconnect()
...
"... disconnects the streamer after it's collected 1000 tweets."

The text at the top of pg 128 also mentions "100 tweets", so "1000" seems to be a typo.

Gregory Sherman  Dec 16, 2019 
Printed Page 112
last sentence of note

"... use chmod x egrep.py++ to make the file executable"
should be the following (or a+x, etc.)
chmod u+x egrep.py

Gregory Sherman   Dec 13, 2019 
Printed Page 107
first line

squared_error = error * 2

The value "squared_error" is not returned from nor used in linear_gradient()

Gregory Sherman   Dec 13, 2019 
Printed Page 86
first sentence

"...if you want to know the probability that (say) a fair coin
turns up more than 60 heads in 100 flips, you can estimate
it as that the probability that a Normal(50, 5) ..."

How is it known that the coin flipping follows (at least approximately) the normal distribution?

Gregory Sherman  Dec 12, 2019 
Printed Page 84
last paragraph

"A Binomial(n,p) random variable is simply the sum of
n independent Bernoulli(p) variables ..."

This is not clear without a definition of a Bernoulli variable

Gregory Sherman  Dec 12, 2019 
Printed Page 64
code at top of page

The comment below should be "# height is just # of people":

ys = [friend_counts[x] for x in xs]   # height is just # of friends
...
plt.ylabel("# of people")

Gregory Sherman  Dec 12, 2019 
Printed Page 62
assertions

assert friend_matrix[0][2] == 1, "0 and 2 are friends"
assert friend_matrix[0][8] == 0, "0 and 8 are not friends"

The strings are incorrect, as 1 indicates friendship,
so the AssertionError messages should be "are not" and "are", respectively.

Gregory Sherman  Dec 12, 2019 
Printed Page 50
plt.annotate() call

plt.annotate() call results in
"ValueError: offset_points is not a recognized coordinate"
There is no underscore in "textcoords='offset points'"
Even if the assignment is removed, the same error occurs.

Gregory Sherman  Dec 12, 2019 
Printed Page 85
last sentence

"make_hist" should be "binomial_histogram"

Gregory Sherman  Dec 12, 2019 
Printed Page 84
paragraph at top

"The mean of a Bernoulli variable is p, and its standard deviation
sqrt(p(1 - p)). The central limit theorem says that as n gets large,
a Binomial(n,p) variable is approximately a normal random variable
with mean mu = np and standard deviation sigma = sqrt(np(1 - p))."

However, at the bottom of pg 92, with n presumably 1000 (the # of flips):

"... the central limit theorem tells us that the average of those
Bernoulli variables should be approximately normal, with mean p
and standard deviation math.sqrt(p * (1 - p) / 1000)"

pg 92, 93 assignments:
"sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)"

------------------------------------------------------------
I can't reconcile the text's initial description of using the
central limit theorem to determine the mean & standard deviation
with the later wording and assignment statements
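The two descriptions are in fact consistent: pages 92-93 talk about the *average* of the n Bernoulli variables rather than their *sum*, which divides both the mean and the standard deviation by n. A quick check of the arithmetic:

```python
import math

n, p = 1000, 0.5

# The sum of n Bernoulli(p) variables (the Binomial count of heads):
sum_mu, sum_sigma = n * p, math.sqrt(n * p * (1 - p))

# The average of the same variables: both mean and sigma divide by n.
avg_mu, avg_sigma = sum_mu / n, sum_sigma / n

# avg_sigma equals sqrt(p * (1 - p) / n), matching the pg 92-93 assignments.
assert abs(avg_sigma - math.sqrt(p * (1 - p) / n)) < 1e-12

print(sum_mu, round(sum_sigma, 2))    # 500.0 15.81
print(avg_mu, round(avg_sigma, 4))    # 0.5 0.0158
```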



Gregory Sherman  Dec 12, 2019