Errata for Data Science from Scratch

The errata list records errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Printed
Page 4
Block of code below second paragraph

The hashed comments are incorrect. They read:

# add i as a friend of j
# add j as a friend of i

They should read:

# add j as a friend of i
# add i as a friend of j

Note from the Author or Editor:
agreed, those two comments should be switched

James Whitehead  Jan 15, 2016  Mar 10, 2017
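
The corrected comment order from the page 4 entry above can be checked with a small sketch. The three users and two friendships here are hypothetical, and friends are stored as ids rather than as user dicts, purely to keep the result easy to inspect.

```python
# Hypothetical miniature network; names and edges are illustrative only.
users = [
    {"id": 0, "name": "A", "friends": []},
    {"id": 1, "name": "B", "friends": []},
    {"id": 2, "name": "C", "friends": []},
]
friendships = [(0, 1), (1, 2)]

for i, j in friendships:
    users[i]["friends"].append(j)  # add j as a friend of i
    users[j]["friends"].append(i)  # add i as a friend of j

# Map each user id to the ids of that user's friends.
friends_by_id = {user["id"]: user["friends"] for user in users}
```

Each comment now describes the list that its own line modifies: the first line appends to the friend list of i, the second to the friend list of j.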
Printed
Page 7
3rd paragraph

The statement "For example, Thor (id 4) has no friends in common with Devin (id 7) , . . ." is incorrect. They share the friend Clive (id 5).

Note from the Author or Editor:
good point, change that sentence to

For example, Hero (id 0) has no friends in common with Klein (id 9), but they share interests in Java and big data.

Stephen N. Cole  May 17, 2015  Mar 10, 2017
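
A claim like the one in the page 7 entry above can be verified mechanically. The sketch below is not the book's code: the edge list is hypothetical and contains only what the erratum states (ids 4 and 7 are both friends with id 5) plus two made-up edges for ids 0 and 9.

```python
from collections import defaultdict

# Hypothetical edges, chosen only to be consistent with the erratum text.
friendships = [(4, 5), (5, 7), (0, 1), (8, 9)]

# Build an undirected adjacency map: user id -> set of friend ids.
friends = defaultdict(set)
for i, j in friendships:
    friends[i].add(j)
    friends[j].add(i)

def friends_in_common(a, b):
    """Return the set of ids that are friends with both a and b."""
    return friends[a] & friends[b]

thor_devin = friends_in_common(4, 7)   # share id 5 in this sketch
hero_klein = friends_in_common(0, 9)   # share nobody in this sketch
```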
Printed
Page 40
Code

In the histogram code, please add the following to resolve Counter():

from collections import Counter

Note from the Author or Editor:
I don't care that much either way, I sort of assumed importing Counter was implied, but I don't mind adding a

from collections import Counter

to the start of the example

_j_j  Jun 01, 2016  Mar 10, 2017
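
For the page 40 entry above, a minimal sketch of a histogram built with `Counter`, with the import made explicit. The grades and the `decile` helper are illustrative assumptions, not necessarily the book's exact example.

```python
from collections import Counter

# Illustrative data, not taken from the book.
grades = [83, 95, 91, 87, 70, 0]

def decile(grade):
    """Round a grade down to the nearest multiple of 10."""
    return grade // 10 * 10

# Count how many grades fall in each bucket of width 10.
histogram = Counter(decile(grade) for grade in grades)
```

Without the import, the last line fails with a `NameError`, which is the problem the erratum describes.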
Printed
Page 52
Top of page

The following sentence at the top of the page:

"The dot product measures how far the vector v extends in the w direction."

is usually false, but can be true if w is a unit vector. Alternatively, the following correction would make the statement true:

"The dot product of v and w, divided by the magnitude of w, measures how far the vector v extends in the w direction."

Or alternatively:

"Given two vectors w and v, if w is a unit vector, then the dot product measures how far the vector v extends in the w direction."

Note from the Author or Editor:
yeah, this is a fair criticism.

I would simply change the first sentence on the page to

If _w_ has magnitude 1, the dot product measures how far the vector _v_ extends in the _w_ direction.

Matt Goldwasser  Feb 10, 2016  Mar 10, 2017
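
The distinction in the page 52 entry above can be seen numerically. This is a sketch with hypothetical vectors; `extent_along` is a helper name introduced here for illustration only.

```python
import math

def dot(v, w):
    """Sum of componentwise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def magnitude(v):
    return math.sqrt(dot(v, v))

def extent_along(v, w):
    """How far v extends in the w direction, for any nonzero w."""
    return dot(v, w) / magnitude(w)

v = [3, 4]
w = [2, 0]        # points along the x-axis, but has magnitude 2
unit_w = [1, 0]   # same direction, magnitude 1

raw_dot = dot(v, w)              # 6: inflated by the length of w
extent = extent_along(v, w)      # 3.0: the x-extent of v
unit_dot = dot(v, unit_w)        # 3: matches the extent because |unit_w| = 1
```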
Printed
Page 67
2nd paragraph

The list x contains second value 1, whereas it should contain second value -1.

Note from the Author or Editor:
yes, this is a mistake, x should be [-2, -1, 0, 1, 2]

Zach Landes  Jul 05, 2015  Mar 10, 2017
Printed, PDF
Page 69
Paragraph #2

In the first sentence of paragraph #2, it says: "For our purposes you should think of probability as a way of quantifying the uncertainty associated with events chosen from a some universe of events." ('universe' is italicized)

Does 'a some universe' include an extra word, or does it have a special meaning in this context?

Note from the Author or Editor:
the "a" should not be there, it should just say

"chosen from some universe"

Anonymous  Dec 14, 2016  Mar 10, 2017
Printed
Page 75
second paragraph

The sentence "It has the distribution function:" would be improved by substituting "probability density" in place of "distribution". In the preceding section the author introduced the "probability density function" and the "cumulative distribution function". Given that context the reader might incorrectly infer that the equation following the second paragraph is the Normal cumulative distribution function.

Note from the Author or Editor:
agree, should change to

It has the probability density function:

Stephen N. Cole  Jan 01, 2016  Mar 10, 2017
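
To illustrate why the wording in the page 75 entry above matters, here is a sketch of the two functions side by side. The function names and default parameters are assumptions made for this example; the formulas are the standard normal density and cumulative distribution.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density function of the normal distribution."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of the normal distribution."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

density_at_mean = normal_pdf(0.0)     # height of the bell curve at its peak
cumulative_at_mean = normal_cdf(0.0)  # probability of a value at or below the mean
```

The two values differ at the same point, so calling the density "the distribution function" invites exactly the confusion the erratum describes.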
Printed
Page 78
the function inverse_normal_cdf at the top of the page

Values are assigned to low_p and hi_p, but these are never used. Statements that refer to low_p and hi_p should be simplified.

Note from the Author or Editor:
agree with this, revised version at

https://gist.github.com/joelgrus/71c1ba8f96b6422a12adf10d04783512

Stephen N. Cole  Mar 27, 2016  Mar 10, 2017
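
For the page 78 entry above, one way the simplification could look is sketched below. This is an assumption about the shape of the fix, not a copy of the linked gist: the binary search tracks only the z bounds, so `low_p` and `hi_p` disappear entirely.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p, mu=0.0, sigma=1.0, tolerance=0.00001):
    """Find an approximate inverse of normal_cdf using binary search."""
    # If not standard, compute the standard value and rescale.
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)

    low_z = -10.0   # normal_cdf(-10) is very close to 0
    hi_z = 10.0     # normal_cdf(10) is very close to 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2   # consider the midpoint
        mid_p = normal_cdf(mid_z)    # and the cdf value there
        if mid_p < p:
            low_z = mid_z            # midpoint too low, search above it
        else:
            hi_z = mid_z             # midpoint too high, search below it
    return (low_z + hi_z) / 2

median_z = inverse_normal_cdf(0.5)
upper_z = inverse_normal_cdf(0.975)
```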
PDF
Page 83
1st paragraph

In:
" X should be distributed approximately normally with mean 50 and
standard deviation 15.8:"

mean should be 500 not 50

Note from the Author or Editor:
confirmed, the mean should be 500

Luis Miguel Soares  Jun 24, 2015  Mar 10, 2017
Printed
Page 83
last line

both 50 should be 500

Note from the Author or Editor:
agreed, change both 50 to 500

Dong Zhou  Apr 16, 2016  Mar 10, 2017
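
The corrected mean in the two page 83 entries above is consistent with the stated standard deviation. As a sketch, assuming the passage concerns a binomial variable with n = 1000 trials and p = 0.5 (values inferred here from the numbers 500 and 15.8, not quoted from the errata):

```python
import math

def normal_approximation_to_binomial(n, p):
    """Return mu and sigma of the normal approximation to Binomial(n, p)."""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

# Assumed parameters; see the note above.
mu, sigma = normal_approximation_to_binomial(1000, 0.5)
```

A mean of 50 would require a different n or p, and neither choice reproduces a standard deviation of 15.8 together with it in the fair-coin case.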
Printed
Page 84
2nd/3rd paragraph

The section title between the 2nd and 3rd paragraphs is missing. It should read "p-values". Instead, the title appears at the end of the 2nd paragraph, preceded by its markup: ===

Note from the Author or Editor:
yes, looks like the markup wasn't quite right.

Pierre Nugues  Aug 12, 2015  Mar 10, 2017
Printed
Page 89
1st statement in function beta_pdf

If beta_pdf is called with [x=0 and alpha<1] or with [x=1 and beta<1], the function crashes, because Python does not permit 0 to be raised to a negative power. An easy fix is to change the 1st statement to
if x <= 0 or x >= 1:

Note from the Author or Editor:
I agree, change the first line of the beta_pdf function to

if x <= 0 or x >= 1:

Stephen N. Cole  Jun 15, 2016  Mar 10, 2017
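
A sketch of `beta_pdf` with the corrected guard from the page 89 entry above. The normalizing helper `B` and the rest of the function body are assumptions based on the standard Beta density; only the first statement comes from the errata.

```python
import math

def B(alpha, beta):
    """Normalizing constant so that the total probability is 1."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x <= 0 or x >= 1:   # no weight at or outside the endpoints
        return 0
    # Safe here: both x and 1 - x are strictly positive, so a negative
    # exponent can no longer raise ZeroDivisionError.
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)

at_zero = beta_pdf(0, 0.5, 2)      # alpha < 1 at x = 0
at_one = beta_pdf(1, 2, 0.5)       # beta < 1 at x = 1
uniform_mid = beta_pdf(0.5, 1, 1)  # Beta(1, 1) is the uniform density
```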
Printed
Page 106
2nd code block

for line in file:
should be
for line in f:

Note from the Author or Editor:
agree, should be

for line in f: # look at each line in the file

Dong Zhou  Apr 16, 2016  Mar 10, 2017
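
For the page 106 entry above, a self-contained sketch of iterating over the file handle named in the `with` statement. The temporary file and its contents are made up for this example.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# a comment\nsome data\n# another comment\n")
    path = tmp.name

starts_with_hash = 0
with open(path, "r") as f:
    for line in f:                  # look at each line in the file
        if line.startswith("#"):    # count lines that begin with '#'
            starts_with_hash += 1

os.remove(path)  # clean up the throwaway file
```

The loop variable must iterate over `f`, the name bound by `with ... as f`; a name such as `file` refers to something else or to nothing at all.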
Printed
Page 108
lines 1 through 14

The script in lines 1 through 9 on page 108 is incorrect, because it does not produce the results printed on lines 11 through 14. Instead, the lines of text in bad_csv.txt get merged into a single line of text - as if the f.write("\n") were missing. One way to correct the error is to change the 6th line to "with open('bad_csv.txt', 'w') as f:" (omitting 'b' from open's 2nd argument). This script does not need 'wb', because it does not use the CSV module. Another (less elegant) resolution is to replace line 9 with f.write("\r\n").

Note from the Author or Editor:
I cannot reproduce this (the code as written works for me); however, I agree that the 'b' parameter does not need to be there, and I am ok with getting rid of the b and changing the line of code to

`with open('bad_csv.txt', 'w') as f:`

Stephen N. Cole  Nov 27, 2016  Mar 10, 2017
Printed
Page 147
3rd paragraph from bottom

figure out what do
should be
figure out what to do

Note from the Author or Editor:
just like the errata says

Dong Zhou  Apr 16, 2016  Mar 10, 2017
Printed
Page 167
2nd paragraph from bottom

all 'spam' in this paragraph should be 'non-spam'

Note from the Author or Editor:
agree, in that paragraph both "additional spams" should instead be "additional non-spams" (the spam in "spam probabilities" can stay as is)

Dong Zhou  Apr 17, 2016  Mar 10, 2017
Printed
Page 181
Paragraph after "we will underestimate beta(1)"

The paragraph states that the predictions "would tend to be too small for users who work many hours and too large for users who work few hours, because Beta(2) > 0 and we 'forgot' to include it."

However, in the actual model Beta(2) is < 0 (the example on page 182 confirms this), since in the example "people who work more hours spend less time on the site".

Note from the Author or Editor:
yes, that whole paragraph is wrong, it should be

Think about what would happen if we made predictions using the single variable model with the "actual" value of beta_1. (That is, the value that arises from minimizing the errors of what we called the "actual" model.) The predictions would tend to be way too large for users who work many hours and a little too large for users who work few hours, because beta_2 < 0 and we "forgot" to include it. Because work hours is positively correlated with number of friends, this means the predictions tend to be way too large for users with many friends, and only slightly too large for users with few friends.

J.R. Scally  Jan 21, 2016  Mar 10, 2017
Printed
Page 185
2nd code block

'unemployed' should be 'work_hours'

Note from the Author or Editor:
agree, the comment on the third line should say work hours

# 0.131, # work hours, actual error = 0.127

Dong Zhou  Apr 18, 2016  Mar 10, 2017
PDF
Page 219
backpropagate code listing

```
# back-propagate errors to hidden layer
hidden_deltas = [hidden_output * (1 - hidden_output) * dot(output_deltas, [n[i] for n in output_layer])
for i, hidden_output in enumerate(hidden_outputs)]
```

`output_layer` is not defined in the `backpropagate` function, and not passed in, hence running the example produces `NameError: global name 'output_layer' is not defined`

Note from the Author or Editor:
the line of code

dot(output_deltas, [n[i] for n in output_layer])

should be replaced with

dot(output_deltas, [n[i] for n in network[-1]])

Anonymous  Feb 26, 2016  Mar 10, 2017
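
The replacement in the page 219 entry above can be isolated in a small sketch. `hidden_layer_deltas` is a hypothetical helper introduced only for illustration (the book computes this inside `backpropagate`), and the network weights are made up. It assumes `network[-1]` is the output layer, with each neuron stored as a list of weights followed by a bias.

```python
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def hidden_layer_deltas(network, hidden_outputs, output_deltas):
    """Back-propagate output errors to the hidden layer."""
    output_layer = network[-1]   # defined locally, so no NameError
    return [hidden_output * (1 - hidden_output) *
            dot(output_deltas, [n[i] for n in output_layer])
            for i, hidden_output in enumerate(hidden_outputs)]

# Made-up network: two hidden neurons, one output neuron.
network = [
    [[0.1, 0.2, 0.0], [0.3, 0.4, 0.0]],   # hidden layer (2 inputs + bias each)
    [[0.5, -0.5, 0.1]],                   # output layer (2 hidden inputs + bias)
]
deltas = hidden_layer_deltas(network, hidden_outputs=[0.5, 0.5],
                             output_deltas=[2.0])
```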
Printed
Page 295
code block for matrix_multiply_mapper and matrix_multiply_reducer

For the function of matrix_multiply_mapper, two matrix indexes should be passed: the row number of A and column number of B

For any nonzero A_ij, all C_ik may be affected, with k being any column index of B

Similarly, for any nonzero B_ij, all C_kj may be affected, with k being any row index of A

In the text, the common dimension was used, which is wrong.

Also, for the function of matrix_multiply_reducer, m is not used.

Note from the Author or Editor:
yes, the code is wrong. here is a fixed version of the functions

https://gist.github.com/joelgrus/cd0558f2fc6eeaea22ba8d286775e6a1

and then at the very bottom of the page you need to change the definition of mapper

mapper = partial(matrix_multiply_mapper, 2, 3)

and at the top of the next page change the definition of reducer

reducer = matrix_multiply_reducer

Dong Zhou  Apr 23, 2016  Mar 10, 2017
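
The idea in the page 295 entry above, that the mapper needs the number of rows of A and the number of columns of B, can be sketched as follows. This is an illustration under stated assumptions and not necessarily identical to the linked gist: the `map_reduce` driver, the `(matrix_name, i, j, value)` element format, and the sample entries are all assumed here.

```python
from collections import defaultdict
from functools import partial

def map_reduce(inputs, mapper, reducer):
    """Run a mapper and reducer over inputs, grouping values by key."""
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

def matrix_multiply_mapper(num_rows_a, num_cols_b, element):
    """element is assumed to be a tuple (matrix_name, i, j, value)."""
    name, i, j, value = element
    if name == "A":
        # A_ij contributes to C_ik for every column k of B
        for k in range(num_cols_b):
            yield (i, k), (j, value)
    else:
        # B_ij contributes to C_kj for every row k of A
        for k in range(num_rows_a):
            yield (k, j), (i, value)

def matrix_multiply_reducer(key, indexed_values):
    """Pair up A and B terms that share an inner index, then sum products."""
    values_by_index = defaultdict(list)
    for index, value in indexed_values:
        values_by_index[index].append(value)
    total = sum(pair[0] * pair[1]
                for pair in values_by_index.values()
                if len(pair) == 2)   # need one term from A and one from B
    if total != 0.0:
        yield key, total

# Sparse entries of a 2x2 matrix A and a 2x3 matrix B (made-up values).
entries = [("A", 0, 0, 3), ("A", 0, 1, 2),
           ("B", 0, 0, 4), ("B", 0, 1, -1), ("B", 1, 0, 10)]

mapper = partial(matrix_multiply_mapper, 2, 3)
reducer = matrix_multiply_reducer
product = dict(map_reduce(entries, mapper, reducer))
```

Here the reducer no longer takes the common dimension `m`, which matches the erratum's observation that `m` was unused.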