Errata for Doing Data Science

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Printed
Page multiple
multiple

These errata were submitted by Philipp Marek via email.

Errata for Doing data science

I mark /deletions/, and *changes*.
This is in UTF8 -- so eg. a CRLF is shown as down-left pointing arrow: ↵


xvii: move 3 words: there is more breadth // than depth *in some cases*

xxi: Forgot to mention "Visual Display of Quantitative Information" ... although listed on p37

2: statis/i/tican

14: 1-4 use different shades of gray, or dashes or something like that

30: observed real-world phenomen*a* (or *a* phenomenon)

32: x in seconds? Don't integrate over minutes

38: http://stat.columbia.edu - everything else on github

43: hypo-thesis, not th-esis (?)
2-3 Huma*n* behavi*or* (nouns)
Trying to read associations fails; put Olympics beneath Olympic records?

44: an extension /of/ or variation of

48: an-swered?

49: Did Doug use ... (... "CPC") -- aren't used in text, no need to explain

50: plot(log(), log()): see http://spacecraft.ssl.umd.edu/akins_laws.html, twice.
"6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker."
bk.homes[which ...] -- indentation of 3rd line wrong
log() <= 5 ... better use <= 1e5 or 100e3 and remove log()

68: 3-6 truth = d*e*gree 2 (top right)

69: x*_2* * x_3

71: x₁², not x²₁

72: you'd have establish*ed* the bins (or have *to* establish)

73: 3-7 doesn't include the points listed above

74: 3-8 use "x" for new guy, this point is already in 3-7

76: Hamming: shoe +s-s => hose, distance is 2
we start with a Google search ... *which to use*.

77: n.points = length(data)
Why not simply use a boolean vector of some length on data?
swap lines: train <- and #define

78: swap cl <- and #
swap true.labels and #

79: # We're using ... comment not helpful

85: http://abt.cm -- why a different link shortener?
we showed how *to* explore and clean

87: remove line setwd()

90: U of Edinb*o*rough?

101: parallel/-/ly

108: WWW::Mechanize, and generally Perl for text extraction

111,112: script could use a few functions

117: *An Empirical...* format different from other book references or titles

129: "non discrete)" is still a comment, wrong format used
c[, 2] - space before "," missing

131: vlist <- use less space to avoid line break, twice

132: "use holdout group" join to previous line
"vars" within for loop?

137: prop/o/agates

140: 6-3 no colors visible. use distinguishable grays?

141: 6-4 no counts visible

147: what does 6-7 show?

151: 6-8 label both axes with text

155: 6-12 factors not distinguishable

156: this_E is unused

176: "Director of Research..." in one line

177: the modeling part isn't *what* we want

183: AIC Info*r*mation

184: a college student ... spend *her* time

191: "column which is our response" is still a comment, has wrong format

194: "Google's Hybrid Approach" title => italic

201: simple but comp*l*ete

215: vr = indentation wrong

236: to a*c*cept

241: the Predicted=False row should have FN, TN

246: 2nd mouse/keyboard is not needed, other person should read and think, not type simultaneously

247: discrep*a*ncy

251: partic-ipate ?

254: digital media at␣Columbia (space missing)

281: "Overlapping..." title => italic

287: people that take/s/ some drug -- people take, not the population

293: "Oral..." title => italic

304: (hers is shown ... *)*

341: line 44 is hard to read, code doesn't match other formatting

349: Map*R*educe

351ff: Index:
"Amazon Mechanical Turk" in Amazon
bunch together "causal ..."
bunch together "chaos ..."
"Protocol buffers" instead of "protobuf"
and probably some more.

Note from the Author or Editor:
32: x in seconds? Don't integrate over minutes

Cathy: change the "measured in seconds" to "measured in minutes" in the above paragraph.

50: plot(log(), log()): see http://spacecraft.ssl.umd.edu/akins_laws.html, twice.
"6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker."
bk.homes[which ...] -- indentation of 3rd line wrong
log() <= 5 ... better use <= 1e5 or 100e3 and remove log()

Cathy: please indent after the "bk.homes" line as the above lines are indented. Otherwise fine to ignore these suggestions.

73: 3-7 doesn't include the points listed above

Cathy: That's true. We might wanna change it to be more reasonable. I don't have the original data that was plotted here.

74: 3-8 use "x" for new guy, this point is already in 3-7

Cathy: Can erase the "?" point in 3-7 for clarity.

76: Hamming: shoe +s-s => hose, distance is 2

Cathy: this is false. Ignore.

77: n.points = length(data)
Why not simply use a boolean vector of some length on data?

Cathy: ignore

108: WWW::Mechanize, and generally Perl for text extraction

Cathy: ignore.

111,112: script could use a few functions

Cathy: please write your own book with a few functions.

141: 6-4 no counts visible

Cathy: that's ok.

147: what does 6-7 show?

Cathy: X-axis should be labeled "time in seconds"

246: 2nd mouse/keyboard is not needed, other person should read and think, not type simultaneously

Cathy: ludicrous comment. Ignore. Also this is on page 245.

Rachel Schutt
 
Nov 20, 2013  Dec 13, 2013
PDF
Page multiple
multiple

Page Error Note
p.207: "star-up" should be "start-up"
p.359: "want achieve" should be "want to achieve"
p.162-163: section headers are different sizes; "Exercise: GetGlue and Timestamped Event Data" and "Exercise: Financial Data" should be the same font size
p.68: "dgree" in Figure 3-6 should be "degree"
p.32-33: inconsistent capitalization of random variables: x vs X
p.21-22: indentation is odd and seems arbitrary
index: "curse of dimensionality" is missing
p.282: "That experimental infrastructure" is strange phrasing

Rachel Schutt
 
Nov 20, 2013  Dec 13, 2013
PDF
Page xxi
2nd bullet

Introduction to Machine Learning (Adaptive Computation
and Machine Learning) by Ethem Alpaydim (MIT Press)

It is not Alpaydım, it's Alpaydın.

Tolga Bakkaloglu  Jan 01, 2014  Oct 10, 2014
PDF
Page xx
Line 1-2 from the top

There is a typo in the surname "Vendenberghe". The correct surname is Vandenberghe.

Zdzislaw Ploski  Aug 13, 2014  Oct 10, 2014
Printed
Page 48
numbered paragraph 2, first sentence

The parenthetical reads, "typical in a startup when its still building its product". That first "its" should be "it's".

George Schneiderman  Jun 17, 2016 
PDF
Page 49
last line

In order to use the count() function in the line:

count(is.na(bk$SALE.PRICE.N))

you need to load the plyr package (see http://www.miskatonic.org/2012/09/24/counting-and-aggregating-r/):

library(plyr)

(I'm using R Studio on a Mac, and the count() function doesn't work without first specifying the plyr library)

Note from the Author or Editor:
please add

library(plyr)

before "require(gdata)" and after "# Author: Benjamin Reddy," on its own line.
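
For readers who would rather not load plyr just to count missing values, the same tally can be sketched in base R. This is only an illustration: the vector `sale_price` is made-up data standing in for bk$SALE.PRICE.N, and it is not the book's code.

```r
# Hypothetical stand-in for bk$SALE.PRICE.N (not the book's data)
sale_price <- c(350000, NA, 1200000, NA, NA, 780000)

# Base-R tally of missing (TRUE) vs. non-missing (FALSE) values; no package needed
na_tally <- table(is.na(sale_price))
print(na_tally)

# Just the number of missing values
n_missing <- sum(is.na(sale_price))
print(n_missing)
```

Unlike plyr's count(), table() returns a named table rather than a data frame, so downstream code that expects columns would need adjusting.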

Shafique Jamal  Jan 26, 2014  Oct 10, 2014
Printed
Page 69
model formula

please change:
model <- lm(y ~ x_1 + x_2 + x_3 + x2_*x_3)

to
model <- lm(y ~ x_1 + x_2 + x_3 + x_2:x_3)
or
model <- lm(y ~ x_1 + x_2*x_3)

Note from the Author or Editor:
Happy to change this to

model <- lm(y ~ x_1 + x_2*x_3)

for simplicity, but I don't think it was wrong as is.
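
The author's remark that the original was not wrong can be checked with a small sketch, assuming the `x2_` in the submitted line is meant as `x_2`. In an R formula, `a*b` expands to `a + b + a:b`, and repeated terms are dropped, so all three spellings should describe the same model. The variable names below are taken from the errata; no data are needed because terms() only parses the formula.

```r
# Compare the model terms that R derives from each formula.
original   <- y ~ x_1 + x_2 + x_3 + x_2*x_3   # assumed intended form of the printed line
simplified <- y ~ x_1 + x_2*x_3               # form the author agreed to use
explicit   <- y ~ x_1 + x_2 + x_3 + x_2:x_3   # form with the interaction written out

labels_original   <- attr(terms(original), "term.labels")
labels_simplified <- attr(terms(simplified), "term.labels")
labels_explicit   <- attr(terms(explicit), "term.labels")

print(labels_original)
print(identical(labels_original, labels_simplified))
print(identical(labels_original, labels_explicit))
```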

Matthias Kohl  Dec 14, 2013  Oct 10, 2014
PDF
Page 88
3rd line of code

This line doesn't seem to put any info in the variable 'dup_add':

dup_add <- mt_add[mt_add$dup,1]

After entering this line, when I type 'dup_add' I get:

character(0)

And when I type 'dup_add[1]' or 'dup_add[2]' I get the following output:

[1] NA

I think that the formula for 'dup_add' given is wrong. Can you confirm?

To get rid of the duplicates, here is what I did instead:

dup_add2 <- mt_add[which(dup==TRUE),]
mt_add2 <- mt_add[(mt_add$address.noapt != dup_add2[[1]][1] & mt_add$address.noapt != dup_add2[[1]][2]),]

(This drops 4 observations (rows) instead of two - it drops BOTH copies of EACH duplicate. I'm a novice with R, so I still have to figure out how to drop only ONE copy of each duplicate, for a total of 2 rows dropped)

Note from the Author or Editor:
This line should read:

dup_add <- mt_add[dup,1]
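
On the submitter's follow-up question of how to drop only one copy of each duplicate, a minimal sketch follows. It assumes `dup` is a logical vector produced by duplicated(), which the errata does not confirm, and it uses a made-up toy data frame rather than the book's data.

```r
# Hypothetical toy data; the column name address.noapt follows the errata text
mt_add <- data.frame(
  address.noapt = c("1 MAIN ST", "2 OAK AVE", "1 MAIN ST", "3 ELM RD", "2 OAK AVE"),
  sale.price    = c(500000, 750000, 500000, 620000, 750000),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each address
dup <- duplicated(mt_add$address.noapt)

# Addresses that occur more than once (same indexing as in the author's note)
dup_add <- mt_add[dup, 1]

# Keep the first copy of each address, dropping only the repeats
mt_add_unique <- mt_add[!dup, ]
print(mt_add_unique)
```

Indexing with `!dup` removes two rows here (the repeats), whereas filtering on the address values themselves removes all four rows that share a duplicated address.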

Shafique Jamal  Jan 28, 2014  Oct 10, 2014
PDF
Page 95
2d paragraph 1st sentence

"Thinking back to the previous chapter, in order to use liner regression,..."

should be 'linear'

donald f caldwell  Dec 01, 2013  Dec 13, 2013
Printed
Page 103
1st para in section "Fancying..." and subsequent

The notation for the number of occurrences of the jth word in all emails would not be ambiguous if it were n_j ("n subscript j"), rather than n_c.

Otherwise the ratios of counts used to compute probabilities theta_j and theta_k for two words, j and k, in spam would seem to have the same denominator. Thus, it is better to write,
p(word_j,spam) = theta_j = n_jc/n_j
and
p(word_k,spam) = theta_k = n_kc/n_k

n_c seems more likely to represent the number of spam emails.

Note from the Author or Editor:
This is terrible notation! Please change all the n_{j c} to n_{j s} and please also change all the n_{c} to n_{j c}.

This is for the entire section called "Fancy It Up: Laplace Smoothing"

So it should read "where n_js denotes the number of times that word appears in a spam email and n_jc denotes the number of times that word appears in any email"
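
A small sketch of the ratio in the corrected notation may help. It assumes the common additive form of Laplace smoothing, (n_js + alpha) / (n_jc + beta); the pseudo-counts alpha and beta and all the counts below are illustrative values, not figures from the book.

```r
# n_js: number of times word j appears in a spam email
# n_jc: number of times word j appears in any email
# alpha, beta: smoothing pseudo-counts (illustrative defaults, an assumption)
theta_smoothed <- function(n_js, n_jc, alpha = 1, beta = 2) {
  (n_js + alpha) / (n_jc + beta)
}

theta_raw    <- 3 / 10                               # unsmoothed ratio n_js / n_jc
theta_hat    <- theta_smoothed(n_js = 3, n_jc = 10)  # (3 + 1) / (10 + 2)
theta_unseen <- theta_smoothed(n_js = 0, n_jc = 0)   # defined even for an unseen word

print(c(raw = theta_raw, smoothed = theta_hat, unseen = theta_unseen))
```

The point of the smoothing is visible in the last case: the raw ratio 0/0 is undefined, while the smoothed value falls back to alpha / beta.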

leif wennerberg  May 25, 2014  Oct 10, 2014
Printed
Page 103
para before last equation

Delete the equal sign and right side of the equation in the first sentence. It should read, "... values theta_j is the answer...". Otherwise the question following makes no sense: the lead in has answered it.

Note from the Author or Editor:
Agreed!

That sentence should start:

In other words, the vector of values θ_j is the answer...

leif wennerberg  May 25, 2014  Oct 10, 2014
PDF
Page 108
lines 17-21

On page 108 a part of paragraph is repeated (four lines
from "Represent each image..." to "...between 0 and 255").

Zdzisław Płoski  May 06, 2014  Oct 10, 2014
ePub
Page 119
United States

With respect to the errata I just submitted, it appears that the problem is my GitHub ignorance. Shift-clicking on the file doesn't have the obvious semantics, but the "download zipfile" button on the right side of the pane does. So my request would be for a slight change to the text to make this clear for us cvs, sccs, svn, bitkeeper folks who didn't get with Git.

Note from the Author or Editor:
GitHub README adjusted to indicate the Download Zip button.

Keith Bierman  Oct 31, 2013  Dec 03, 2013
PDF
Page 119
Line 14 from the top

There is: "Recall that in Chapter 3".
There should be "Recall that in Chapter 4".

Zdzislaw Ploski  Jul 31, 2014  Oct 10, 2014
ePub
Page 121
United States

The equation after "In order to model the data, you need to work with a slightly more general function that expresses the relationship between the data and a probability of a click. Start by defining:" reads simply "z".

The PDF is fine; only the ePub is affected. But this makes this part of the ePub incomprehensible.

Adam Merberg  Aug 29, 2014  Oct 10, 2014
PDF
Page 153
Line 16 from the bottom

The expression -ln(2) begins with a hyphen. It should begin with a minus sign.

Note from the Author or Editor:
I don't know the difference between a hyphen and a minus sign!

So let's go with it, what the heck.

Zdzislaw Ploski  Aug 01, 2014  Oct 10, 2014
PDF
Page 156
lines 10-9 from the bottom

There is: "your buying it has actually changed the process, through
your market impact, and decreased the signal". Is it right? "Decreased"? Not "increased"?

Note from the Author or Editor:
This sentence should read:

"But if you think about it, your buying it has actually changed the process, through your market impact, and decreased the signal you were anticipating, at least if the other market players bought it because it looked cheap to them at the previous price; you brought up the price a bit, so you might expect them to buy less in response, which means the overall signal is smaller."

Zdzislaw Ploski  May 24, 2014  Oct 10, 2014
PDF
Page 160
Line 10 from the bottom

There is: "we solve for beta to get".
Should be: "we solve for <Greek letter 'beta'> to get" as in several other places before.

Zdzislaw Ploski  May 25, 2014  Oct 10, 2014
PDF
Page 161
Line 10 from the bottom (inside formula)

There is: ")/". Should be: ")".

Zdzislaw Ploski  May 25, 2014  Oct 10, 2014
PDF
Page 162
line 16 from the bottom

In the sentence: "Here's some R code to look at the first 10 rows in R" words "in R" are redundant.

Note from the Author or Editor:
change to "Here's some R code to look at the first 10 rows"

Zdzislaw Ploski  May 19, 2014  Oct 10, 2014
PDF
Page 173
Line 7 from the bottom

There is: ascii. Better notation here: ASCII.

Zdzislaw Ploski  Aug 03, 2014  Oct 10, 2014
PDF
Page 191
Line 17 from the bottom (not counting an interline)

There is "fir" in place of a predicate in a comment. Does it mean "fire"?

Zdzislaw Ploski  Aug 03, 2014  Oct 10, 2014
PDF
Page 200
16g

There is: "They were questions".
There should be, I think: "There were questions".

Zdzislaw Ploski  Jun 03, 2014  Oct 10, 2014
PDF
Page 201
Lines 9-10 from the top

In the sentence: "there are lines from a user to an item if that
user has expressed an opinion about that item" words "are lines" should be replaced by "is a line" (cp. Fig. 8-1).

Zdzislaw Ploski  Jun 04, 2014  Oct 10, 2014
PDF
Page 204
Lines 4 and 1 from the bottom

The order of indexes concerning three "f" (user attributes) is inverted (cp. their order three lines above). Is it correct?

Note from the Author or Editor:
Yes the last line on page 204 should read

p_i = \beta_1 f_{i, 1} + \beta_2 f_{i,2}+ \beta_3 f_{i, 3}.

right now we see "f_{1, i}" instead of "f_{i, 1}" for example.

Zdzislaw Płoski  Aug 04, 2014  Oct 10, 2014
PDF
Page 205
17 from the top

There is: "the coefficients on one can be 100,000".
There should (?) be: "the coefficient on one can be 100,000".

Zdzislaw Ploski  Jun 03, 2014  Oct 10, 2014
PDF
Page 209
lines 2-1 from the bottom

In the sentence "the age vectors of all the users will be a row in V", is the plural of "vector" correct?
Why "age vector"? Isn't age a scalar value?

Note from the Author or Editor:
The parenthetical phrase at the bottom of the page should read:

so the vector of ages of all the users will be a row in V

Zdzislaw Ploski  Jun 03, 2014  Oct 10, 2014
PDF
Page 222
Line 6 from the top

There is: "a cool example of how ideally, data science integrates".
Should be: "a cool example of how ideally data science integrates".

Zdzislaw Ploski  Jun 14, 2014  Oct 10, 2014
PDF
Page 229
Lines 5, 10 and 11 from the top

In line 5 there is "bit.ly". In lines 10 and 11: "bitly". Suggestion: use uniform notation everywhere.

Zdzislaw Ploski  Jun 14, 2014  Oct 10, 2014
PDF
Page 245
line 8 from the bottom

There is: "using git. Learn about git".
Better: "using Git. Learn about Git".

Zdzislaw Ploski  Jun 09, 2014  Oct 10, 2014
PDF
Page 269
Lines 19-20 from the top

There is a typo in the surname "Kolazcyk". The correct surname is Kolaczyk.

Zdzislaw Ploski  Jun 21, 2014  Oct 10, 2014
PDF
Page 277
Line 6 from the bottom

There is: "say on". Should be: "say in" (cp. appropriate site).

Zdzislaw Ploski  Jun 24, 2014  Oct 10, 2014
Printed
Page 287
last line

the causal effect is 10 percentage points, not 10%.

Note from the Author or Editor:
Correct, it should read "10 percentage points."
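
The distinction can be illustrated with made-up numbers (these are not the figures from page 287): a change from 20% to 30% is 10 percentage points, but a 50% relative increase.

```r
# Hypothetical outcome rates, expressed in percent (illustration only)
control_pct <- 20
treated_pct <- 30

# Absolute difference, measured in percentage points
diff_points <- treated_pct - control_pct

# Relative change, measured in percent of the control rate
relative_pct <- (treated_pct - control_pct) / control_pct * 100

print(c(percentage_points = diff_points, relative_percent = relative_pct))
```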

Stephanie Eckman  Sep 11, 2014  Oct 10, 2014
PDF
Page 296
lines 5-7 from the top

The quotation "After adjustment for length of use, users of oral contraceptives were at least twice the risk of clotting compared with users of other kinds of oral contraceptives" is missing at least the phrase "with desogestrel, gestodene, or drospirenone"; without it the quoted sentence is unclear. The end of the quoted sentence has also been changed (shortened) without any indication.

Note from the Author or Editor:
The quote in question should be adjusted to read:

After adjustment for length of use, users of oral contraceptives with desogestrel, gestodene, or drospirenone were at least at twice the risk of venous thromboembolism compared with users of oral contraceptives with levonorgestrel.

Zdzislaw Ploski  Jun 27, 2014  Oct 10, 2014
PDF
Page 298
Lines 1-2 from the top

The sentence "The kinds of decisions they tweaked were of the following types" reads awkwardly because of the "kinds ... of ... types" construction. Perhaps "The kinds of decisions they tweaked were as follows" would be better.

Zdzislaw Ploski  Jun 27, 2014  Oct 10, 2014
PDF
Page 301
Line 4 from the bottom

There is a word "medicare" (starting from a lower case "m"). Is it about Medicare (cp. www.medicare.gov)?

Zdzislaw Ploski  Jun 28, 2014  Oct 10, 2014
PDF
Page 313
Line 16 from the top

Is the word "clean" mandatory in the sentence "The best practice is to start from scratch with clean, raw data"? Isn't "clean data" the antithesis of "raw data" in the context of the book?

Note from the Author or Editor:
Please replace the word "clean" with the word "unfiltered".

Zdzislaw Ploski  Aug 10, 2014  Oct 10, 2014
PDF
Page 315
Lines 6-7 from the top

In the sentence "if the vast majority is of binary outcomes are 1" is
the word "is" mandatory?

Note from the Author or Editor:
delete "is" from sentence

Zdzislaw Ploski  Jun 30, 2014  Oct 10, 2014
PDF
Page 319
Lines 15-16 from the top

In the sentence: "You'd like to save money and only send money to people who are likely to give" the second word "money" should be replaced with "letter".

Note from the Author or Editor:
change sentence to "...and only send a letter to people..."

Zdzislaw Ploski  Jun 30, 2014  Oct 10, 2014
PDF
Page 323
Paragraph that starts

Please footnote the end of that first sentence as follows:

By some estimates, one or two patients died per week in a certain
smallish town because of the lack of information flow between the
hospital’s emergency room and the nearby mental health clinic \footnote{Andrew Gelman thinks this parable is unlikely, and he wrote up a response which you can read here: http://andrewgelman.com/2014/01/24/parables-vs-data/.}.

Cathy O'Neil
 
Sep 25, 2014  Oct 10, 2014
PDF
Page 329
Line 17 from the bottom

What does it mean: "to shave off nanoseconds 10^-9"? That
nanosecond equals 10^-9 of a second? (It is). 10^-9 of [one] nanosecond?? Something else?

Note from the Author or Editor:
This sentence should read:

Once you get into the optimization process, you find yourself tuning MapReduce jobs to shave off nanoseconds from repetitive processes because you're dealing with petabytes of data.

Zdzislaw Ploski  Jul 03, 2014  Oct 10, 2014
PDF
Page 330
Lines 2-4 from the top

The following sentence is needlessly redundant: "a record with a person living in zip code 90210 who clicked on an ad would get emitted to (90210,{1,1}) if that person saw an ad and clicked, or (90210,{0,1}) if they saw an ad and didn't click." It states twice that the person clicked on an ad.

Note from the Author or Editor:
change sentence to: "You could run MapReduce keyed by zip code so that a record with a person living in zip code 90210 would get emitted to (90210,{1,1}) if that person saw an ad and clicked, or (90210,{0,1}) if they saw an ad and didn't click."
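
The keyed emit described in that sentence can be sketched as a toy, single-machine map, shuffle, and reduce in plain R. This is not Hadoop code and not the book's implementation; the records and user IDs are invented, and each emitted value is read as {clicked, 1}, that is, one click flag and one impression.

```r
# Hypothetical impression log: one row per ad impression
records <- data.frame(
  zip     = c("90210", "90210", "10027", "90210", "10027"),
  user    = c("user_5321", "user_17", "user_88", "user_5321", "user_90"),
  clicked = c(1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Map step: emit key = zip code, value = c(clicked, 1)
mapped <- lapply(seq_len(nrow(records)), function(i) {
  list(key = records$zip[i], value = c(records$clicked[i], 1))
})

# Shuffle step: group the emitted values by key
keys    <- vapply(mapped, function(kv) kv$key, character(1))
grouped <- split(lapply(mapped, function(kv) kv$value), keys)

# Reduce step: sum the (clicks, impressions) pairs for each key
reduced <- lapply(grouped, function(values) Reduce("+", values))

# Click-through rate per zip code
ctr <- vapply(reduced, function(v) v[1] / v[2], numeric(1))
print(ctr)
```

Keying on a zip code and user pair instead would only change the key in the map step, for example to paste(records$zip[i], records$user[i]).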

Zdzislaw Ploski  Jul 03, 2014  Oct 10, 2014
PDF
Page 330
Line 13 from the top

Is the expression ((90210], user_5321} <- {1,1} correct? Are the parentheses right?

Note from the Author or Editor:
That expression should be rewritten as:

({90210,user_5321}, {1, 1})

Zdzislaw Ploski  Aug 11, 2014  Oct 10, 2014
PDF
Page 334
Lines 16-15 from the bottom

Something is missing from the sentence: "Writing MapReduce in the Java API not pleasant". Is the predicate missing?

Note from the Author or Editor:
"Writing MapReduce in the Java API is not pleasant."

Zdzislaw Ploski  Jul 05, 2014  Oct 10, 2014
PDF
Page 335
Lines 12-11 from the bottom

There is: "Github". Should be: "GitHub".

Zdzislaw Ploski  Jul 04, 2014  Oct 10, 2014
PDF
Page 341
Line 10 from the top

There is: "git". Should be: "Git".

Zdzislaw Ploski  Jul 07, 2014  Oct 10, 2014
PDF
Page 344
Line 19 from the top

There is: "In addition". Should be: ". In addition".

Note from the Author or Editor:
add period between equation and "In addition" as indicated

Zdzislaw Ploski  Jul 07, 2014  Oct 10, 2014