The errata list is a list of errors and their corrections that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Version |
Location |
Description |
Submitted By |
Date submitted |
Date corrected |
Printed |
Page multiple
multiple |
These errata were submitted by Philipp Marek via email.
Errata for Doing data science
I mark /deletions/, and *changes*.
This is in UTF8 -- so eg. a CRLF is shown as down-left pointing arrow: ↵
xvii: move 3 words: there is more breath // than depth *in some cases*
xxi: Forgot to mention "Visual Display of Quantitative Information" ... although listed on p37
2: statis/i/tican
14: 1-4 use different shades of gray, or dashes or something like that
30: observed real-world phenomen*a* (or *a* phenomenon)
32: x in seconds? Don't integrate over minutes
38: http://stat.columbia.edu - everything else on github
43: hypo-thesis, not th-esis (?)
2-3 Huma*n* behavi*or* (nouns)
Trying to read associations fails; put Olympics beneath Olympic records?
44: an extension /of/ or variation of
48: an-swered?
49: Did Doug use ... (... "CPC") -- aren't used in text, no need to explain
50: plot(log(), log()): see http://spacecraft.ssl.umd.edu/akins_laws.html, twice.
"6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker."
bk.homes[which ...] -- indentation of 3rd line wrong
log() <= 5 ... better use <= 1e5 or 100e3 and remove log()
68: 3-6 truth = d*e*gree 2 (top right)
69: x*_2* * x_3
71: x₁�, not x�₁
72: you'd have establish*ed* the bins (or have *to* establish)
73: 3-7 doesn't include the points listed above
74: 3-8 use "x" for new guy, this point is already in 3-7
76: Hamming: shoe +s-s => hose, distance is 2
we start with a Google search ... *which to use*.
77: n.points = length(data)
Why not simply use a boolean vector of some length on data?
swap lines: train <- and #define
78: swap cl <- and #
swap true.labels and #
79: # We're using ... comment not helpful
85: http://abt.cm -- why a different link shortener?
we showed how *to* explore and clean
87: remove line setwd()
90: U of Edinb*o*rough?
101: parallel/-/ly
108: WWW::Mechanize, and generally Perl for text extraction
111,112: script could use a few functions
117: *An Empirical...* format different from other book references or titles
129: "non discrete)" is still a comment, wrong format used
c[, 2] - space before "," missing
131: vlist <- use less space to avoid line break, twice
132: "use holdout group" join to previous line
"vars" within for loop?
137: prop/o/agates
140: 6-3 no colors visible. use distinguishable grays?
141: 6-4 no counts visible
147: what does 6-7 show?
151: 6-8 label both axes with text
155: 6-12 factors not distinguishable
156: this_E is unused
176: "Director of Research..." in one line
177: the modeling part isn't *what* we want
183: AIC Info*r*mation
184: a college studen ... spend *her* time
191: "column which is our response" is still a comment, has wrong format
194: "Google's Hybrid Approach" title => italic
201: simple but comp*l*ete
215: vr = indentation wrong
236: to a*c*cept
241: the Predicted=False row should have FN, TN
246: 2nd mouse/keyboard is not needed, other person should read and think, not type simultaneously
247: discrep*a*ncy
251: partic-ipate ?
254: digital media at␣Columbia (space missing)
281: "Overlapping..." title => italic
287: people that take/s/ some drug -- people take, not the population
293: "Oral..." title => italic
304: (hers is shown ... *)*
341: line 44 is hard to read, code doesn't match other formatting
349: Map*R*educe
351ff: Index:
"Amazon Mechanical Turk" in Amazon
bunch together "causal ..."
bunch together "chaos ..."
"Protocol buffers" instead of "prtobuf"
and probably some more.
Note from the Author or Editor:
32: x in seconds? Don't integrate over minutes
Cathy: change the "measured in seconds" to "measured in minutes" in the above paragraph.
50: plot(log(), log()): see http://spacecraft.ssl.umd.edu/akins_laws.html, twice.
"6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker."
bk.homes[which ...] -- indentation of 3rd line wrong
log() <= 5 ... better use <= 1e5 or 100e3 and remove log()
Cathy: please indent after the "bk.homes" line as the above lines are indented. Otherwise fine to ignore these suggestions.
73: 3-7 doesn't include the points listed above
Cathy: That's true. We might wanna change it to be more reasonable. I don't have the original data that was plotted here.
74: 3-8 use "x" for new guy, this point is already in 3-7
Cathy: Can erase the "?" point in 3-7 for clarity.
76: Hamming: shoe +s-s => hose, distance is 2
Cathy: this is false. Ignore.
77: n.points = length(data)
Why not simply use a boolean vector of some length on data?
Cathy: ignore
108: WWW::Mechanize, and generally Perl for text extraction
Cathy: ignore.
111,112: script could use a few functions
Cathy: please write your own book with a few functions.
141: 6-4 no counts visible
Cathy: that's ok.
147: what does 6-7 show?
Cathy: X-axis should be labeled "time in seconds"
246: 2nd mouse/keyboard is not needed, other person should read and think, not type simultaneously
Cathy: ludicrous comment. Ignore. Also this is on page 245.
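On the page 76 item above: Hamming distance counts the positions at which two strings of equal length differ, which is not the same as counting insertions and deletions. A minimal R sketch of that definition follows; the helper name hamming_distance is made up for illustration and does not come from the book.

```r
# Hypothetical helper: Hamming distance between two strings of equal length,
# i.e. the number of positions where the characters differ.
hamming_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  sum(chars_a != chars_b)
}

hamming_distance("shoe", "hose")  # 3: only the final "e" lines up
```

Comparing position by position, s/h, h/o and o/s differ and e/e matches, which gives 3 rather than 2.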
|
Rachel Schutt |
Nov 20, 2013 |
Dec 13, 2013 |
PDF |
Page multiple
multiple |
Page Error Note
p.207 star-up should be "start-up"
p.359 want achieve should be "want to achieve"
p.162-163 section headers are different sizes "Exercise: GetGlue and Timestamped Event Data" and "Exercise: Financial Data" should be same size font
p.68 dgree In figure 3-6, should be "degree"
p.32-33 inconsistent capitalization of random variables: x vs X
p.21-22 indentation is odd and seems arbitrary
index curse of dimensionality missing
p.282 "That experimental infrastructure" strange phrasing
|
Rachel Schutt |
Nov 20, 2013 |
Dec 13, 2013 |
PDF |
Page xxi
2nd bullet |
Introduction to Machine Learning (Adaptive Computation
and Machine Learning) by Ethem Alpaydim (MIT Press)
It is not Alpaydım, it's Alpaydın.
|
Tolga Bakkaloglu |
Jan 01, 2014 |
Oct 10, 2014 |
PDF |
Page xx
Line 1-2 from the top |
There is a typo in the surname "Vendenberghe". The correct surname is Vandenberghe.
|
Zdzislaw Ploski |
Aug 13, 2014 |
Oct 10, 2014 |
Printed |
Page 48
numbered paragraph 2, first sentence |
The parenthetical reads, "typical in a startup when its still building its product". That first "its" should be "it's".
|
George Schneiderman |
Jun 17, 2016 |
|
PDF |
Page 49
last line |
In order to use the count() function in the line:
count(is.na(bk$SALE.PRICE.N))
you need to load the plyr package (see http://www.miskatonic.org/2012/09/24/counting-and-aggregating-r/)
library(plyr)
(I'm using R Studio on a Mac, and the count() function doesn't work without first specifying the plyr library)
Note from the Author or Editor: please add
library(plyr)
before "require(gdata)" and after "# Author: Benjamin Reddy," on its own line.
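For readers who would rather not load plyr, a similar tally can be produced in base R. This is only a sketch: sale_price is a made-up vector standing in for bk$SALE.PRICE.N, and the approach is an alternative, not the book's code.

```r
# Made-up stand-in for bk$SALE.PRICE.N (the real column comes from the book's data)
sale_price <- c(250000, NA, 410000, NA, NA, 99000)

# table() tallies the FALSE/TRUE values of the missingness indicator
na_tally <- table(is.na(sale_price))

# sum() on a logical vector counts the TRUE values directly
n_missing <- sum(is.na(sale_price))
```

Both calls rely only on functions that ship with base R, so they work whether or not plyr is attached.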
|
Shafique Jamal |
Jan 26, 2014 |
Oct 10, 2014 |
Printed |
Page 69
model formula |
please change:
model <- lm(y ~ x_1 + x_2 + x_3 + x2_*x_3)
to
model <- lm(y ~ x_1 + x_2 + x_3 + x_2:x_3)
or
model <- lm(y ~ x_1 + x_2*x_3)
Note from the Author or Editor: Happy to change this to
model <- lm(y ~ x_1 + x_2*x_3)
for simplicity, but I don't think it was wrong as is.
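The point that the variants are equivalent can be checked without fitting anything. In an R formula, a*b expands to a + b + a:b, so the three specifications below should yield the same model terms. A minimal sketch, assuming the printed "x2_" was meant as "x_2"; the helper term_labels is introduced here only for illustration.

```r
# In an R formula, a*b expands to a + b + a:b, so these three
# specifications describe the same set of model terms.
f_original <- y ~ x_1 + x_2 + x_3 + x_2*x_3  # assumed intent of the printed line
f_explicit <- y ~ x_1 + x_2 + x_3 + x_2:x_3
f_compact  <- y ~ x_1 + x_2*x_3

# Hypothetical helper: list the terms a formula expands to
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(f_original)
term_labels(f_explicit)
term_labels(f_compact)
```

No data are needed here, because terms() works on the formula alone.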
|
Matthias Kohl |
Dec 14, 2013 |
Oct 10, 2014 |
PDF |
Page 88
3rd line of code |
This line doesn't seem to put any info in the variable 'dup_add':
dup_add <- mt_add[mt_add$dup,1]
After entering this line, when I type 'dup_add' I get:
character(0)
And when I type 'dup_add[1]' or 'dup_add[2]' I get the following output:
[1] NA
I think that the formula for 'dup_add' given is wrong. Can you confirm?
To get rid of the duplicates, here is what I did instead:
dup_add2 <- mt_add[which(dup==TRUE),]
mt_add2 <- mt_add[(mt_add$address.noapt != dup_add2[[1]][1] & mt_add$address.noapt != dup_add2[[1]][2]),]
(This drops 4 observations (rows) instead of two - it drops BOTH copies of EACH duplicate. I'm a novice with R, so I still have to figure out how to drop only ONE copy of each duplicate, for a total of 2 rows dropped)
Note from the Author or Editor: This line should read:
dup_add <- mt_add[dup,1]
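The behavior reported above can be reproduced on a toy data frame. The sketch below assumes that dup is a standalone logical vector (built here with duplicated()) rather than a column of mt_add; the data are made up and only stand in for the book's data set.

```r
# Made-up stand-in for the book's mt_add data frame
mt_add <- data.frame(
  address.noapt = c("1 Main St", "2 Oak Ave", "1 Main St", "3 Elm Rd", "2 Oak Ave"),
  sale.price = c(100, 200, 100, 300, 200),
  stringsAsFactors = FALSE
)

# Assumption: dup is a standalone logical vector flagging repeated rows
dup <- duplicated(mt_add)

# mt_add has no column named "dup", so mt_add$dup is NULL and selects no rows
broken <- mt_add[mt_add$dup, 1]

# Indexing with the standalone vector returns the duplicated addresses
dup_add <- mt_add[dup, 1]

# Negating the flag drops only the repeated copy and keeps one row per address
deduped <- mt_add[!dup, ]
```

The last line also speaks to the submitter's follow-up question: negating the flag removes one copy of each duplicate instead of both.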
|
Shafique Jamal |
Jan 28, 2014 |
Oct 10, 2014 |
PDF |
Page 95
2d paragraph 1st sentence |
"Thinking back to the previous chapter, in order to use liner regression,..."
should be 'linear'
|
donald f caldwell |
Dec 01, 2013 |
Dec 13, 2013 |
Printed |
Page 103
1st para in section "Fancying..." and subsequent |
The notation for the number of occurrences of the jth word in all emails would not be ambiguous if it were n_j ("n subscript j"), rather than n_c.
Otherwise the ratios of counts used to compute probabilities theta_j and theta_k for two words, j and k, in spam would seem to have the same denominator. Thus, it is better to write,
p(word_j,spam) = theta_j = n_jc/n_j
and
p(word_k,spam) = theta_k = n_kc/n_k
n_c seems more likely to represent the number of spam emails.
Note from the Author or Editor: This is terrible notation! Please change all the n_{j c} to n_{j s} and please also change all the n_{c} to n_{j c}.
This is for the entire section called "Fancy It Up: Laplace Smoothing"
So it should read "where n_js denotes the number of times that word appears in a spam email and n_jc denotes the number of times that word appears in any email"
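With the corrected notation, the estimate for word j is the ratio of its spam count n_js to its overall count n_jc. The R sketch below uses made-up counts, and the smoothing constants alpha and beta are illustrative assumptions rather than values from the book.

```r
# Made-up counts for three words
n_js <- c(prize = 8, meeting = 1, unseen = 0)    # occurrences in spam emails
n_jc <- c(prize = 10, meeting = 20, unseen = 0)  # occurrences in any email

# Unsmoothed estimate: share of a word's occurrences that fall in spam.
# A word that never appears gives 0/0, which R reports as NaN.
theta_raw <- n_js / n_jc

# Smoothed estimate with illustrative constants (assumed, not from the book)
alpha <- 1
beta <- 2
theta_smooth <- (n_js + alpha) / (n_jc + beta)
```

Each word now has its own denominator, which is the ambiguity the submitter pointed out, and the smoothed version stays defined even for a word with no occurrences.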
|
leif wennerberg |
May 25, 2014 |
Oct 10, 2014 |
Printed |
Page 103
para before last equation |
Delete the equal sign and right side of the equation in the first sentence. It should read, "... values theta_j is the answer...". Otherwise the question following makes no sense: the lead in has answered it.
Note from the Author or Editor: Agreed!
That sentence should start:
In other words, the vector of values θ_j is the answer...
|
leif wennerberg |
May 25, 2014 |
Oct 10, 2014 |
PDF |
Page 108
lines 17-21 |
On page 108, part of a paragraph is repeated (four lines, from "Represent each image..." to "...between 0 and 255").
|
Zdzisław Płoski |
May 06, 2014 |
Oct 10, 2014 |
ePub |
Page 119
United States |
Regarding the erratum I just submitted: it appears the problem was my own GitHub ignorance. Shift-clicking on the file doesn't have the obvious semantics, but the "download zipfile" button on the right side of the pane does. So my request is for a slight change to the text to make this clear for us cvs, sccs, svn, bitkeeper folks who didn't get with Git.
Note from the Author or Editor: GitHub README adjusted to indicate the Download Zip button.
|
Keith Bierman |
Oct 31, 2013 |
Dec 03, 2013 |
PDF |
Page 119
Line 14 from the top |
There is: "Recall that in Chapter 3".
There should be "Recall that in Chapter 4".
|
Zdzislaw Ploski |
Jul 31, 2014 |
Oct 10, 2014 |
ePub |
Page 121
United States |
The equation after "In order to model the data, you need to work with a slightly more general function that expresses the relationship between the data and a probability of a click. Start by defining:" reads simply "z".
The PDF is fine; only the ePub is affected. But this makes this part of the ePub incomprehensible.
|
Adam Merberg |
Aug 29, 2014 |
Oct 10, 2014 |
PDF |
Page 153
Line 16 from the bottom |
The expression -ln(2) begins with a hyphen. It should begin with a minus sign.
Note from the Author or Editor: I don't know the difference between a hyphen and a minus sign!
So let's go with it, what the heck.
|
Zdzislaw Ploski |
Aug 01, 2014 |
Oct 10, 2014 |
PDF |
Page 156
lines 10-9 from the bottom |
There is: "your buying it has actually changed the process, through
your market impact, and decreased the signal". Is it right? "Decreased"? Not "increased"?
Note from the Author or Editor: This sentence should read:
"But if you think about it, your buying it has actually changed the process, through your market impact, and decreased the signal you were anticipating, at least if the other market players bought it because it looked cheap to them at the previous price; you brought up the price a bit, so you might expect them to buy less in response, which means the overall signal is smaller."
|
Zdzislaw Ploski |
May 24, 2014 |
Oct 10, 2014 |
PDF |
Page 160
Line 10 from the bottom |
There is: "we solve for beta to get".
Should be: "we solve for <Greek letter 'beta'> to get" as in several other places before.
|
Zdzislaw Ploski |
May 25, 2014 |
Oct 10, 2014 |
PDF |
Page 161
Line 10 from the bottom (inside formula) |
There is: ")/". Should be: ")".
|
Zdzislaw Ploski |
May 25, 2014 |
Oct 10, 2014 |
PDF |
Page 162
line 16 from the bottom |
In the sentence "Here's some R code to look at the first 10 rows in R", the words "in R" are redundant.
Note from the Author or Editor: change to "Here's some R code to look at the first 10 rows"
|
Zdzislaw Ploski |
May 19, 2014 |
Oct 10, 2014 |
PDF |
Page 173
Line 7 from the bottom |
There is: ascii. Better notation here: ASCII.
|
Zdzislaw Ploski |
Aug 03, 2014 |
Oct 10, 2014 |
PDF |
Page 191
Line 17 from the bottom (not counting an interline) |
There is "fir" in place of the predicate in the comment. Does it mean "fire"?
|
Zdzislaw Ploski |
Aug 03, 2014 |
Oct 10, 2014 |
PDF |
Page 200
16g |
There is: "They were questions".
There should be, I think: "There were questions".
|
Zdzislaw Ploski |
Jun 03, 2014 |
Oct 10, 2014 |
PDF |
Page 201
Lines 9-10 from the top |
In the sentence "there are lines from a user to an item if that user has expressed an opinion about that item", the words "are lines" should be replaced by "is a line" (cp. Fig. 8-1).
|
Zdzislaw Ploski |
Jun 04, 2014 |
Oct 10, 2014 |
PDF |
Page 204
Lines 4 and 1 from the bottom |
The order of indexes concerning three "f" (user attributes) is inverted (cp. their order three lines above). Is it correct?
Note from the Author or Editor: Yes, the last line on page 204 should read
p_i = \beta_1 f_{i,1} + \beta_2 f_{i,2} + \beta_3 f_{i,3}.
Right now we see "f_{1,i}" instead of "f_{i,1}", for example.
|
Zdzislaw Płoski |
Aug 04, 2014 |
Oct 10, 2014 |
PDF |
Page 205
17 from the top |
There is: "the coefficients on one can be 100,000".
There should (?) be: "the coefficient on one can be 100,000".
|
Zdzislaw Ploski |
Jun 03, 2014 |
Oct 10, 2014 |
PDF |
Page 209
lines 2-1 from the bottom |
In the sentence "the age vectors of all the users will be a row in V", is the plural of "vector" correct?
Why "age vector"? Isn't age a scalar value?
Note from the Author or Editor: The parenthetical phrase at the bottom of the page should read:
so the vector of ages of all the users will be a row in V
|
Zdzislaw Ploski |
Jun 03, 2014 |
Oct 10, 2014 |
PDF |
Page 222
Line 6 from the top |
There is: "a cool example of how ideally, data science integrates".
Should be: "a cool example of how ideally data science integrates".
|
Zdzislaw Ploski |
Jun 14, 2014 |
Oct 10, 2014 |
PDF |
Page 229
Lines 5, 10 and 11 from the top |
In line 5 there is "bit.ly". In lines 10 and 11: "bitly". Suggestion: use uniform notation everywhere.
|
Zdzislaw Ploski |
Jun 14, 2014 |
Oct 10, 2014 |
PDF |
Page 245
line 8 from the bottom |
There is: "using git. Learn about git".
Better: "using Git. Learn about Git".
|
Zdzislaw Ploski |
Jun 09, 2014 |
Oct 10, 2014 |
PDF |
Page 269
Lines 19-20 from the top |
There is a typo in the surname "Kolazcyk". The correct surname is Kolaczyk.
|
Zdzislaw Ploski |
Jun 21, 2014 |
Oct 10, 2014 |
PDF |
Page 277
Line 6 from the bottom |
There is: "say on". Should be: "say in" (cp. appropriate site).
|
Zdzislaw Ploski |
Jun 24, 2014 |
Oct 10, 2014 |
Printed |
Page 287
last line |
the causal effect is 10 percentage points, not 10%.
Note from the Author or Editor: Correct, it should read "10 percentage points."
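The distinction is easy to see with numbers. The rates below are hypothetical, chosen only to illustrate the arithmetic, and are not the figures from page 287.

```r
# Hypothetical outcome rates, chosen only to illustrate the arithmetic
rate_treated <- 0.30
rate_control <- 0.20

# Absolute difference, expressed in percentage points
diff_points <- (rate_treated - rate_control) * 100

# Relative difference, expressed as a percent of the control rate
diff_percent <- (rate_treated - rate_control) / rate_control * 100
```

With these made-up rates, the absolute gap is about 10 percentage points while the relative change is about 50 percent, so the two phrasings are not interchangeable.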
|
Stephanie Eckman |
Sep 11, 2014 |
Oct 10, 2014 |
PDF |
Page 296
lines 5-7 from the top |
The quotation "After adjustment for length of use, users of oral contraceptives were at least twice the risk of clotting compared with users of other kinds of oral contraceptives" is missing at least the phrase "with desogestrel, gestodene, or drospirenone"; without it the quoted sentence is unclear. The end of the quoted sentence has also been changed (shortened) without any indication.
Note from the Author or Editor: The quote in question should be adjusted to read:
After adjustment for length of use, users of oral contraceptives with desogestrel, gestodene, or drospirenone were at least at twice the risk of venous thromboembolism compared with users of oral contraceptives with levonorgestrel.
|
Zdzislaw Ploski |
Jun 27, 2014 |
Oct 10, 2014 |
PDF |
Page 298
Lines 1-2 from the top |
The sentence "The kinds of decisions they tweaked were of the following types" reads awkwardly because "kinds" and "types" are doubled up. Perhaps "The kinds of decisions they tweaked were as follows" would be better.
|
Zdzislaw Ploski |
Jun 27, 2014 |
Oct 10, 2014 |
PDF |
Page 301
Line 4 from the bottom |
There is the word "medicare" (starting with a lowercase "m"). Is it about Medicare (cp. www.medicare.gov)?
|
Zdzislaw Ploski |
Jun 28, 2014 |
Oct 10, 2014 |
PDF |
Page 313
Line 16 from the top |
Is the word "clean" mandatory in the sentence "The best practice is to start from scratch with clean, raw data"? Isn't "clean data" an antithesis of "raw data" in the context of the book?
Note from the Author or Editor: Please replace the word "clean" with the word "unfiltered".
|
Zdzislaw Ploski |
Aug 10, 2014 |
Oct 10, 2014 |
PDF |
Page 315
Lines 6-7 from the top |
In the sentence "if the vast majority is of binary outcomes are 1" is
the word "is" mandatory?
Note from the Author or Editor: delete "is" from sentence
|
Zdzislaw Ploski |
Jun 30, 2014 |
Oct 10, 2014 |
PDF |
Page 319
Lines 15-16 from the top |
In the sentence "You'd like to save money and only send money to people who are likely to give", the second word "money" should be replaced with "letter".
Note from the Author or Editor: change sentence to "...and only send a letter to people..."
|
Zdzislaw Ploski |
Jun 30, 2014 |
Oct 10, 2014 |
PDF |
Page 323
Paragraph that starts |
Please footnote the end of that first sentence as follows:
By some estimates, one or two patients died per week in a certain
smallish town because of the lack of information flow between the
hospital’s emergency room and the nearby mental health clinic \footnote{Andrew Gelman thinks this parable is unlikely, and he wrote up a response which you can read here: http://andrewgelman.com/2014/01/24/parables-vs-data/.}.
|
Cathy O'Neil |
Sep 25, 2014 |
Oct 10, 2014 |
PDF |
Page 329
Line 17 from the bottom |
What does it mean: "to shave off nanoseconds 10^-9"? That
nanosecond equals 10^-9 of a second? (It is). 10^-9 of [one] nanosecond?? Something else?
Note from the Author or Editor: This sentence should read:
Once you get into the optimization process, you find yourself tuning MapReduce jobs to shave off nanoseconds from repetitive processes because you're dealing with petabytes of data.
|
Zdzislaw Ploski |
Jul 03, 2014 |
Oct 10, 2014 |
PDF |
Page 330
Lines 2-4 from the top |
The sentence is redundant: "a record with a person living in zip code 90210 who clicked on an ad would get emitted to (90210,{1,1}) if that person saw an ad and clicked, or (90210,{0,1}) if they saw an ad and didn't click." It says twice that the person clicked on an ad.
Note from the Author or Editor: change sentence to: "You could run MapReduce keyed by zip code so that a record with a person living in zip code 90210 would get emitted to (90210,{1,1}) if that person saw an ad and clicked, or (90210,{0,1}) if they saw an ad and didn't click."
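The keyed emission described in that sentence can be imitated on a single machine. The R sketch below is only an illustration of the map, shuffle and reduce steps under stated assumptions: the impression log is made up, and each emitted value is read as {clicked, saw an ad}, following the (90210,{1,1}) and (90210,{0,1}) examples.

```r
# Made-up impression log: one row per ad impression
impressions <- data.frame(
  zip = c("90210", "90210", "10027", "90210", "10027"),
  clicked = c(1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Map step: emit one (key, value) pair per record, value = {clicked, 1}
mapped <- lapply(seq_len(nrow(impressions)), function(i) {
  list(
    key = impressions$zip[i],
    value = c(clicks = impressions$clicked[i], views = 1)
  )
})

# Shuffle step: group the emitted values by key
keys <- vapply(mapped, function(kv) kv$key, character(1))
grouped <- split(lapply(mapped, function(kv) kv$value), keys)

# Reduce step: sum the values for each key
reduced <- lapply(grouped, function(values) Reduce(`+`, values))
```

After the reduce step, reduced[["90210"]] holds the click and view totals for that zip code, from which a click-through rate could be computed.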
|
Zdzislaw Ploski |
Jul 03, 2014 |
Oct 10, 2014 |
PDF |
Page 330
Line 13 from the top |
Is the expression ((90210], user_5321} <- {1,1} correct? Are the parentheses correct?
Note from the Author or Editor: That expression should be rewritten as:
({90210,user_5321}, {1, 1})
|
Zdzislaw Ploski |
Aug 11, 2014 |
Oct 10, 2014 |
PDF |
Page 334
Lines 16-15 from the bottom |
Something is missing from the sentence "Writing MapReduce in the Java API not pleasant". Is the predicate missing?
Note from the Author or Editor: "Writing MapReduce in the Java API is not pleasant."
|
Zdzislaw Ploski |
Jul 05, 2014 |
Oct 10, 2014 |
PDF |
Page 335
Lines 12-11 from the bottom |
There is: "Github". Should be: "GitHub".
|
Zdzislaw Ploski |
Jul 04, 2014 |
Oct 10, 2014 |
PDF |
Page 341
Line 10 from the top |
There is: "git". Should be: "Git".
|
Zdzislaw Ploski |
Jul 07, 2014 |
Oct 10, 2014 |
PDF |
Page 344
Line 19 from the top |
There is: "In addition". Should be: ". In addition".
Note from the Author or Editor: add period between equation and "In addition" as indicated
|
Zdzislaw Ploski |
Jul 07, 2014 |
Oct 10, 2014 |