Errata for Introduction to Machine Learning with Python


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious technical mistake | Minor technical mistake | Language or formatting error | Typo | Question | Note | Update

Version Location Description Submitted by Date Submitted
Printed, PDF Page 45
1st paragraph

When discussing R^2, the statement "a value of 0 corresponds to a constant model that just predicts the mean of the training set responses, y_train" is only true for reg.score(X_train, y_train). If you are calling reg.score(X_test, y_test), then y_train should be replaced by y_test. In general the statement should read "a value of 0 corresponds to a constant model that just predicts the mean of the responses, y". Thanks.

RAMZI KUTTEH  Jan 05, 2024 
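To see the point, recall that R^2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of a constant model predicting the mean of whatever y values are passed to score(). A hypothetical pure-Python sketch (not the book's code):

```python
# R^2 = 1 - SS_res / SS_tot. SS_tot is measured against the mean of the
# y values actually passed in, so a constant model predicting that mean
# scores exactly 0 -- relative to *those* y values, not y_train.

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
constant_pred = [sum(y) / len(y)] * len(y)   # predict mean(y) everywhere
print(r_squared(y, constant_pred))           # 0.0
print(r_squared(y, y))                       # 1.0: perfect predictions
```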
PDF Page 34
on the In[8] entry of jupyter notebook

In[8] loads the boston dataset, but this has been removed from sklearn.datasets, as you can see in the documentation. So I believe there should be a warning, a different dataset, or instructions for loading the boston dataset manually (the documentation describes this, but including it would keep the book self-contained).

Daniel Jimenez  Jun 08, 2023 
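For readers hitting this: the Boston housing dataset was removed in scikit-learn 1.2 for ethical reasons. One self-contained workaround (a stand-in suggestion, not the book's choice) is a regression dataset that still ships with scikit-learn and needs no download, such as the diabetes dataset:

```python
# The Boston housing dataset was removed from sklearn.datasets in
# scikit-learn 1.2. The diabetes dataset below is only a stand-in
# for following along with the regression examples.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
print("Data shape:", X.shape)     # (442, 10)
print("Target shape:", y.shape)   # (442,)
```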
PDF Page 19
bottom of the page, on entry 24 of jupyter notebook

In the code example, a line says:

grr = pd.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

but pd.scatter_matrix is deprecated; at least I wasn't able to make it work as written with Python 3 (maybe it works with Python 2). Instead I had to look it up and replace it with:

pd.plotting.scatter_matrix

and it ran well.

Daniel Jimenez  Jun 08, 2023 
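For context, scatter_matrix moved from the top-level pandas namespace to pd.plotting in pandas 0.20 and the old alias was later removed, so this is a pandas version issue rather than a Python 2 vs. 3 one. A minimal self-contained sketch (synthetic data, not the iris example):

```python
# pd.scatter_matrix now lives at pd.plotting.scatter_matrix; the
# arguments are unchanged.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(30, 3), columns=["a", "b", "c"])
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6), marker="o",
                                  hist_kwds={"bins": 10}, s=40, alpha=0.8)
print(axes.shape)  # (3, 3): one Axes per pair of columns
```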
Printed Page p. 141
top

The notebook at github.com/amueller/introduction_to_ml_with_python/blob/master/03-unsupervised-learning.ipynb gets .97, and I also got .97, but the book says .63 (?)

Mike Sweeney  Sep 12, 2022 
Printed Page 41
bottom

"On the other hand, when considering 10 neighbors, the model is too simple and performance is even worse".

The graph seems to indicate that 10 gives a better result than 1 (but optimal is about six, as stated).

Mike Sweeney  Sep 01, 2022 
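The comparison can be re-run with the book's own setup (the breast cancer data with stratify and random_state=66, as in the accompanying notebook). Exact scores vary with the installed scikit-learn version, which is worth keeping in mind when reading the sentence in question:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# compare the three settings discussed in the text: 1, ~6 (optimum), 10
scores = {}
for n_neighbors in (1, 6, 10):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    scores[n_neighbors] = clf.score(X_test, y_test)
print(scores)
```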
PDF Page 152
In[24]

Stack Exchange suggests that updates to KNeighborsClassifier in sklearn are invalidating the code. Using older versions, though, triggers other issues. Please revise!

In[24] code is:

from sklearn.neighbors import KNeighborsClassifier
# split the data in training and test set
X_train, X_test, y_train, y_test = train_test_split(
X_people, y_people, stratify=y_people, random_state=0)
# build a KNeighborsClassifier with using one neighbor:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score of 1-nn: {:.2f}".format(knn.score(X_test, y_test)))

Out[24] in the text is:
Test set score of 1-nn: 0.27

In github the results are:
Test set score of 1-nn: 0.23

In my Jupyter notebook it's a disaster:
ValueError Traceback (most recent call last)
<ipython-input-64-87c847658059> in <module>
1 from sklearn.neighbors import KNeighborsClassifier
2 # split the data in training and test set
----> 3 X_train, X_test, y_train, y_test = train_test_split(
4 X_people, y_people, stratify=y_people, random_state=0)
5 # build a KNeighborsClassifier with using one neighbor:

~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2173
2174 n_samples = _num_samples(arrays[0])
-> 2175 n_train, n_test = _validate_shuffle_split(n_samples, test_size, train_size,
2176 default_test_size=0.25)
2177

~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _validate_shuffle_split(n_samples, test_size, train_size, default_test_size)
1855
1856 if n_train == 0:
-> 1857 raise ValueError(
1858 'With n_samples={}, test_size={} and train_size={}, the '
1859 'resulting train set will be empty. Adjust any of the '

ValueError: With n_samples=0, test_size=0.25 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.


Regis O'Connor  Nov 19, 2021 
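Note that the ValueError reports n_samples=0: X_people reached train_test_split empty, which usually means fetch_lfw_people returned no images (for example, after a failed or incomplete download), rather than a problem with KNeighborsClassifier itself. A hypothetical sketch of the book's per-person masking step on synthetic data, with a guard for the empty case:

```python
import numpy as np

# stand-ins for people.target / people.data from fetch_lfw_people
rng = np.random.RandomState(0)
target = rng.randint(0, 5, size=200)   # fake person labels
data = rng.rand(200, 10)               # fake image features

# the book keeps at most 50 images per person via a boolean mask;
# if the downloaded data is empty, the mask selects nothing
mask = np.zeros(target.shape, dtype=bool)
for person in np.unique(target):
    mask[np.where(target == person)[0][:50]] = True

X_people, y_people = data[mask], target[mask]
if X_people.shape[0] == 0:
    raise RuntimeError("No faces selected; check that fetch_lfw_people "
                       "downloaded successfully before calling train_test_split.")
print(X_people.shape[0], "samples survive the mask")
```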
PDF Page 105
out[86] and last paragraph

The accuracy of the model in the text does not align with the Github results (or mine either). The conclusions drawn in the text therefore are in error.

Here is the text:
Accuracy on training set: 0.988
Accuracy on test set: 0.972
Here, increasing C allows us to improve the model significantly, resulting in 97.2%
accuracy

Here are the results from github:
Accuracy on training set: 1.000
Accuracy on test set: 0.958

This will be an awkward one to explain to my students!

Anonymous  Nov 08, 2021 
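One plausible cause of the mismatch is that SVC's default gamma changed from 'auto' to 'scale' in scikit-learn 0.22, after the book was printed. A sketch of the relevant step (the book's min/max rescaling followed by C=1000); the printed scores will depend on the installed scikit-learn version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# the book's manual min/max rescaling, fit on the training set only
min_on_training = X_train.min(axis=0)
range_on_training = (X_train - min_on_training).max(axis=0)
X_train_scaled = (X_train - min_on_training) / range_on_training
X_test_scaled = (X_test - min_on_training) / range_on_training

svc = SVC(C=1000)  # default gamma: 'auto' pre-0.22, 'scale' since
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
```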
PDF Page 103
Out[81] and In[82]

Out[81] in the text does not match the results posted on github

Here is the text:
Out[81]:
Accuracy on training set: 1.00
Accuracy on test set: 0.63
The model overfits quite substantially, with a perfect score on the training set and
only 63% accuracy on the test set

Here are the github results:
Accuracy on training set: 0.90
Accuracy on test set: 0.94


In [82] has a typo - the correction is noted in github but not the text

Text:
plt.boxplot(X_train, manage_xticks=False)


Correct code:
plt.boxplot(X_train, manage_ticks=False)


Anonymous  Nov 08, 2021 
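On the In[82] typo: matplotlib renamed the keyword manage_xticks to manage_ticks in version 3.1. A minimal sketch on synthetic stand-in data (assuming matplotlib >= 3.1):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import numpy as np
import matplotlib.pyplot as plt

X_train = np.random.RandomState(0).rand(100, 4)  # synthetic stand-in data
# 'manage_xticks' was renamed to 'manage_ticks' in matplotlib 3.1
result = plt.boxplot(X_train, manage_ticks=False)
print(len(result["boxes"]))  # 4: one box per column
```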
PDF Page 91
bottom

The results of the code do not match the text.


The code results are:
Accuracy of gbrt on training set 1.000
Accuracy of gbrt on test set 0.965


The results in the text are:

Accuracy on training set: 1.000
Accuracy on test set: 0.958

Anonymous  Nov 07, 2021 
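As with the other score mismatches, this is most likely scikit-learn version drift rather than an error in the code. A self-contained rerun of the book's setup (GradientBoostingClassifier with random_state=0 on the cancer data); the exact scores depend on the installed version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
```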
Chapter 7 (Working with Text Data)
Figure 7-6. Topic weights learned by LDA

Shouldn't Figure 7-6 match the output (first two rows of each topic) given by Out[48]?
When I run this in my Python environment, they do match.

Best
André

Anonymous  Jun 25, 2021