Errata for Introduction to Machine Learning with Python


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious technical mistake | Minor technical mistake | Language or formatting error | Typo | Question | Note | Update

Version Location Description Submitted by Date Submitted
Printed, PDF Page 45
1st paragraph

When discussing R^2, the statement "a value of 0 corresponds to a constant model that just predicts the mean of the training set responses, y_train" is only true for reg.score(X_train, y_train). If you are calling reg.score(X_test, y_test), then y_train should be replaced by y_test. In general the statement should read "a value of 0 corresponds to a constant model that just predicts the mean of the responses, y". Thanks.

RAMZI KUTTEH  Jan 05, 2024 
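To see the point, recall that R^2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of a constant model predicting the mean of whatever y values are passed to score(). A hypothetical pure-Python sketch (not the book's code):

```python
# R^2 = 1 - SS_res / SS_tot. SS_tot is measured against the mean of the
# y values actually passed in, so a constant model predicting that mean
# scores exactly 0 -- relative to *those* y values, not y_train.

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
constant_pred = [sum(y) / len(y)] * len(y)   # predict mean(y) everywhere
print(r_squared(y, constant_pred))           # 0.0
print(r_squared(y, y))                       # 1.0: perfect predictions
```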
PDF Page 34
on the In[8] entry of jupyter notebook

In[8] loads the boston dataset, but this has been removed from sklearn.datasets, as you can see in the documentation. So I believe there should be a warning, a different dataset, or instructions for loading the boston dataset manually (the documentation describes this, but including it would keep the book self-contained).

Daniel Jimenez  Jun 08, 2023 
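For readers hitting this: the Boston housing dataset was removed in scikit-learn 1.2 for ethical reasons. One self-contained workaround (a stand-in suggestion, not the book's choice) is a regression dataset that still ships with scikit-learn and needs no download, such as the diabetes dataset:

```python
# The Boston housing dataset was removed from sklearn.datasets in
# scikit-learn 1.2. The diabetes dataset below is only a stand-in
# for following along with the regression examples.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
print("Data shape:", X.shape)     # (442, 10)
print("Target shape:", y.shape)   # (442,)
```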
PDF Page 19
bottom of the page, on entry 24 of jupyter notebook

In the code example, a line says:

grr = pd.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

but pd.scatter_matrix is deprecated; at least I wasn't able to make it work as written with Python 3 (maybe it works with Python 2). Instead I had to look it up and replace it with:

pd.plotting.scatter_matrix

and it ran well.

Daniel Jimenez  Jun 08, 2023 
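For context, scatter_matrix moved from the top-level pandas namespace to pd.plotting in pandas 0.20 and the old alias was later removed, so this is a pandas version issue rather than a Python 2 vs. 3 one. A minimal self-contained sketch (synthetic data, not the iris example):

```python
# pd.scatter_matrix now lives at pd.plotting.scatter_matrix; the
# arguments are unchanged.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(30, 3), columns=["a", "b", "c"])
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6), marker="o",
                                  hist_kwds={"bins": 10}, s=40, alpha=0.8)
print(axes.shape)  # (3, 3): one Axes per pair of columns
```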
Printed Page p. 141
top

The notebook at github.com/amueller/introduction_to_ml_with_python/blob/master/03-unsupervised-learning.ipynb gets .97, and I also got .97, but the book says .63 (?)

Mike Sweeney  Sep 12, 2022 
Printed Page 41
bottom

"On the other hand, when considering 10 neighbors, the model is too simple and performance is even worse".

The graph seems to indicate that 10 gives a better result than 1 (but optimal is about six, as stated).

Mike Sweeney  Sep 01, 2022 
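The comparison can be re-run with the book's own setup (the breast cancer data with stratify and random_state=66, as in the accompanying notebook). Exact scores vary with the installed scikit-learn version, which is worth keeping in mind when reading the sentence in question:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# compare the three settings discussed in the text: 1, ~6 (optimum), 10
scores = {}
for n_neighbors in (1, 6, 10):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    scores[n_neighbors] = clf.score(X_test, y_test)
print(scores)
```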
PDF Page 152
In[24]

Stack Exchange suggests that updates to KNeighborsClassifier in sklearn are invalidating the code. Using older versions, though, triggers other issues. Please revise!

In[24] code is:

from sklearn.neighbors import KNeighborsClassifier
# split the data in training and test set
X_train, X_test, y_train, y_test = train_test_split(
X_people, y_people, stratify=y_people, random_state=0)
# build a KNeighborsClassifier with using one neighbor:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score of 1-nn: {:.2f}".format(knn.score(X_test, y_test)))

Out[24] in the text is:
Test set score of 1-nn: 0.27

In github the results are:
Test set score of 1-nn: 0.23

In my Jupyter notebook it's a disaster:
ValueError Traceback (most recent call last)
<ipython-input-64-87c847658059> in <module>
1 from sklearn.neighbors import KNeighborsClassifier
2 # split the data in training and test set
----> 3 X_train, X_test, y_train, y_test = train_test_split(
4 X_people, y_people, stratify=y_people, random_state=0)
5 # build a KNeighborsClassifier with using one neighbor:

~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2173
2174 n_samples = _num_samples(arrays[0])
-> 2175 n_train, n_test = _validate_shuffle_split(n_samples, test_size, train_size,
2176 default_test_size=0.25)
2177

~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _validate_shuffle_split(n_samples, test_size, train_size, default_test_size)
1855
1856 if n_train == 0:
-> 1857 raise ValueError(
1858 'With n_samples={}, test_size={} and train_size={}, the '
1859 'resulting train set will be empty. Adjust any of the '

ValueError: With n_samples=0, test_size=0.25 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.


Regis O'Connor  Nov 19, 2021 
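Note that the ValueError reports n_samples=0: X_people reached train_test_split empty, which usually means fetch_lfw_people returned no images (for example, after a failed or incomplete download), rather than a problem with KNeighborsClassifier itself. A hypothetical sketch of the book's per-person masking step on synthetic data, with a guard for the empty case:

```python
import numpy as np

# stand-ins for people.target / people.data from fetch_lfw_people
rng = np.random.RandomState(0)
target = rng.randint(0, 5, size=200)   # fake person labels
data = rng.rand(200, 10)               # fake image features

# the book keeps at most 50 images per person via a boolean mask;
# if the downloaded data is empty, the mask selects nothing
mask = np.zeros(target.shape, dtype=bool)
for person in np.unique(target):
    mask[np.where(target == person)[0][:50]] = True

X_people, y_people = data[mask], target[mask]
if X_people.shape[0] == 0:
    raise RuntimeError("No faces selected; check that fetch_lfw_people "
                       "downloaded successfully before calling train_test_split.")
print(X_people.shape[0], "samples survive the mask")
```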
PDF Page 105
out[86] and last paragraph

The accuracy of the model in the text does not align with the Github results (or mine either). The conclusions drawn in the text therefore are in error.

Here is the text:
Accuracy on training set: 0.988
Accuracy on test set: 0.972
Here, increasing C allows us to improve the model significantly, resulting in 97.2%
accuracy

Here are the results from github:
Accuracy on training set: 1.000
Accuracy on test set: 0.958

This will be an awkward one to explain to my students!

Anonymous  Nov 08, 2021 
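One plausible cause of the mismatch is that SVC's default gamma changed from 'auto' to 'scale' in scikit-learn 0.22, after the book was printed. A sketch of the relevant step (the book's min/max rescaling followed by C=1000); the printed scores will depend on the installed scikit-learn version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# the book's manual min/max rescaling, fit on the training set only
min_on_training = X_train.min(axis=0)
range_on_training = (X_train - min_on_training).max(axis=0)
X_train_scaled = (X_train - min_on_training) / range_on_training
X_test_scaled = (X_test - min_on_training) / range_on_training

svc = SVC(C=1000)  # default gamma: 'auto' pre-0.22, 'scale' since
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
```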
PDF Page 103
Out[81] and In[82]

Out[81] in the text does not match the results posted on github

Here is the text:
Out[81]:
Accuracy on training set: 1.00
Accuracy on test set: 0.63
The model overfits quite substantially, with a perfect score on the training set and
only 63% accuracy on the test set

Here are the github results:
Accuracy on training set: 0.90
Accuracy on test set: 0.94


In [82] has a typo - the correction is noted in github but not the text

Text:
plt.boxplot(X_train, manage_xticks=False)


Correct code:
plt.boxplot(X_train, manage_ticks=False)


Anonymous  Nov 08, 2021 
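On the In[82] typo: matplotlib renamed the keyword manage_xticks to manage_ticks in version 3.1. A minimal sketch on synthetic stand-in data (assuming matplotlib >= 3.1):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import numpy as np
import matplotlib.pyplot as plt

X_train = np.random.RandomState(0).rand(100, 4)  # synthetic stand-in data
# 'manage_xticks' was renamed to 'manage_ticks' in matplotlib 3.1
result = plt.boxplot(X_train, manage_ticks=False)
print(len(result["boxes"]))  # 4: one box per column
```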
PDF Page 91
bottom

The results of the code do not match the text.


The code results are:
Accuracy of gbrt on training set 1.000
Accuracy of gbrt on test set 0.965


The results in the text are:

Accuracy on training set: 1.000
Accuracy on test set: 0.958

Anonymous  Nov 07, 2021 
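As with the other score mismatches, this is most likely scikit-learn version drift rather than an error in the code. A self-contained rerun of the book's setup (GradientBoostingClassifier with random_state=0 on the cancer data); the exact scores depend on the installed version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
```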
Chapter 7 (Working with Text Data)
Figure 7-6. Topic weights learned by LDA

Shouldn't Figure 7-6 match the output (first two rows of each topic) given by Out[48]?
When I run this in my Python environment, they do match.

Best
André

Anonymous  Jun 25, 2021