Errata

Python for Data Analysis

Errata for Python for Data Analysis, Third Edition


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version Location Description Submitted by Date submitted
Other Digital Version Preface
Using Code Examples

Words wrong way around on wesmckinney.com

You can data find files

should be

You can find data files

Steven Mooney  Feb 16, 2024 
Printed, ePub Page Section 3.1, page 59
1st paragraph

The example:
In [118]: hash("string")
Out [118]: 3634226001988967898

However, when I ran it, I got inconsistent results from the hash function.
Below are the results of running the function 4 consecutive times:
-783493489962912440
-2593540438211823544
5958934601557521611
1519405966352344185

Thus this function could not be used to verify that the object "string" could be used as a dictionary key.

I am using a 2021 iMac with an Apple M1 chip, 16 GB memory, and macOS Sonoma 14.2.1

I am using PyCharm 2023.3.3 (Community Edition)
Build #PC-233.13763.11, built on January 25, 2024
Runtime version: 17.0.9+7-b1087.11 aarch64
VM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 14.2.1
GC: G1 Young Generation, G1 Old Generation
Memory: 2048M
Cores: 8
Metal Rendering is ON
Registry:
ide.experimental.ui=true
Non-Bundled Plugins:
com.jetbrains.edu (2024.1-2023.3-882)

Patrick Salkeld  Feb 16, 2024 
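Note on the hash("string") report above: CPython randomizes string hashes for each interpreter process unless the PYTHONHASHSEED environment variable is set, so a different integer on each run is expected and does not by itself show that the object is unhashable. The sketch below is one way to check both points. The helper names is_hashable and hash_in_fresh_process are illustrative and do not come from the book.

```python
import os
import subprocess
import sys


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


def hash_in_fresh_process(seed):
    """Run hash("string") in a new interpreter with PYTHONHASHSEED set to seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", 'print(hash("string"))'],
        capture_output=True, text=True, check=True, env=env,
    )
    return int(result.stdout)


print(is_hashable("string"))   # strings are immutable, so this is True
print(is_hashable([1, 2, 3]))  # lists are mutable, so this is False

# Within a single process the value does not change between calls.
print(hash("string") == hash("string"))
```

What matters for use as a dictionary key is that hash() does not raise, not that the integer matches the one printed in the book.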
Other Digital Version Section 5.2; Indexing, Selection, and Filtering
Selection on DataFrame with loc and iloc

The word rows is misspelled as "roles".

The result of selecting a single row is a Series with an index that contains the DataFrame's column labels. To select multiple roles, creating a new DataFrame, pass a sequence of labels:

Andrei  Feb 17, 2024 
Other Digital Version Generator expressions
3rd code listing

Syntax typo in the statement `dict((i, i **2) for i inrange(5))`:
there should be a space between `in` and `range`.

Ben To  Feb 19, 2024 
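Note: for reference, the statement with the space restored runs as follows. This is a minimal check written for this list, not a copy of the book's listing.

```python
# Build a dict from a generator expression that yields (key, value) pairs.
squares = dict((i, i ** 2) for i in range(5))
print(squares)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```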
Other Digital Version Set
hashable set elements part

A space is missing before the **first parenthesis** in the sentence "set elements generally must be immutable, and they must be hashable(which means that calling hash on a value does not raise an exception)."

Ben To  Feb 19, 2024 
Printed, ePub Page Page 98, Section 4.1
First example, first 3 paragraphs

When I tried to duplicate this example:
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = ([[4, 7], [0,2], [-5, 6], [0, 0],[1, 2], [-12, -4], [3, 4]])
names == "Bob"
data[names == "Bob"]

I got this error:
Traceback (most recent call last):
File "/Volumes/Extreme SSD/Python Data Analysis/Python3_for_Data_Analysis/main.py", line 550, in <module>
data[names == "Bob"]
~~~~^^^^^^^^^^^^^^^^
TypeError: only integer scalar arrays can be converted to a scalar index

This contradicts the subsequent text which states:
"...You can even mix and match Boolean arrays with slices or integers (or sequences of integers; more on this later)."

Patrick Salkeld  Feb 19, 2024 
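Note: one possible explanation for the TypeError reported above, offered as an assumption because the submitter's full script is not shown, is that data in the snippet as typed is a plain Python list rather than a NumPy array. The sketch below uses the same values and shows both cases; the names data_list and mask are illustrative.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
rows = [[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]]

data_list = rows            # plain Python list, as typed in the report
data = np.array(rows)       # NumPy array, which supports Boolean indexing

mask = names == "Bob"       # True at positions 0 and 3

# Indexing a list with a Boolean array fails, matching the reported error type.
try:
    data_list[mask]
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

selected = data[mask]       # rows of data where names == "Bob"
print(list_indexing_failed)
print(selected)
```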
Other Digital Version Chapter 4 - Data Types for ndarrays
Second note box

Where the online text says "A signed integer can represent both positive and negative integers, while an unsigned integer can only represent nonzero integers", the phrase "nonzero integers" should be "non-negative integers".

Ben To  Mar 04, 2024 
O'Reilly learning platform Page Chapter 10.x
Throughout the chapter

Chapter 10 uses DataFrame.groupby(...,axis="columns") on several occasions, which is deprecated.

Jochen Schüttler  Apr 09, 2024 
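Note: a common workaround for the deprecated axis="columns" argument is to transpose, group the rows, and transpose back. The sketch below assumes a small made-up frame and column mapping (neither is from Chapter 10) and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical data: three columns mapped onto two groups.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
mapping = {"a": "red", "b": "red", "c": "blue"}

# Group columns by transposing first, so no axis argument is needed.
by_column = df.T.groupby(mapping).sum().T
print(by_column)
```

Transposing copies the data and can change dtypes when columns have mixed types, so this is a sketch of the idea rather than a drop-in replacement for every case.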
Other Digital Version Chapter 4, Section "Data Types for ndarrays"
The second Note (after Table 4.2)

Text:
"A signed integer can represent both positive and negative integers, while an unsigned integer can only represent nonzero integers."

Suggestion:
"A signed integer can represent both positive and negative integers, while an unsigned integer can only represent non-negative integers, including zero."

Alessandro Botelho Bovo  Jun 06, 2024 
Other Digital Version Chapter 2, section "Numeric types"
3rd paragraph

It says:
"Integer division not resulting in a whole number will always yield a floating-point number"

Suggestion:
"Integer division will always yield a floating-point number"

Alessandro Botelho Bovo  Jun 06, 2024 
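Note: the behavior behind this suggestion can be checked directly. In Python 3 the / operator returns a float even when the quotient is a whole number, while // keeps integer operands as integers.

```python
whole_quotient = 4 / 2    # true division always gives a float
fractional = 3 / 2        # true division with a non-whole result
floored = 4 // 2          # floor division of two ints gives an int

print(whole_quotient, type(whole_quotient).__name__)  # 2.0 float
print(fractional, type(fractional).__name__)          # 1.5 float
print(floored, type(floored).__name__)                # 2 int
```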
Other Digital Version Chapter 4, Section "Unique and Other Set Logic"
1st paragraph

It says: "NumPy has some basic set operations for one-dimensional ndarrays. A commonly used one is numpy.unique, which returns the sorted unique values in an array:"

The sentence might imply that `numpy.unique` only works for one-dimensional arrays, which is not true. The `numpy.unique` function also works for n-dimensional arrays, although by default it flattens the array to one dimension before finding the unique values.

Alessandro Botelho Bovo  Jun 11, 2024 
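Note: the point about n-dimensional input can be illustrated with a small made-up array. By default numpy.unique flattens its input; passing axis=0 instead returns the unique rows.

```python
import numpy as np

grid = np.array([[3, 1, 2],
                 [3, 1, 2],
                 [2, 2, 4]])

flat_unique = np.unique(grid)           # flattened, sorted unique values
row_unique = np.unique(grid, axis=0)    # unique rows, sorted lexicographically

print(flat_unique)  # [1 2 3 4]
print(row_unique)
```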
ePub Page Chapter 3, List
Discussion regarding "Extend"

Document at learning.oreilly.com.

In the discussion of "Extend", the text compares extend to "+" with adding a multi-element list in _one_ move to another multi-element list.

However, when discussing performance, the text describes adding the multi-element list in _n_ moves where _n_ is the length of the list being added, using a for loop. There seems to be little point to using either "extend" or "+" to add one element at a time to a list. One might as well use "append", it would make the code easier to understand.

Steven O. Ellis  Jul 07, 2024 
O'Reilly learning platform Page Chapter 2
Tab Completion

"Also, you can also complete methods and attributes on any object after typing a period:" double use of 'also'

Anonymous  Sep 05, 2024 
ePub Page https://wesmckinney.com/book/data-analysis-examples#whetting_movielens
In [98]: movies["genre"] = movies.pop("genres").str.split("|")

In [98]: movies["genre"] = movies.pop("genres").str.split("|")

should be movies["genres"] = movies.pop("genre").str.split("|")

Anonymous  Sep 11, 2024 
Other Digital Version Creating ndarrays
Fifth paragraph.

In [31]: np.empty((2, 3, 2))
Out[31]:
array([[[0., 0.],
[0., 0.],
[0., 0.]],
[[0., 0.],
[0., 0.],
[0., 0.]]])

The np.empty function does not initialize the array's values, so the values it shows are arbitrary and not necessarily zeros. The output shown is more consistent with np.zeros.

In [46]: np.empty((2, 3, 2))
Out[46]:
array([[[4.67296746e-307, 1.69121096e-306],
[1.78020984e-306, 1.55762979e-307],
[1.78022342e-306, 8.06635958e-308]],

[[1.86921415e-306, 1.00132737e-307],
[1.33508506e-307, 9.45701377e-308],
[1.11257937e-307, 2.00755374e-317]]])

Gerald Juárez  Sep 14, 2024 
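Note: the difference described above can be sketched as follows. No values from np.empty are printed here because they depend on whatever happens to be in memory, which is why both the book's zeros and the submitter's tiny numbers are possible outputs.

```python
import numpy as np

uninitialized = np.empty((2, 3, 2))  # allocated only; contents are arbitrary
zeros = np.zeros((2, 3, 2))          # allocated and filled with 0.0

# Shape and dtype are guaranteed; the contents of `uninitialized` are not.
print(uninitialized.shape, uninitialized.dtype)
print(zeros.any())  # False: every element is exactly zero
```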
Other Digital Version Section 11.1
Table 11.2

In the “Open Access” HTML version, Table 11.2: datetime format specification:
It says:
"%j - Day of the year as a zero-padded integer (from 001 to 336)"
According to the official Python documentation (and common sense), the value range should be "from 001 to 366".

Jihang Tang  Sep 26, 2024 
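Note: the %j range can be confirmed with the standard library, using a leap year so that day 366 exists.

```python
from datetime import datetime

first_day = datetime(2024, 1, 1).strftime("%j")    # first day of the year
last_day = datetime(2024, 12, 31).strftime("%j")   # 2024 is a leap year

# Parsing works in the other direction as well.
parsed = datetime.strptime("2024-366", "%Y-%j")

print(first_day, last_day)  # 001 366
print(parsed.date())        # 2024-12-31
```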
ePub Page Using Code Examples
First sentence.

The first sentence begins with "You can data find files", I assume it should be "You can find data files".

Adel Siddiquei  Oct 15, 2024 
Other Digital Version Chapter 3, Section 3.1
6th Subtopic

In the provided example, the description states that strings with a length of 2 or less should be filtered out. However, the code filters out strings where the length is greater than 2 (if len(x) > 2). This is inconsistent with the intended explanation.

Correction:

To correct this, either the description should state that strings with a length greater than 2 are included, or the code should be modified to reflect the original intention of filtering out strings with a length of 2 or less.

Here’s the corrected code if the description is to remain unchanged:

[x.upper() for x in strings if len(x) <= 2]

This will ensure that only strings with a length of 2 or less are included and converted to uppercase, aligning with the description.

Syed Mohammad Hasan  Oct 22, 2024 
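Note: to make the two readings concrete, the sketch below runs both conditions on a sample list (the list is an assumption chosen for illustration). The condition in a list comprehension selects what is kept, so len(x) > 2 removes the strings of length 2 or less, while len(x) <= 2 keeps only those.

```python
strings = ["a", "as", "bat", "car", "dove", "python"]

# Keeps strings longer than 2 characters, i.e. drops those of length 2 or less.
longer_than_two = [x.upper() for x in strings if len(x) > 2]

# Keeps only strings of length 2 or less.
two_or_less = [x.upper() for x in strings if len(x) <= 2]

print(longer_than_two)  # ['BAT', 'CAR', 'DOVE', 'PYTHON']
print(two_or_less)      # ['A', 'AS']
```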
ePub Page 1 Preliminaries
Installing Necessary Packages

Sorry, I don't have a massive tech background. Is there something different about python 3.12.2? Or are there permission issues that I need to get around?

Latest is python 3.12.2. Got to the step where I'm running: "(base) $ conda config --set channel_priority strict"

I get this in return:
"Error while loading conda entry point: conda-libmamba-solver"

Reason: "/miniconda3/lib/libarchive.19.dylib' (no such file)"

Anonymous  Aug 21, 2024 
Other Digital Version 1.4 Installation and Setup
Installing Necessary Packages

On Windows, substitute a carat ^ for the line continuation \ used on Linux and macOS.
"carat" should be "caret", right?

Anonymous  May 15, 2024 
ePub Page 3.1, List
Discussion of "Extend"

Please disregard the errata I just submitted. I missed that the example was a list of lists. The text makes perfect sense.

Steven O. Ellis  Jul 07, 2024 
O'Reilly learning platform Page 4 NumPy Basics: Arrays and Vectorized Computation
Data Types for ndarrays

In [45]: numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.string_)

`np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.

Dmitry  Aug 27, 2024 
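Note: a version of the example using np.bytes_, as suggested above, might look like this. np.bytes_ is also available in NumPy 1.x, so the same line should run on both major versions.

```python
import numpy as np

# Same values as the book's example, stored as fixed-width byte strings.
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.bytes_)

# Convert the byte strings to floating-point numbers.
floats = numeric_strings.astype(float)

print(numeric_strings.dtype.kind)  # S (byte string)
print(floats)
```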
Other Digital Version 4.2 Pseudorandom Number Generation
Table 4.3: NumPy random number generator methods

duplicate `uniform` function listed in the table

Ben To  Mar 09, 2024 
O'Reilly learning platform Page 4.2 Pseudorandom Number Generation
Table 4.3: NumPy random number generator methods

In Table 4.3, “uniform” distribution is repeated in the third and last row.

Gao Lu  Oct 01, 2024 
Other Digital Version 4.4 Array-Oriented Programming with Arrays
first code listing

In [169]: points = np.arange(-5, 5, 0.01) # 100 equally spaced points

But this results in "1000" points.

Ben To  Mar 11, 2024 
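Note: the count can be checked directly. The sketch also shows np.linspace, which takes the number of points as an argument and so avoids having to derive it from the step; using it here is a suggestion, not something the book does.

```python
import numpy as np

points = np.arange(-5, 5, 0.01)
print(points.size)  # 1000

# Same half-open interval with the point count stated explicitly.
explicit = np.linspace(-5, 5, 1000, endpoint=False)
print(explicit.size)  # 1000
```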
Other Digital Version 4.6 Linear Algebra
4th code example

The qr function in the import statement is never used.

from numpy.linalg import inv, qr

Doug Richardson  Aug 15, 2024 
ePub Page 5 Indexing, Selection and Filtering
Using Code Examples

In the following sentence, should 'columns' be changed to 'rows'? When I test this, it prints 2 rows and all the columns.

The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.


Steven Mooney  Feb 21, 2024 
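Note: the two sentences quoted above describe different inputs to the [] operator, which the sketch below separates. A slice such as data[:2] selects rows, while a single label or a list of labels selects columns. The sample frame is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

first_two_rows = data[:2]            # slice: selects rows
one_column = data["two"]             # single label: selects a column (Series)
two_columns = data[["three", "one"]] # list of labels: selects columns

print(first_two_rows.shape)  # (2, 4)
print(two_columns.shape)     # (4, 2)
```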
ePub Page 7.1.1 Filtering Out Missing Data
6th Paragraph and [38]

"Suppose you want to keep only rows containing at most a certain number of missing observations. You can indicate this with the thresh argument:"

The thresh argument to pandas.DataFrame.dropna() does not govern how many NA values are allowed.
Instead, it sets the minimum number of non-NA values a row must contain to be kept.

Anonymous  May 07, 2024 
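Note: the behavior of thresh described above can be seen on a small made-up frame. Rows are kept when they contain at least thresh non-NA values, regardless of how many NA values they contain.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Non-NA counts per row are 2, 1, 0, 3; thresh=2 keeps rows with at least 2.
kept = df.dropna(thresh=2)
print(kept)
```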
Other Digital Version 9 Plotting and Visualization
Figure 9.27: Tipping percentage by day split by time/smoker

The code to generate figure 9.27 does not match the generated figure, as the generated figure has a hue to the bars (indicating the day) which is missing from:

In [113]: sns.catplot(x="day", y="tip_pct", row="time",
.....: col="smoker",
.....: kind="bar", data=tips[tips.tip_pct < 1])

This can be corrected with:

In [113]: sns.catplot(x="day", y="tip_pct", row="time",
.....: col="smoker", hue="day",
.....: kind="bar", data=tips[tips.tip_pct < 1])

Doug Richardson  Aug 19, 2024 
Other Digital Version 9 Plotting and Visualization
Figure 9.28: Box plot of tipping percentage by day

Figure 9.28 box plots have hues in the image, but the code to generate them does not match.

In [114]: sns.catplot(x="tip_pct", y="day", kind="box",
.....: data=tips[tips.tip_pct < 0.5])

should be

In [114]: sns.catplot(x="tip_pct", y="day", kind="box", hue="day",
.....: data=tips[tips.tip_pct < 0.5])

To match figure 9.28.

Doug Richardson  Aug 19, 2024 
O'Reilly learning platform Page 10.2
6th code box, In [72]

The code example is "grouped_pct.agg([("average", "mean"), ("stdev", np.std)])". It triggers a FutureWarning recommending "grouped_pct.agg([("average", "mean"), ("stdev", "std")])" instead.

Jochen Schüttler  Apr 09, 2024 
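Note: the string-based form recommended by the warning can be tried on a small stand-in for the tips data. The values below are made up, and only the column names day and tip_pct follow the book.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25],
})
grouped_pct = tips.groupby("day")["tip_pct"]

# (name, function) tuples, with both functions given by name.
summary = grouped_pct.agg([("average", "mean"), ("stdev", "std")])
print(summary)
```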
Other Digital Version 13.3 US Baby Names In[116]
China edition, page 415

Based on the preceding code block (def~~),
the table shown in Out[116] for In[116]: names may be wrong.
It should be:
                       name sex  births  year      prop
year sex
1880 F   0             Mary   F    7065  1880  0.077643
         1             Anna   F    2604  1880  0.028618
         2             Emma   F    2003  1880  0.022013
         3        Elizabeth   F    1939  1880  0.021309
         4           Minnie   F    1746  1880  0.019188
...                     ...  ..     ...   ...       ...
2010 M   1690779    Zymaire   M       5  2010  0.000003
         1690780     Zyonne   M       5  2010  0.000003
         1690781  Zyquarius   M       5  2010  0.000003
         1690782      Zyran   M       5  2010  0.000003
         1690783      Zzyzx   M       5  2010  0.000003

Zhang yingtan  Mar 19, 2024 
PDF Page 135
4 & 6

"If a DataFrame’s index and columns have their name attributes set, these will also be displayed:"

Next sentence says: "Unlike Series, DataFrame does not have a name attribute."

One sentence (par. 4) refers to the DataFrame's index and columns as having their name attributes "set", while the next sentence states that a DataFrame "does NOT have a name attribute".

This creates confusion.

Emile Jacques Bosman  May 01, 2024 
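Note: the two quoted sentences refer to different objects, which the sketch below separates. The index and the columns of a DataFrame each have a name attribute, while the DataFrame itself has none, unlike a Series. The sample data is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Ohio": [1.5, 1.7], "Nevada": [2.4, 2.9]},
    index=[2001, 2002],
)
frame.index.name = "year"      # name of the row index
frame.columns.name = "state"   # name of the column index

series = pd.Series([1, 2], name="population")

print(frame.index.name, frame.columns.name)  # year state
print(series.name)                           # population
print(hasattr(frame, "name"))                # False
```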
Printed, ePub Page 147
3rd paragraph

The second sentence in the following text has the word "roles" rather than "rows":
The result of selecting a single row is a Series with an index that contains the DataFrame's column labels. To select multiple roles, creating a new DataFrame, pass a sequence of labels:

Anonymous  Jul 31, 2024 
Printed, ePub Page 159
1st paragraph

The paragraph starts with "Here the function f, which…". Since the example function is named "f1", the paragraph should start with "Here the function f1, which…"

Anonymous  Jul 31, 2024 
Printed, ePub Page 166
3rd paragraph

"When an entire row or column contains all NA values, the sum is 0, whereas if any value is not NA, then the result is NA. "

This sentence should be: "When an entire row or column contains all NA values, the sum is 0, whereas if any value is not NA, then the result includes the value(s) not NA."

df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

df.sum(axis="columns")
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64

df.sum(axis="columns", skipna=False)
a NaN
b 2.60
c NaN
d -0.55
dtype: float64

Anonymous  Jul 31, 2024 
Printed Page 169
In[285]

In[283] and In[285] look exactly the same, even though the line above says that a more concise syntax could be used.

Jude Cancellieri  Mar 09, 2024 
Printed, ePub Page 210, Section 7.2
2nd paragraph

The sentence is:
Relatedly, drop_duplicates returns a DataFrame with rows where the duplicated array is False filtered out:

The sentence should be:
Relatedly, drop_duplicates returns a DataFrame with rows where the duplicated array is True filtered out:

Anonymous  Aug 26, 2024 
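Note: the proposed wording can be checked by comparing drop_duplicates with a manual filter on the duplicated mask. The frame below is a small example built for this check.

```python
import pandas as pd

data = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one", "two", "two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

mask = data.duplicated()          # True where a row repeats an earlier row
deduplicated = data.drop_duplicates()

# drop_duplicates keeps the rows where the mask is False,
# i.e. it filters out the rows where the mask is True.
print(mask.tolist())
print(deduplicated.equals(data[~mask]))  # True
```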
Printed, ePub Page 273
last paragraph, following subtitle 'Pivoting "long" to "Wide" Format'

In the sentence "In this format, individual values are represented by a single row in a table rather than multiple values per row.", the text starting with "by" should be: "by a single column in a table rather than multiple values (i.e. columns) per row."

Anonymous  Aug 29, 2024