Errata

Python for Data Analysis

Errata for Python for Data Analysis, Third Edition


The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version Location Description Submitted By Date submitted Date corrected
Page Preface - Acknowledgments
Acknowledgments for Third Edition (2022)

Minor typo: "Programmer" is misspelled.

It has more than a decade since I started writing the first edition of this book and more than 15 years since I originally started my journey as a Python prorammer.

Note from the Author or Editor:
I am fixing the typo

Andy Jessen  Sep 16, 2022 
Page Preface
first paragraph

In the first sentence of the preface here:
wesmckinney.com/book/preface.html

it says:
"This Open Access web version of Python for Data Analysis 3rd Edition is now available as a companion to the print and digital editions. If you encounter any errata, please report them here."
The URL for error reporting is: www.oreilly.com/catalog/errata.csp?isbn=0636920023784

That is the wrong URL. The correct URL is: oreilly.com/catalog/0636920519829/errata

Note from the Author or Editor:
will fix

Anonymous  Sep 25, 2022 
Page Ch 10, Data Aggregation and Group Operations
10.3 Quantile and Bucket Analysis

Error:

As you may recall from Ch 8: Data Wrangling: Join, Combine, and Reshape, pandas has some tools, in particular pandas.cut and pandas.qcut, for slicing data up into buckets with bins of your choosing, or by sample quantiles.

Correct:
As you may recall from Ch 7: Data Cleaning and Preparation, pandas has some tools, in particular pandas.cut and pandas.qcut, for slicing data up into buckets with bins of your choosing, or by sample quantiles.

Reason:
pandas.cut and pandas.qcut are discussed in Ch 7 Section 2, Discretization and Binning.

Note from the Author or Editor:
will fix

Young Tan  Sep 26, 2022 
Page "Reading Text Files in Pieces" in 6.1
4th paragraph

"The elipsis marks" should be "The ellipsis marks".

Note from the Author or Editor:
will fix

Noritada Kobayashi  Oct 10, 2022 
Page "A.6 More About Sorting" in Appendix A
2nd code block (program list)

The randomly generated array (as below) is inappropriate as an example, as the first column is in ascending order from the beginning. Therefore, although we want only the first column to be sorted, there is no change in the array before and after sorting, which makes it difficult to convey the intent.

It would be an appropriate example if it were generated with other parameters.

In [166]: arr = rng.standard_normal((3, 5))

In [167]: arr
Out[167]:
array([[-1.1956,  0.4691, -0.3598,  1.0359,  0.2267],
       [-0.7448, -0.5931, -1.055 , -0.0683,  0.458 ],
       [-0.07  ,  0.1462, -0.9944,  1.1436,  0.5026]])

In [168]: arr[:, 0].sort() # Sort first column values in place

In [169]: arr
Out[169]:
array([[-1.1956,  0.4691, -0.3598,  1.0359,  0.2267],
       [-0.7448, -0.5931, -1.055 , -0.0683,  0.458 ],
       [-0.07  ,  0.1462, -0.9944,  1.1436,  0.5026]])

Note from the Author or Editor:
I will improve the example to be more robust to random number generation

Noritada Kobayashi  Oct 30, 2022 
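
A deterministic variant can sidestep the problem described in the entry above. The sketch below is an illustration only, not the book's revised example: it uses hand-picked values (an assumption) so that the first column starts out unsorted and the effect of the in-place sort is visible.

```python
import numpy as np

# Hand-picked (hypothetical) values so the first column is NOT already sorted.
arr = np.array([[ 0.5, 1.0, 2.0],
                [-1.2, 3.0, 4.0],
                [ 0.1, 5.0, 6.0]])
before = arr.copy()

# arr[:, 0] is a view on the first column, so sorting it modifies arr in place.
arr[:, 0].sort()

print(arr)
```

Only the first column changes; the other columns keep their original order, which is why a pre-sorted first column hides the effect.
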
Page "Regular Expressions" in 7.4
1st paragraph in p.231

Original:
the match object can only tell us the start and end position of the pattern in the string:

Suggestion for improvement:
the match object can tell us the start and end position of the pattern in the string:

Reason:
As the code block that follows indicates, the string representation of the match object includes information about the matched substring in addition to the start and end positions:

Out[174]: <re.Match object; span=(5, 20), match='dave@google.com'>

Note from the Author or Editor:
will fix

Noritada Kobayashi  Nov 06, 2022 
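
The point of the entry above can be checked with a short sketch. The sample text and the email-style pattern below are assumptions chosen so that the span matches the quoted output; the match object exposes both the positions and the matched substring.

```python
import re

# Email-style pattern and sample text are illustrative assumptions.
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
regex = re.compile(pattern, flags=re.IGNORECASE)

text = "Dave dave@google.com"
m = regex.search(text)

print(m.span())   # start and end position of the match
print(m.group())  # the matched substring itself
```
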
Page "String Functions in pandas" in 7.4
Table 7-6

Error:
Equivalent to built-in str.alnum

Correct:
Equivalent to built-in str.isalnum

Reason:
See the online documentation of Python.

Note from the Author or Editor:
will fix

Noritada Kobayashi  Nov 12, 2022 
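
As a quick check of the corrected table entry above, the following sketch (with made-up sample strings) compares the pandas string method with the built-in str.isalnum applied element by element.

```python
import pandas as pd

# Hypothetical sample strings.
s = pd.Series(["abc123", "abc 123", "", "ABC"])

# Series.str.isalnum applies the same test as the built-in str.isalnum element-wise.
result = s.str.isalnum()
expected = [x.isalnum() for x in s]

print(result.tolist())
```
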
Page Table 6-2, page 257
Argument: skip_footer

In pandas.read_csv(), the argument "skip_footer" has been deprecated.

It's now "skipfooter".

Note from the Author or Editor:
will fix

Anonymous  Nov 26, 2022 
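
A minimal sketch of the current spelling follows, assuming a small in-memory CSV with two footer lines (the data is made up for illustration). Note that skipfooter is handled by the Python parsing engine, so engine="python" is passed explicitly.

```python
import io
import pandas as pd

# Hypothetical CSV text whose last two lines are a footer, not data.
raw = "a,b\n1,2\n3,4\nTotal rows: 2\nGenerated by export tool"

# skipfooter (no underscore) drops the given number of lines from the end.
df = pd.read_csv(io.StringIO(raw), skipfooter=2, engine="python")

print(df)
```
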
Page "Adding legends" in "Ticks, Labels, and Legends" in 9.1
Blocks above and below Figure 9-10

Error:
In [50]: ax.legend()

The `legend` method has several other choices for the location `loc` argument. See the docstring (with `ax.legend?`) for more information.

The `loc` legend option tells matplotlib where to place the plot. The default is `"best"`, which tries to choose a location that is most out of the way. To exclude one or more elements from the legend, pass no label or `label="_nolegend_"`.

Correct:
In [50]: ax.legend()

The `legend` method can take the `loc` option, which instructs matplotlib where to place the legend in the plot. The `loc` option defaults to `"best"`, which tries to choose a location that is most out of the way. To exclude one or more elements from the legend, pass no label or `label="_nolegend_"`. The `legend` method has several other choices for the `loc` argument. See the docstring (with `ax.legend?`) for more information.

Reason:
In the 2nd ed., the author passed `loc="best"` as the argument to `legend` in the code block 50, so readers could read the subsequent sentences under the assumption that the `loc` option could be passed. In the 3rd ed., the `loc` option is not passed to `legend` in the code block 50, so the explanation of the `loc` option seems abrupt.

Note from the Author or Editor:
I revised the text in this section

Noritada Kobayashi  Jan 07, 2023 
Page "Saving Plots to File" in 9.1
1st paragraph

Error:
You can save the active figure to file using the figure object’s savefig instance method.

Correct:
You can save the figure to file using the figure object’s savefig instance method.

Reason:
In the 2nd ed., the target of the operation was an active figure since the section described `plt.savefig`, but in the 3rd ed., since the `savefig` instance method of a figure object is described, I think the target of the operation does not need to be active.

Note from the Author or Editor:
I am removing "active" from the text

Noritada Kobayashi  Jan 07, 2023 
Page "Saving Plots to File" in 9.1
Table 9-2

Error:
`facecolor, edgecolor`
The color of the figure background outside of the subplots; `"w"` (white), by default.

Correct:
The color of the figure background outside of the subplots; default to `rcParams["savefig.facecolor"]` and `rcParams["savefig.edgecolor"]`, both of which default to `"auto"` (facecolor and edgecolor of the current figure).

Reason:
The default changed as of matplotlib 3.3.

Note from the Author or Editor:
I'm removing the part about the default altogether since it's pretty in the weeds

Noritada Kobayashi  Jan 07, 2023 
Page "Quantile and Bucket Analysis" in 10.3
paragraph spanning p. 339 and p. 340

Error:
We can pass `4` as the number of bucket compute sample quartiles, and pass `labels=False` to obtain just the quartile indices instead of intervals:

Suggestion for improvements:
We can pass `4` as the number of bucket to compute sample quartiles, and pass `labels=False` to obtain just the quartile indices instead of intervals:

Reason:
"to" may be missing.

Note from the Author or Editor:
I am adding the missing "to"

Noritada Kobayashi  Jan 14, 2023 
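
To illustrate the sentence discussed in the entry above, here is a small sketch with evenly spread, made-up data (an assumption, not the book's dataset). Passing 4 asks pandas.qcut for quartiles, and labels=False returns integer quartile indices instead of interval labels.

```python
import numpy as np
import pandas as pd

# Evenly spread hypothetical data so each quartile receives the same number of values.
data = pd.Series(np.arange(8, dtype=float))

intervals = pd.qcut(data, 4)                # categorical of interval bins
quartiles = pd.qcut(data, 4, labels=False)  # integer quartile indices

print(intervals.cat.categories)
print(quartiles.tolist())
```
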
Page "Exponentially Weighted Functions" in 11.7
3rd paragraph in p.400

Error:
with an exponentially weighted (EW) moving average with `span=60`

Correct:
with an exponentially weighted (EW) moving average with `span=30`

Reason:
The code uses `span=30`, and the 1st paragraph explains that specifying `span` makes the result comparable to a simple rolling window of the same width.

Note from the Author or Editor:
I'm fixing this in the text

Noritada Kobayashi  Mar 05, 2023 
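
The comparison referred to in the entry above can be sketched as follows. The series is a seeded random walk (an assumption standing in for the book's price data); both averages use a width of 30 so that they are comparable.

```python
import numpy as np
import pandas as pd

# Hypothetical price-like series: a seeded random walk on business days.
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=250, freq="B")
prices = pd.Series(100 + rng.standard_normal(250).cumsum(), index=index)

# Simple 30-period moving average versus an EW moving average with the same span.
ma30 = prices.rolling(30, min_periods=20).mean()
ewma30 = prices.ewm(span=30).mean()

print(ma30.tail(3))
print(ewma30.tail(3))
```
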
Page Chapter 2, Variables and argument passing section
3rd paragraph under the section

"In some languages, the assignment if b will cause the data [1, 2, 3] to be copied."

if -> of

Note from the Author or Editor:
confirmed

Jeremy Hageman  Aug 23, 2023 
Page Appendices- Advanced Numpy, A3 Broadcasting
P 667, 'demean_axis' function code

the last line of function definition of 'demean_axis' should be changed to 'return arr - means[tuple(indexer)]', from 'return arr - means[indexer]'.

Note from the Author or Editor:
will fix

Lance Lee  Sep 04, 2023 
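
With the correction from the entry above applied, the function can be sketched as below. The body is reconstructed around the quoted return line and should be read as an approximation of the book's code; the test array is an assumption.

```python
import numpy as np

def demean_axis(arr, axis=0):
    means = arr.mean(axis)

    # Build an index like [:, np.newaxis, :] with np.newaxis at the reduced axis.
    indexer = [slice(None)] * arr.ndim
    indexer[axis] = np.newaxis

    # Convert to a tuple: current NumPy does not interpret a plain list of
    # slices as a multidimensional index.
    return arr - means[tuple(indexer)]

arr = np.random.default_rng(12345).standard_normal((3, 4, 5))
demeaned = demean_axis(arr, axis=1)

print(demeaned.mean(axis=1))
```
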
Page Chapter 5, Indexing Selection and filtering, Selecting on dataframe with loc and iloc
2nd paragraph

The result of selecting a single row is a Series with an index that contains the DataFrame's column labels. To select multiple roles, creating a new DataFrame, pass a sequence of labels:


To select multiple rows
instead of
To select multiple roles

Note from the Author or Editor:
confirmed

Elombat Loic  Sep 05, 2023 
Page 112
2nd Paragraph

Wes, I hope you're doing well bro. Enjoying the paperback of edition 3!

This is a minor (possibly negligible) language clarification.

paragraph 2 reads:
[Here, arr.mean(axis=1) means "compute mean across the columns," where arr.sum(axis=0) means "compute sum down the rows"]. The choice of wording here is a bit confusing and could potentially be interpreted to mean the opposite of what it is saying.

May I suggest, [Here, arr.mean(axis=1) means "compute mean through the rows," where arr.sum(axis=0) means "compute sum through the columns"].

Note from the Author or Editor:
I will revise the language to use "over"

Daniel Gala  Nov 04, 2023 
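
Whatever wording is chosen, the behavior can be seen directly from the result shapes. The sketch below uses a small made-up array: the axis passed to the reduction is the one that disappears.

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])  # 2 rows, 3 columns

row_means = arr.mean(axis=1)  # axis 1 is collapsed: one mean per row
col_sums = arr.sum(axis=0)    # axis 0 is collapsed: one sum per column

print(row_means)
print(col_sums)
```
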
Page Creating ndarrays
n/a

In the two examples showing the data type of the arrays that NumPy creates:
In [27]: arr1.dtype
Out[27]: dtype('float64')

In [28]: arr2.dtype
Out[28]: dtype('int64')

The output of the dtype for arr2 is not int64 but int32.

Note from the Author or Editor:
I will add a note that the output might be int32 on some platforms

Jaeeun Choi  Nov 11, 2023 
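
The platform dependence mentioned in the author's note above can be avoided by requesting the dtype explicitly. A brief sketch, with array values chosen for illustration:

```python
import numpy as np

# The inferred integer dtype depends on the platform (int64 on most, int32 on some).
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr2.dtype)

# Requesting the dtype explicitly gives the same result everywhere.
arr2_64 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.int64)
print(arr2_64.dtype)
```
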
Page Chapter 10: Data Aggregation and Group Operations
Quantile and Bucket Analysis Section

In line "pandas has some tools, in particular pandas.cut and pandas.qcut", the referred section is incorrect.
Incorrect referred section: "Ch 8: Data Wrangling: Join, Combine, and Reshape, "
Correct referred section: "Ch 7: Data Cleaning and Preparation"

Note from the Author or Editor:
I will fix the reference

Thinh Pham  Nov 12, 2023 
Page https://wesmckinney.com/book/python-builtin#control_exceptions
at the first mention of the "finally:" block

The write_to_file() function is not defined.

Note from the Author or Editor:
write_to_file is a fake function for illustration's sake, but I'll clarify anyway

Sandor Budai  Nov 14, 2023 
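
For readers who want to run the try/finally pattern referred to above, here is a self-contained sketch. The write_to_file function below is a hypothetical stand-in defined only for this illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

def write_to_file(f):
    # Hypothetical stand-in for the undefined helper in the book's example.
    f.write("some data\n")

path = os.path.join(tempfile.mkdtemp(), "example.txt")
f = open(path, mode="w")
try:
    write_to_file(f)
finally:
    # Runs whether or not write_to_file raised an exception.
    f.close()

print(f.closed)
```
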
Page Section 4.1 - data types for ndarrays
second note

In the note it says "A signed integer can represent both positive and negative integers, while an unsigned integer can only represent nonzero integers. For example, int8 (signed 8-bit integer) can represent integers from -128 to 127 (inclusive), while uint8 (unsigned 8-bit integer) can represent 0 through 255."

Second part of the first sentence seems incorrect (nonzero integers)
It should most likely read ", while an unsigned integer can only represent non-negative integers."

The example makes that clear also.

Note from the Author or Editor:
confirmed, should be "non-negative"

Niclas Ericsson  Nov 28, 2023 
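
The ranges quoted in the note can be confirmed with np.iinfo, as in this short sketch:

```python
import numpy as np

int8_info = np.iinfo(np.int8)
uint8_info = np.iinfo(np.uint8)

print(int8_info.min, int8_info.max)    # signed: negative and positive values
print(uint8_info.min, uint8_info.max)  # unsigned: zero and positive values only
```
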
chapter 2
Chapter 2

can vs cann


Python Language Basics, IPython, and Jupyter Notebooks

Built-in Data Structures, Functions, and Files

"To check if two variables refer to the same object, use the is keyword. is not cann analogously be used to check that two objects are not the same:"

Note from the Author or Editor:
Corrected before publication. Thank you!

Anonymous  Dec 13, 2021  Aug 12, 2022
Other Digital Version
§2.3
Language Semantics\Binary operators and comparisons

"Python Language Basics, IPython, and Jupyter Notebooks
...
Language Semantics
...
Binary operators and comparisons
Most of the binary math operations and comparisons use familiar mathematical syntax used in other programming langauges:"

"languages" instead of "langauges"

Note from the Author or Editor:
Corrected before publication. Thank you!

Oussama Kiassi  Jan 12, 2022  Aug 12, 2022
Other Digital Version
1.2 Why Python for Data Analysis?, Solving the “Two-Language” Problem
Second paragraph

The first sentence of the paragraph lacks a verb:

"Over the last decade some new approaches to solving the "two-language" problem, such as the Julia programming language."

Note from the Author or Editor:
Corrected before publication. Thank you!

Ali Rahmjoo  Feb 15, 2022  Aug 12, 2022
Page 4.4 Array-Oriented Programming with Arrays
1st code block

In [169]: points = np.arange(-5, 5, 0.01) # 100 equally spaced points

-> this will return 1000 equally spaced points, not 100

Anonymous  Jan 14, 2024 
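
The count stated in the entry above is easy to verify; this sketch simply measures the length of the array produced by the quoted call.

```python
import numpy as np

points = np.arange(-5, 5, 0.01)
n_points = len(points)

print(n_points)
```
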
Page 7.5 Categorical Data
page 391

The input in [248] gives an error.

Here is the correct input:

%time
labels.astype('category')

Note from the Author or Editor:
fixing the code example

Marjorie Curry  Oct 30, 2022 
Page 11.6 Resampling and Frequency Conversion
Table 11-5

Expression:
Axis to resample on; default `axis=0`

Suggestion for improvements:
Axis to resample on; default `axis="index"`

Reason:
This is not a mistake, but since the 3rd edition seems to unify the specification of axis in pandas with `"index"` and `"columns"` instead of numbers, the specification with numbers may surprise the reader a little.

Note from the Author or Editor:
I am fixing in text

Noritada Kobayashi  Mar 03, 2023 
Page 11.6 Resampling and Frequency Conversion
Table 11-5

Error:
`fill_method` How to interpolate when upsampling, as in `"ffill"` or `"bfill"`; by default does no interpolation

Correct:
(deletion of description)

Reason:
This option has been removed from API in pandas v0.18.0. See doc/source/whatsnew/v0.18.0.rst in the pandas repository.

Note from the Author or Editor:
Removing from text

Noritada Kobayashi  Mar 04, 2023 
Page 11.6 Resampling and Frequency Conversion
Table 11-5

Error:
`limit` When forward or backward filling, the maximum number of periods to fill

Correct:
(deletion of description)

Reason:
This option has been removed from API in pandas v0.18.0. See doc/source/whatsnew/v0.18.0.rst in the pandas repository.

Note from the Author or Editor:
Removing in text

Noritada Kobayashi  Mar 04, 2023 
Page 11.7 Moving Window Functions
1st paragraph in p.399

Expression:
The `rolling` function also accepts a string indicating a fixed-size time offset rolling() in moving window functions rather than a set number of periods.

Reason:
The meaning of "rolling() in moving window functions", which was inserted in the 3rd edition, is difficult to understand. In the 2nd edition, the corresponding sentence was as follows:

The `rolling` function also accepts a string indicating a fixed-size time offset rather than a set number of periods.

Note from the Author or Editor:
This "rolling() in moving window functions" piece was inserted in the text by the indexer in error. It can either be removed or converted into its proper indexterm form

Noritada Kobayashi  Mar 05, 2023 
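
The sentence under discussion describes two ways of sizing the window, which the sketch below contrasts on a small made-up daily series (the data and the "3D" offset are assumptions for illustration).

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", periods=10, freq="D")
ser = pd.Series(np.arange(10, dtype=float), index=index)

by_count = ser.rolling(3).mean()      # window of exactly 3 observations
by_offset = ser.rolling("3D").mean()  # window covering the trailing 3 days

print(by_count.head(3))
print(by_offset.head(3))
```

With a time offset, the first windows are simply shorter rather than missing, so by_offset has no leading NaN values here.
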
Page 13.1 Bitly Data from 1.USA.gov
use the json module and its loads function invoked on each line in the sample file we downloaded

"import json
with open(path) as f:
    records = [json.loads(line) for line in f]"

However, invoking the loads function on each line of the sample file does not work for me. IPython/Jupyter raises an error: "UnicodeDecodeError: 'gbk' codec can't decode byte 0xac in position 6991: illegal multibyte sequence"

Note from the Author or Editor:
We need to add encoding="utf-8" when opening the file because this fails in China

Sam Z.H.  Oct 30, 2023 
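
The fix described in the author's note above can be sketched as follows. The file written here is a small hypothetical stand-in for the book's sample file; the relevant part is the explicit encoding="utf-8" argument, which avoids relying on the platform's default encoding.

```python
import json
import os
import tempfile

# Write a small hypothetical line-delimited JSON file containing non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write('{"city": "New York", "c": "US"}\n')
    f.write('{"city": "Zürich", "c": "CH"}\n')

# Passing the encoding explicitly makes the read independent of the locale.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(records[1]["city"])
```
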
Page 29
1st paragraph

'If you bind a new object to a variable inside a function, that will not overwrite a variable of the same name in the "scope" outside the function (the "parent scope").'

I believe the correct wording is "... that will overwrite a variable ...", as demonstrated in the example below the paragraph.

Note from the Author or Editor:
the language is unclear, I will revise

John Maciel  Oct 03, 2023 
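
The distinction behind the disputed sentence is between binding a name to a new object and mutating an existing object. The following sketch uses hypothetical function names to show both cases; it is an illustration, not the book's revised text.

```python
def rebind(some_list):
    # Binds the local name to a new object; the caller's list is untouched.
    some_list = [0]
    return some_list

def append_element(some_list, element):
    # Mutates the object that the caller's variable also refers to.
    some_list.append(element)

data = [1, 2, 3]

rebind(data)
after_rebind = list(data)

append_element(data, 4)
after_append = list(data)

print(after_rebind, after_append)
```
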
Page 88
Table 4-1, third entry, 'arange'

the Python built-in range() function does not return a list but a generator

Note from the Author or Editor:
fixing

Claas Rostock  Dec 26, 2022 
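
A short sketch of the behavior in question: in Python 3 the built-in range returns a lazy range object rather than a list. Strictly speaking it is a sequence type (it supports len and indexing) rather than a generator, and it can be materialized with list when needed.

```python
import numpy as np

r = range(5)
print(type(r))            # a lazy range object, not a list

as_list = list(r)         # materialize explicitly when a list is needed
as_array = np.arange(5)   # NumPy counterpart, which returns an ndarray

print(as_list, as_array)
```
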
Page 104
Table 4-3

uniform appears two times in the table

Note from the Author or Editor:
fixing

Claas Rostock  Dec 26, 2022 
Page 133
first paragraph and code block

"If you assign a Series, its labels will be realigned exactly to the DataFrame's index ..."

In [65]: val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])

This does not demonstrate any matching of frame2's index to the Series index.
It would be more informative as something like '... index=["two", 4, "five"]'

Note from the Author or Editor:
I am fixing the code example

Gregory Sherman  Feb 21, 2023 
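
One way to make the realignment visible is to include a label that does not exist in the DataFrame's index. The sketch below uses a small made-up frame (not the book's frame2) to show that unmatched Series labels are dropped and unmatched rows become missing.

```python
import pandas as pd

# Hypothetical frame with a string index.
frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"]},
                     index=["one", "two", "three"])

# "four" is not in frame's index, so its value is discarded;
# "one" and "three" have no matching label and become NaN.
val = pd.Series([-1.2, -1.5], index=["two", "four"])
frame["debt"] = val

print(frame)
```
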
Page 166 (3rd edition)
middle of page

It is *not* true that "if any value is not NA, then the result is NA." Apparently the default is to skip (exclude) NA values.

Note from the Author or Editor:
Yes, the language needs to be fixed to indicate that the result will be the sum of the non-NA values

Michael VanValkenburgh  Nov 15, 2022 
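
The default behavior confirmed in the author's note above, and the stricter alternative, can be compared in a brief sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, np.nan]})

skipped = df.sum()             # NA values are excluded by default (skipna=True)
strict = df.sum(skipna=False)  # any NA in a column makes that column's sum NA

print(skipped)
print(strict)
```
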
Page 176
mid

You write:
"
Indexing
Can treat one or more columns as the returned DataFrame..
"
Is this correct, or did you mean "treat .. as index of the returned DataFrame"?

Note from the Author or Editor:
fixing

Claas Rostock  Dec 28, 2022 
Page 207
paragraph at top & [38]

"Suppose you want to keep only rows containing at most a certain number of missing observations. You can indicate this with the thresh argument."

In fact, as command[38] shows, with thresh=2, only rows with <2 missing values were kept.
In the first sentence of the page, "at most" can be replaced with "less than".

Note from the Author or Editor:
I am correcting to "less than" in the text

Gregory Sherman  Mar 03, 2023 
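
The thresh argument counts non-missing values, which is why the wording matters. In the sketch below (a made-up three-column frame), thresh=2 keeps rows with at least two non-NA values, which for three columns is the same as fewer than two missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# thresh=2 keeps rows that have at least 2 non-NA values.
kept = df.dropna(thresh=2)

print(kept)
```
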
Page 274 (third edition)
In [151]:

The line "In [151]: ..." appears to be superfluous, a holdover from the second edition.

Note from the Author or Editor:
will fix

Michael VanValkenburgh  Nov 29, 2022 
Page 282 (third edition)
second sentence of 9.1

Will you please clarify the difference between
%matplotlib inline
and
%matplotlib notebook
?
For example, Figure 9-15 on page 302 works with notebook but is blank with inline,
and Figure 9-19 on page 307 works with inline but partially overwrites Figure 9-18 with notebook.

Note from the Author or Editor:
I will clarify

Michael VanValkenburgh  Nov 30, 2022 
Page 301 (third edition)
Table 9-4

In Table 9-4, I believe the argument is "layout" (singular).

Note from the Author or Editor:
will fix

Michael VanValkenburgh  Nov 29, 2022 
Page 310
last paragraph

In the text, you say histplot can plot both histogram and density plot simultaneously, but then (in Figure 9-23) you only plot the histogram. I wonder if you intended to use kde=True so that both are plotted.

Note from the Author or Editor:
You're right, I will fix

Alex Dow  Aug 24, 2023 
Page 317
first sentence in 9.3

"there [are] many options..." (insert "are")

Note from the Author or Editor:
will fix

Michael VanValkenburgh  Nov 30, 2022