Errata for Python for Data Analysis, Second Edition

The errata list contains errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Chapter 2
Subsection :- Duck Typing

"this means it has a __iter__ “magic method,” though an alternative"

The comma after method should actually be after ".

Note from the Author or Editor:
Changing this in the source material

Naman Bhalla  Nov 11, 2017  Sep 21, 2018
Ch11
subsection "Converting between string and datetime"

In the part discussing converting datetime objects from strings, you say that strptime uses the same format codes as strftime, but that's not quite right:

value = '2011-01-03'
stamp = datetime.strptime(value, '%Y-%m-%d') # works
datetime.strptime(value, '%F') # ValueError: 'F' is a bad directive in format '%F'
datetime.strftime(stamp, '%F') # works

Note from the Author or Editor:
Quite right. Fixing the language to say "many of the same"

Alex Branham  Dec 04, 2017  Sep 21, 2018
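
A minimal sketch of the asymmetry described in this entry, using only the standard library. Because support for '%F' depends on the platform and the Python version, the sketch probes it rather than assuming either outcome; the variable names are illustrative.

```python
from datetime import datetime

value = "2011-01-03"

# Spelling out the directives works for both parsing and formatting.
stamp = datetime.strptime(value, "%Y-%m-%d")
roundtrip = stamp.strftime("%Y-%m-%d")

# '%F' is a platform shortcut; probe whether strptime accepts it here
# instead of assuming that it does or does not.
try:
    datetime.strptime(value, "%F")
    parse_f_supported = True
except ValueError:
    parse_f_supported = False

print(stamp, roundtrip, parse_f_supported)
```

Writing '%Y-%m-%d' explicitly avoids the question altogether and behaves the same in both directions.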
??
Integer Indexes Section Paragraph 4

In the Integer Indexes section of Chapter 5 the following paragraph is ambiguous:

"To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):"

To test this out I defined the following object:


`ser3 = Series(np.arange(4.), index=['a', 'b', -1, 34])`

and ran these two commands, both of which return 2.0:

`ser3[-1]`
`ser3[-2]`

`ser3.index` gives me "Index(['a', 'b', -1, 34], dtype='object')"

So, I think you could argue that the way Pandas actually works has some ambiguity to it and that the way the book describes it is the way it SHOULD work. But to describe the actual way this part of Pandas works, the following paragraph would be more accurate:

"To keep things consistent, if you have an axis index containing exclusively integers (mixed indexes will match on label first, and fall back to positional indexing), data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):"

Or something of that nature.

Note from the Author or Editor:
will improve

Bob McDonald  Dec 16, 2018 
Mobi

Current
Afer this operation, the variable a is unmodified:

Suggested
After this operation, the variable...

O'Reilly Media
 
Sep 17, 2019 
Other Digital Version
location 1741
top

Found this error on the kindle version, location 1741.

the line: In[76]: seq[3:4] = [6,3]

should be: In[76]: seq[3:4] = [6]

Note from the Author or Editor:
Confirmed. Will fix

Ravi  Nov 18, 2019 
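
To make the difference between the two assignments concrete, here is a small sketch of slice assignment. The starting list is an assumed example, not necessarily the one printed at that Kindle location.

```python
# Assumed starting list, for illustration only.
seq = [7, 2, 3, 7, 5, 6, 0, 1]

# Assigning two items to a one-element slice grows the list by one.
two_for_one = list(seq)
two_for_one[3:4] = [6, 3]

# Assigning one item to a one-element slice keeps the length unchanged.
one_for_one = list(seq)
one_for_one[3:4] = [6]

print(two_for_one)
print(one_for_one)
```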
Page 49
section 2.3.2.7 table 2-5: Datetime formatting details

Table 2-5: datetime formatting details
%F: shortcut for %Y-%m-%d (e.g., 2012-4-18)
The example should actually be 2012-04-18, because %m is a two-digit month.

Note from the Author or Editor:
fixing in text

chengjq  Apr 18, 2022 
Page Ch. 1. Installing Necessary Packages
Note 2, installing packages into conda environment

In Note 2, Ch. 1. Installing Necessary Packages:

Installing packages into conda environment uses activate instead of install:

Should be
conda install lxml beautifulsoup4 html5lib tables openpyxl \
requests sqlalchemy seaborn scipy statsmodels \
patsy sklearn

Alexander de la Paz  Jul 24, 2022 
Ch5
Indexing, selection, and filtering; Table 5-6. Indexing options with DataFrame

`df.iloc[where]` Selects single row or subset of rows from the DataFrame by label.

Should probably be "...from the DataFrame by *integer position*."

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
Other Digital Version
Ch5
Integer Indexes; 4th paragraph

"an axis index containing *itnegerse*, data selection"

"integers" is spelled incorrectly.

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
Ch5
Handling Missing Data; 2nd paragraph

"The way that missing data is represented in pandas object is somewhat imperfect, but it is functional for a lot of *usres*."

*users* is spelled incorrectly.

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
Ch5
Sorting and ranking; within the code examples

It looks like the `by` argument to `.sort_index(by=...)` is deprecated.

In [203]: frame.sort_index(by='b')
FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)

In [207]: frame.sort_index(by=['a','b'])
FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)

In [205]: frame.sort_values(by='b')

fixed the problem.

In [211]: pd.__version__
Out[211]: '0.19.2'

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
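
A short sketch of the replacement call named in the warning, assuming a small made-up frame rather than the one in the book.

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort rows by the values in one column.
by_b = frame.sort_values(by="b")

# Sort by several columns: first 'a', then 'b' within each 'a' group.
by_a_then_b = frame.sort_values(by=["a", "b"])

print(by_b)
print(by_a_then_b)
```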
Ch5
Summarizing and Computing Descriptive Statistics; code block 3

The input variable df is:

In [187]: df
Out[187]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

The code in the book gives this result:

In [204]: df.sum(axis=1)
Out[204]:
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64

but shouldn't row "c" be "NaN" since we're summing together two NaNs? Here's what I get from my interpreter:

In [186]: df.sum(axis=1)
Out[186]:
a 1.40
b 2.60
c NaN
d -0.55
dtype: float64

In [188]: pd.__version__
Out[188]: '0.19.2'

Yung-Jin (Joey) Hu  Feb 14, 2017  Sep 25, 2017
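
The result for row "c" depends on the pandas version. The sketch below assumes a release newer than the 0.19.2 reported above, where the sum of an all-NaN row defaults to 0.0 and `min_count=1` asks for NaN instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default in newer pandas releases: NaNs are skipped, so an all-NaN row sums to 0.0.
default_sums = df.sum(axis=1)

# Require at least one non-NaN value per row; otherwise the result is NaN.
strict_sums = df.sum(axis=1, min_count=1)

print(default_sums)
print(strict_sums)
```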
Ch6
Note within "Indentation, not braces" section

"I strongly recommend that you use 4 spaces *to* as your default indentation..."

Should probably be:

"I strongly recommend that you use 4 spaces as your default indentation..."

by removing the word "to" before the second half of the sentence "... as your default indentation".

Yung-Jin (Joey) Hu  Feb 26, 2017  Sep 25, 2017
Ch6
Slicing Section, 3rd paragraph

"While element at the start index is included, the stop..."

Should probably be:

"While *the* element at the start index is included, the stop..."

Yung-Jin (Joey) Hu  Feb 26, 2017  Sep 25, 2017
?
Table 3-4

There are two rows in the table that describe the readlines function.

Daniel Walter  Aug 02, 2017  Sep 25, 2017
Ch. 4
Table 4-2. NumPy data types

The fourth row of Table 4-2 reads "Signed and unsigned 32-bit integer types", but 32-bit is already the third row. This must be changed to 64.

Kim, Jin  Sep 10, 2017  Sep 25, 2017
3
Boolean Indexing 5th paragraph

The line of code:

data[-(names == 'Bob')]

Gives the deprecation warning:

DeprecationWarning: numpy boolean negative, the `-` operator, is deprecated, use the `~` operator or the logical_not function instead.

using numpy version 1.12.0

Using the tilde operator, as recommended, silences the warning.

Yung-Jin (Joey) Hu  Jan 31, 2017  Sep 25, 2017
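
A minimal sketch of the two alternatives named in the warning. The arrays are assumed stand-ins for the book's `names` and `data`.

```python
import numpy as np

# Assumed example arrays, one name per row of data.
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.arange(14).reshape(7, 2)

# Invert the boolean mask with ~ instead of the deprecated unary minus.
not_bob = data[~(names == "Bob")]

# np.logical_not is the function form of the same operation.
also_not_bob = data[np.logical_not(names == "Bob")]

print(not_bob)
```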
3.1
ZIP section

Code works as it should, but the first name and last name are reversed for Curt Schilling. Need to be a Red Sox fan to pick up on this one.

('Schilling', 'Curt') should read ('Curt', 'Schilling') in the following code:


In [96]: pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'),
....: ('Schilling', 'Curt')]

Note from the Author or Editor:
Confirmed. Will fix

Anonymous  Oct 01, 2018 
5
statsmodel section

I am not sure of the page number since I am using Safari books online which doesn't do pagination.

In the statsmodel section of chapter 1, line 6, the word "grown" is misspelled (gornw).

Bala Ganeshan  Dec 10, 2016  Sep 25, 2017
PDF
Page 9
Last line of Windows discussion

Text states: To exit the shell, press Ctrl-D or type the command exit() and press return.

On Windows, Ctrl-Z should be used.

David Welden  Sep 16, 2017  Sep 25, 2017
Printed
Page 17
2nd Paragraph

The reference to the IPython should be "Appendix B", not "Appendix A".

Note from the Author or Editor:
Confirmed. Will fix

John Boersma  Nov 11, 2018 
18
Top of second page of Chapter 2

The example uses 1.usa.gov data. This service has been shut down. It would be a pain to craft a whole new opening example, but you might want to. Even if you don't, you might want to let people know it's no longer online so they don't look for it.

https://blog.usa.gov/decommissioning-1-usa-gov

https://github.com/usagov/1.USA.gov-Data

Note from the Author or Editor:
You are right. I added a note that it is decommissioned.

John Transue  Dec 06, 2016  Sep 25, 2017
Printed
Page 19
command

$jupyter notebook

fails under Windows 10 Command Prompt: "Error executing Jupyter command 'notebook': [Errno 'jupyter-notebook' not found] 2"

version 4.4.0: "Available subcommands: kernel kernelspec migrate run troubleshoot"
Trying 'run', there was no response - the command simply hung

Note from the Author or Editor:
will add clarifying comment how to install the notebook

Gregory Sherman  Jan 14, 2019 
Printed
Page 29
top text and commands

Magic functions can be used by default without the percent sign ...

Some magic functions behave like Python functions and their output can be assigned to a variable:

In [22]: %pwd
Out [22]: '/home/west/code/pydata-book/

In [23]: foo = %pwd

----------------------------------------------------------------------------------

First, a single quote is missing from Out[22]

With ipython 6.3.1, although In [22] works using pwd without the leading percent sign, In [23] fails with "NameError: name 'pwd' is not defined"

Note from the Author or Editor:
Fixing this typo

Gregory Sherman  Apr 13, 2018  Sep 21, 2018
Printed
Page 29
first sentence

I previously reported this issue, but it's a problem beyond the typo that was addressed in the reply.

"Magic functions can be used by default without the percent sign..."

This is not completely true.
For example, this variation of In[23] will not work:
foo = pwd

The % in front of the magic command can be skipped (by default) if the command is the first "word" on an IPython line. I have found that leading whitespace is not a problem.

Note from the Author or Editor:
Thanks. I will clarify that this is the only scenario where the % can be omitted

Gregory Sherman  Apr 23, 2019 
Printed
Page 30
Figure 2-6

Running the matplotlib code exactly as printed inside Figure 2-6 gives a TypeError:

TypeError: float() argument must be a string or a number, not 'builtin_function_or_method'

Note from the Author or Editor:
Thank you. Will fix

James Shenton  Apr 15, 2020 
PDF
Page 37-38
Table 2-3

Table 2-3. Binary operators
Missing the modulo (%) operator.

Note from the Author or Editor:
Thanks. Will add for the 3rd edition

Ali Tobah  Sep 02, 2020 
PDF
Page 38
table 2-3

inconsistent description of a <= b, a < b (compared to the next line), should be for a < b, a <= b

Note from the Author or Editor:
Confirmed, will fix (in 3rd edition)

B. Goas  Feb 13, 2019 
PDF
Page 38
First paragraph of "Mutable and immutable objects"

Text says "modifiedK", should be "modified:"

David Welden  Sep 17, 2017  Sep 25, 2017
Printed
Page 39
Third paragraph under "Numeric Types"

"Integer division not resulting in a whole number will always yield a floating-point number."

Actually, this is true for whole numbers too:

In [1]: 4/2
Out [1] : 2.0

Note from the Author or Editor:
Quite right. Will fix (in the 3rd edition)

John Boersma  Nov 11, 2018 
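
The point is easy to check in Python 3: `/` always produces a float, while `//` is the operator that keeps integers as integers.

```python
# True division returns a float even when the result is a whole number.
true_div = 4 / 2

# Floor division of two ints returns an int.
floor_div = 4 // 2

print(type(true_div), true_div)
print(type(floor_div), floor_div)
```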
Printed
Page 44
Last paragraph of "None" section

"but also a unique instance of NoneType" should be "but also the unique instance of NoneType".

Note from the Author or Editor:
Thanks. Will fix

John Boersma  Nov 11, 2018 
Printed
Page 46
Code blocks in 2nd and 3rd paragraphs

Two illegal print statements:

print('It's negative')

Since the strings contain single quote characters, they should be delimited by double quotes.

Note from the Author or Editor:
Thank you, fixed

Michael Clark  Nov 05, 2017  Sep 21, 2018
Printed
Page 46
Table 2-5

2012-4-18 should be 2012-04-18

Note from the Author or Editor:
Confirmed. Will fix

John Boersma  Nov 11, 2018 
Printed
Page 47
2

2nd edition: under the "for loops" section, 1st line, "iterater" --> "iterator"
(otherwise not consistent with pg 50 where 'iterator' is also mentioned).

Note from the Author or Editor:
Confirmed typo.

E G  Mar 08, 2020 
PDF, ePub
Page 51
2nd sentence below 'pass' heading

" ... to be taken (or as a placeholder for code not yet implemetned); ..."

"implemented" is spelled incorrectly.

Greg Graham  Jun 06, 2017  Sep 25, 2017
Printed
Page 56
First sentence

Sentence should read: "...which locates the first such value and removes it from the list..."

Note from the Author or Editor:
Thanks! Fixing the typo

Thomas Koundakjian  Nov 09, 2017  Sep 21, 2018
PDF
Page 60
4th line

Although it's never explicitly described as such, the output of the dictionary on this page is showing the dictionary as unordered. This would be incorrect, as the book is utilizing Python 3.6, the version in which dictionaries changed to insertion-ordered.

Note from the Author or Editor:
Quite right. I'll correct this with the 3rd edition changes

David Bankson  Jun 04, 2020 
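
A small sketch of the ordering behavior the entry refers to. Insertion order is a language guarantee from Python 3.7 and an implementation detail of CPython 3.6; the keys here are made up.

```python
# Keys are inserted in a deliberately non-alphabetical order.
d = {}
d["b"] = 1
d["a"] = 2
d["c"] = 3

# Iteration follows insertion order, not sorted order.
keys = list(d)
print(keys)
```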
Printed
Page 61
Top

In the example demonstrating zip, you use the names of three pitchers: Nolan Ryan, Roger Clemens, and Curt Schilling. In the example, you use zip to show first names and last names; the first_names has ('Nolan', 'Roger', 'Schilling') and last_names has ('Ryan', 'Clemens', 'Curt')

Curt is his first name and Schilling is his last name, so Curt should be in first_names and Schilling in last_names.

Note from the Author or Editor:
Fixing this mistake

Jon Ernster  Nov 28, 2017  Sep 21, 2018
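
With the names in the corrected order, "unzipping" the pairs gives the expected first and last names. This is a sketch of the corrected example, not a copy of the book's listing.

```python
pitchers = [("Nolan", "Ryan"), ("Roger", "Clemens"), ("Curt", "Schilling")]

# zip(*pairs) transposes a list of pairs into two tuples.
first_names, last_names = zip(*pitchers)

print(first_names)
print(last_names)
```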
PDF
Page 65
2nd

Hello my friend.
This is related to the usage and definition of a set with reference to the 'set' function in Python. While you may or may not agree that this is a minor technical mistake, it is a mistake in terms of accuracy/precision. While a set is an unordered collection of unique elements, a set as defined in Python seems to be a 'sorted unordered collection of unique elements.' Thus, depending on the input to set(), the actual output would vary under those definitions, subtle as they might be.

Note from the Author or Editor:
I will clarify.

E G  Mar 10, 2020 
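
One way to look at the observation above: the order in which a set displays its elements is an implementation detail (small integers often happen to print in ascending order), so code should not rely on it. The sketch below uses made-up values and shows the two behaviors that are guaranteed.

```python
small_ints = {3, 1, 2}
words = {"pear", "apple", "fig"}

# Equality ignores order entirely.
same = small_ints == {1, 2, 3}

# When a stable order is needed, sort explicitly.
ordered_words = sorted(words)

print(same, ordered_words)
```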
Printed
Page 66
Table 3-1. Python set operations

The alternative syntax for a.issubset(b) and a.issuperset(b) should be "<=" and ">=" respectively (not N/A).

Note from the Author or Editor:
Fixing this.

Daniel Andersson  Feb 03, 2018  Sep 21, 2018
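
A quick check of the operator forms against the method forms, using made-up sets.

```python
a = {1, 2, 3}
b = {1, 2, 3, 4, 5}

# a <= b is the operator form of a.issubset(b).
subset_op = a <= b
subset_method = a.issubset(b)

# b >= a is the operator form of b.issuperset(a).
superset_op = b >= a
superset_method = b.issuperset(a)

print(subset_op, subset_method, superset_op, superset_method)
```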
Printed
Page 66
Last paragraph.

"Like dicts, set elements generally must be immutable." should be "Like dict keys, set elements generally must be immutable."

Note from the Author or Editor:
Confirmed, will fix (in 3rd edition). Thank you

John Boersma  Nov 11, 2018 
Printed
Page 70
bottom of page

Suppose instead we had declared a as follows:
a = []
def func():
for i in range(5):
a.append(i)
========================================
The sentence implies that an explanation of what happens will follow, but there is none.

Note from the Author or Editor:
Good catch. I'm adding a code example to show how the alternate example works

Gregory Sherman  Apr 14, 2018  Sep 21, 2018
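
For readers wondering what the alternate declaration does, here is a minimal sketch: because `a` lives in the enclosing (module) scope and is only mutated, not reassigned, the function modifies it in place.

```python
a = []

def func():
    # No assignment to `a` happens here, so the name refers to the
    # module-level list and append() mutates it in place.
    for i in range(5):
        a.append(i)

func()
print(a)
```

Calling `func()` a second time would append another five items to the same list.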
Printed
Page 80
Top line

"As you will see later in the chapter, you can step into the stack (using the %debug or %pdb magics)..." should start with "As you will see in Appendix B..."

Note from the Author or Editor:
Correct. Will fix

John Boersma  Nov 11, 2018 
Printed
Page 82
bottom of page; first entry of Table 3-4

Method Description
read([size]) Return data from a string, with optional size
argument indicating the number of bytes to read
=============================================================

The "number of bytes" assertion is contradicted on the next page - "Python reads enough bytes ... to decode that many characters" and in the read() docstring - "Read at most n characters from stream."

Note from the Author or Editor:
Adding language to indicate that whether bytes or unicode are read depends on the mode of the file

Gregory Sherman  Apr 15, 2018  Sep 21, 2018
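
The mode dependence mentioned in the author's note can be shown without touching the filesystem. This sketch uses in-memory streams from the standard library, and the sample string is made up.

```python
import io

# "naive cafe" with accented letters, so some characters take two bytes in UTF-8.
payload = "na\u00efve caf\u00e9".encode("utf-8")

# Binary mode: read(5) returns exactly 5 bytes.
binary_stream = io.BytesIO(payload)
first_bytes = binary_stream.read(5)

# Text mode: read(5) returns 5 decoded characters, consuming 6 bytes here.
text_stream = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")
first_chars = text_stream.read(5)

print(first_bytes, first_chars)
```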
Page 89
3rd paragraph

I can run the following code in a Jupyter notebook:

import numpy as np
my_arr= np.arange(100000)
my_list = list(range(100000))
%time for _ in range(10): my_arr2 = my_arr*2

However, if I run it in another Python IDE, it gives the following error. Can you help me with this? What is the syntax for other Python IDEs? Thanks.

>>> time for _ in range(10): my_arr2 = my_arr*2
File "<stdin>", line 1
time for _ in range(10): my_arr2 = my_arr*2
^^^
SyntaxError: invalid syntax

Note from the Author or Editor:
Adding clarification that %timeit only works within IPython or Jupyter

Anonymous  Mar 29, 2022 
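
Outside IPython or Jupyter, the standard library's `timeit` module is one possible substitute for the magic command. The sketch below times a plain-list version of the doubling operation so that it runs without NumPy; it is an illustration, not the book's benchmark.

```python
import timeit

my_list = list(range(100000))

# Run the statement 10 times and return the total elapsed seconds.
elapsed = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"{elapsed:.4f} seconds for 10 runs")
```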
Printed
Page 100
Warning box mid page

The warning claims boolean selection will not fail if the boolean array is not the correct length. I think this was changed in Numpy 1.13, but is definitely not true in Numpy 1.14.2

For example:

x = np.random.randn(5,5)
y = np.array(['a','b','c', 'a', 'b', 'c', 'd', 'd', 'd'])
x[y == 'a']

IndexError: boolean index did not match indexed array along dimension 0; dimension is 5 but corresponding boolean dimension is 9

Note from the Author or Editor:
Removing the caution box

Mladen Kolovic  Apr 08, 2018  Sep 21, 2018
Printed
Page 103
3rd paragraph

1st release print copy says, “...the result of fancy indexing is always one-dimensional.” However, there are example outputs in this section with more than one dimension. Is that because some of the examples in the section are not fancy indexing? If that’s the case, it’s unclear where the section is building up to a fancy indexing example as opposed to every example being fancy indexing. The number of dimensions in the output seems to be the number of array dimensions plus one minus the number of dimensions indexed.

Note from the Author or Editor:
The text is unclear, I will clarify

Stephen Frost  Feb 19, 2018  Sep 21, 2018
PDF
Page 108
Table 4-3. Unary ufuncs

Missing term: "Natural logarithm (base e), log base 10, log base 2, and , respectively".

A. Jesse Jiryu Davis  May 23, 2017  Sep 25, 2017
Printed
Page 112,121
first sentence of 112. "Simulating ..." on 121

pg 112 first sentence:
Here, arr.mean(1) means "compute mean across the columns" where arr.sum(0) means "compute sum down the rows"

conflicts with pg. 121 "Simulating Many Random Walks at Once":
"we can compute the cumulative sum across the rows"
.
.
.
In [262]: walks = steps.cumsum(1)

Note from the Author or Editor:
Thanks. Will clarify language

Gregory Sherman  Jan 04, 2019 
Printed
Page 114
1

May want to specify arr.mean(1) is the same as arr.mean(axis=1).

The fewer assumptions the reader has to make, the better.

Note from the Author or Editor:
I agree this would be clearer. I'll clarify in the 3rd edition

Shivan Sivakumaran  Oct 03, 2020 
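
A small sketch showing that the positional and keyword forms are the same call, and which axis each one collapses. The array is made up.

```python
import numpy as np

arr = np.arange(12.0).reshape(3, 4)

# Positional and keyword forms are equivalent.
row_means = arr.mean(1)
row_means_kw = arr.mean(axis=1)

# axis=1 collapses the columns (one value per row);
# axis=0 collapses the rows (one value per column).
col_sums = arr.sum(axis=0)

print(row_means, col_sums)
```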
PDF
Page 123
1st Paragraph

"operations" is misspelled at the location, "Using NumPy functions or NumPy-like oeprations..."

Ryan Shuhart  Jan 05, 2017  Sep 25, 2017
Printed
Page 126
first paragraph

The book states that when you pass only a dict, the index of the resulting Series will have the dict's keys in sorted order. However, this is not always the case. Running the code on my system gives the output pasted below, which shows that the returned Series is not in sorted order.

sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}

obj3=pd.Series(sdata)

obj3
Out[49]:
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64

Note from the Author or Editor:
Correct, pandas now obeys the insertion order of the dict. I will fix in the 3rd edition

Howard Smith  Aug 30, 2018 
Printed
Page 126
2nd paragraph

"You can override this by passing the dict keys in the order you want them to appear in the resulting Series"
However, given the [29]- [31] commands, the actual result is
Out[31]:
Oregon 16000.0
California NaN
Texas 71000.0
Ohio 35000.0

Note from the Author or Editor:
will review

Gregory Sherman  Jan 04, 2019 
Printed
Page 128
Ch5.1: Introduction to pandas Data Structures: Series - 8th Para

The text says "When you are only passing a dict, the resulting Series will have the dict's keys in sorted order".

This doesn't appear to be true, either with the example given in the book, or with a repro (which proves the example is not the error). These keys seem _un_sorted to me when only passing in a dict.

>>> import pandas as pd
>>> sdata = { 'Ohio':35000, 'Texas':71000, 'Oregon':16000,'Utah': 5000 }
>>> obj3 = pd.Series(sdata)
>>> obj3
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
>>> obj3.index
Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object')

Note from the Author or Editor:
Confirmed that pandas now respects the "insertion order" of keys when creating a Series from a dictionary. Will fix in the 3rd edition

Gavin Draper  Mar 16, 2021 
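
The behavior confirmed in the author's note can be reproduced as follows, assuming a recent pandas on Python 3.7 or later. The `states` list is an assumed example of an explicit index, not part of the report.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

# With only a dict, the index follows the dict's insertion order.
obj3 = pd.Series(sdata)

# Passing an explicit index controls the order; labels missing from
# the dict get NaN, and dict keys not listed are dropped.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

print(obj3)
print(obj4)
```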
Printed
Page 138
In[104]

As presented, this line leads to "FutureWarning: Passing list-likes to .loc with missing label will raise KeyError in future." Revise to avoid warning.

Note from the Author or Editor:
Confirmed, will fix

John Boersma  Nov 11, 2018 
PDF
Page 141
Sentence beginning with word Setting in italics

Word "section" is misspelled as "sectino"

Anonymous  Sep 20, 2017  Sep 25, 2017
PDF
Page 145
last paragraph

"To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented."

I think this should be integer-oriented.

Note from the Author or Editor:
The code example does not illustrate the intended behavior. I am changing the example to be "ser[-1]" instead of "ser[:2]" and added a note that slicing with integers ignores the integer labels

Yang Yang  Oct 18, 2017  Sep 21, 2018
Printed
Page 145, 146
final paragraph & code following

[similar to a previously reported issue]

"... if you have an axis index containing integers, data selection will
always be label-oriented. For more precise handling, use loc (for labels) ..."

In [147]: ser[:1]
Out[147]:
0 0.0
dtype: float64

In[148]: ser.loc[:1]
Out[148]:
0 0.0
1 1.0
dtype: float64

In[149]: ser.iloc[:1]
Out[149]:
0 0.0
dtype: float64


The series "ser" is indexed by integers, so - according to the text - data selection should be label-oriented (in the absence of loc or iloc). However, Out[147] is identical to Out[149], which results from using iloc, so the "ser[:1]" data selection appears to be integer-oriented.

Note from the Author or Editor:
will clarify

Gregory Sherman  Apr 30, 2019 
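
The two explicit indexers remove the ambiguity described above. A minimal sketch with an integer-indexed Series:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))

# loc slices by label and includes the end label.
by_label = ser.loc[:1]

# iloc slices by position and excludes the end position.
by_position = ser.iloc[:1]

print(by_label)
print(by_position)
```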
Printed
Page 158
In [233]

Row 'c' is populated with 0.0, not NaN.

Note from the Author or Editor:
Confirmed. Will fix in the 3rd edition revision.

John Boersma  Nov 11, 2018 
PDF
Page 160
Table 5-8

The text says: "argmin, argmax - Compute index locations (integers) at which minimum or maximum value obtained, respectively"

Should be: "argmin, argmax - Compute index labels for Series at which minimum or maximum value obtained, respectively"
-----------------------------------------------------------------------
Example from this chapter - returns label, not integer
In [115]: df.loc['d'].argmin()
Out[115]: 'two'

Note from the Author or Editor:
Per https://github.com/pandas-dev/pandas/issues/16830 this is supposed to return the positional values but did not for a while because of some changes in pandas. In the future, it will do the right thing (what the book says now), so I'm not going to change the book

Andrey Dubinchak  Dec 14, 2017  Sep 21, 2018
PDF
Page 164
table 5-9

Looks like instead of method "match" there should be "get_indexer"

Note from the Author or Editor:
Fixing this to "get_indexer"

Aivar Annamaa  Nov 18, 2017  Sep 21, 2018
Printed
Page 172
Table 6-2

For the argument "names", combining with "header=None" is not needed. Using the parameter "names" implies this.

Note from the Author or Editor:
Right. Will fix

John Boersma  Nov 11, 2018 
PDF
Page 173
in Table 6-2

The 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012.

Note from the Author or Editor:
I am changing the text to correspond to changes in the latest version of pandas

Noritada Kobayashi  Nov 05, 2017  Sep 21, 2018
PDF
Page 174
after Out[38]

As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012.

Note from the Author or Editor:
I am changing the text to correspond with changes in pandas

Noritada Kobayashi  Nov 05, 2017  Sep 21, 2018
PDF
Page 175
top

As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012.

Note from the Author or Editor:
I am changing the text to correspond to the current version of pandas

Noritada Kobayashi  Nov 05, 2017  Sep 21, 2018
Printed
Page 176
Bottom of page

"tuples of values" should be "lists of values".

Note from the Author or Editor:
Confirmed. Thanks

John Boersma  Nov 11, 2018 
PDF
Page 179
Out[64]: result

In [64]: result
Out[64]:
{'name': 'Wes',
'pet': None,
'places_lived': ['United States', 'Spain', 'Germany'],
'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
{'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]}

'pet': None,
Should print after the line:
'places_lived': ['United States', 'Spain', 'Germany'],

Note from the Author or Editor:
will fix

Shaahin Riazi  Apr 18, 2020 
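
On Python 3.7 or later, `json.loads` builds ordinary dicts, so the keys come back in the order they appear in the JSON text. The sketch below assumes the source string lists "places_lived" before "pet", which is what this report implies.

```python
import json

# Assumed key order in the JSON source.
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

result = json.loads(obj)

# Key order mirrors the order in the JSON text.
keys = list(result)
print(keys)
```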
Printed
Page 180
1st paragraph

Refers to the USDA Food Database example in Chapter 7; in second edition, this example is in Chapter 14.4 (page 436-442)

Note from the Author or Editor:
Fixing this reference to point to Ch 14

Laura Hughes  Jan 31, 2018  Sep 21, 2018
PDF
Page 182
first block of code for getroot

Code says the example file is in path:

path='examples/mta_perf/Performance_MNR.xml'

Actual path from git repository is:

path='datasets/mta_perf/Performance_MNR.xml'

Note from the Author or Editor:
Correct, thank you. This will need to be fixed in the source files

David Welden  Sep 25, 2017  Oct 20, 2017
PDF, ePub
Page 184
Link to Apache Arrow in 'Feather' Section

URL for 'Apache Arrow' points to 'apache.arrow.org' instead of 'arrow.apache.org'

Joel A  Oct 08, 2017  Oct 20, 2017
Printed
Page 184
last paragraph & [92]

In[92]: frame = pd.DataFrame({'a': np.random.randn(100)})
fails:
ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing
-------------------------
Although PyTables is mentioned in the previous text, there is no indication that this library needs to be installed.

I tried "pip install PyTables", but it failed with:
Collecting PyTables
Could not find a version that satisfies the requirement PyTables (from versions: )
No matching distribution found for PyTables


So, I'm still without PyTables and don't know how to get it.

Note from the Author or Editor:
will fix

Gregory Sherman  Jan 09, 2019 
Printed
Page 185
In [96]

The command "store" at this point produces only the first two lines of the indicated output - the rest is not produced.

Note from the Author or Editor:
I think this was caused by a malformed build where the "mydata.h5" persisted between builds. I will see that it's fixed

John Boersma  Nov 11, 2018 
PDF
Page 185
First sentence of final paragraph

Text reads "...how they can sunit your needs"

Should be "...how they can suit your needs"

Note from the Author or Editor:
This typo is fixed in the final 2nd edition

David Welden  Sep 25, 2017  Sep 25, 2017
Printed
Page 186-187
Under heading on 186, second code block on 187

This is not so much an error, per se, but a comment on a "may" clause in the book. I'm writing this in case you like to track these sorts of issues.

On page 186, the text says "Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. You may need to install these manually with pip or conda."

This is very true as the example line on the next page (187) "writer = pd.ExcelWriter('examples/ex2.xlsx')" threw an error on my system. I'm using pandas 0.21.0 within a python 3.6.2 virtual environment.

Manually installing the packages in question via pip solved my problems.

Thanks!

Note from the Author or Editor:
I'm changing the language to say "These must be installed separately"

Jim Sam  Dec 08, 2017  Sep 21, 2018
Printed
Page 186
[105]

The text and command conflict:
"Data stored in a sheet can then be read into DataFrame with parse:

In [105]: pd.read_excel(xlsx, 'Sheet1')"

Note from the Author or Editor:
will fix

Gregory Sherman  Jan 09, 2019 
PDF
Page 192
Table 7-1

The left column is named as "Argument", which should be "Method".

Note from the Author or Editor:
Making suggested change

Noritada Kobayashi  Nov 25, 2017  Sep 21, 2018
Printed
Page 195
In [35]

In [35]: _ = df.fillna(0, inplace=True)

The assignment "_ =" is unnecessary.

Note from the Author or Editor:
will fix

Gregory Sherman  May 11, 2019 
Printed
Page 204
middle

In [85]: data = np.random.randn(20)

In [86]: pd.cut(data, 4, precision = 2)
.
.
.
The precision = 2 option limits the decimal precision to two digits.

-----------

However, one of the bins I get is (0.031, 0.27]

Note from the Author or Editor:
will investigate

Gregory Sherman  Jan 10, 2019 
Printed
Page 206
last sentence

"Calling permutation with the length of the axis you want to permute ..."

According to what I have seen (an example is below) it seems that the phrase should be "the length of axis 0" or "the number of rows". Calling permutation() with the number of columns can result in rows being dropped or an IndexError.
The question arises: can permutation() or a similar function randomly order columns?


In [225]: df=DataFrame(np.arange(12).reshape((4,3)))

In [226]: df
Out[226]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11

In [227]: s=np.random.permutation(4)

In [228]: df.take(s)
Out[228]:
0 1 2
2 6 7 8
1 3 4 5
0 0 1 2
3 9 10 11

In [229]: s=np.random.permutation(3)

In [230]: df.take(s)
Out[230]:
0 1 2
1 3 4 5
0 0 1 2
2 6 7 8

.
.
.
In [258]: df=DataFrame(np.arange(12).reshape((3,4)))

In [259]: s=np.random.permutation(4)

In [260]: df.take(s)
.
.
.

IndexError: indices are out-of-bounds

Note from the Author or Editor:
will add example of permuting columns

Gregory Sherman  Jan 10, 2019 
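
On the question of shuffling columns: one possible approach is to build the permutation from the number of columns and pass `axis=1` to `take`. This is a sketch of that idea, not necessarily the example the author will add.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)))

# Permute rows: the permutation length must match the number of rows.
row_order = np.random.permutation(df.shape[0])
shuffled_rows = df.take(row_order)

# Permute columns: use the number of columns and axis=1.
col_order = np.random.permutation(df.shape[1])
shuffled_cols = df.take(col_order, axis=1)

print(shuffled_rows)
print(shuffled_cols)
```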
PDF
Page 208
1st paragraph of a section named "Computing Indicator/Dummy Variables"

The paragraph says "Let’s return to an earlier example DataFrame". However, since that example is contained in section 8.2 in the 2nd edition, "earlier" is not an appropriate word.

Note from the Author or Editor:
Fixing language to "Let's consider an example DataFrame..."

Noritada Kobayashi  Nov 27, 2017  Sep 21, 2018
Printed
Page 209
In [115]

The parameter "engine='python'" is needed in this command. Without this, a ParserWarning is produced due to the two character separator.

John Boersma  Nov 11, 2018 
PDF
Page 213
Table 7-3

The left column is named as "Argument", which should be "Method".

Note from the Author or Editor:
Making suggested change

Noritada Kobayashi  Nov 25, 2017  Sep 21, 2018
Printed
Page 213
Table 7-3

The method "strip" is described as "equivalent to x.strip()". Isn't it exactly the same thing, not just equivalent?

John Boersma  Nov 11, 2018 
PDF
Page 217
bottom

It's really not clear what In [176]: matches.str.get(1) is supposed to be returning here. Similarly with In [177]: matches.str[0] and matches.str[0].

I would expect to be shown a method to retrieve the regex matched groups for each email address string, but this clearly isn't what happens with this syntax. Was something else meant?

Note from the Author or Editor:
I am fixing this example. The erratum was reported by many others

Anonymous  Mar 09, 2018  Sep 21, 2018
PDF
Page 219
Table 7- 5

The book says: "match - Use re.match with the passed regular expression on each element, returning matched groups as list"

Should say: "... returning Series/array of boolean values"

And commands on pp 217 - 218 are not correct, because they return boolean values and there is no "access elements" at all.
Instead of:
In [174]: matches = data.str.match(pattern, flags=re.IGNORECASE)
In [175]: matches
Out[175]:
Dave True
Rob True
Steve True
Wes NaN
dtype: object

In [176]: matches.str.get(1)
Out[176]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64

In [177]: matches.str[0]
Out[177]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64

it may be better to use:
In [174]: matches = data.str.extract(pattern, flags=re.IGNORECASE)
In [175]: matches
Out[175]:
0 1 2
Dave dave google com
Rob rob gmail com
Steve steve gmail com
Wes NaN NaN NaN

In [176]: matches[0]
Out[176]:
Dave dave
Rob rob
Steve steve
Wes NaN
Name: 0, dtype: object

In [177]: matches.iloc[:, 0]
Out[177]:
Dave dave
Rob rob
Steve steve
Wes NaN
Name: 0, dtype: object
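
The difference between the two methods can be sketched in a self-contained form. The data and pattern here are assumptions modeled on the example under discussion, not a copy of the book's listing.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical data and pattern modeled on the example discussed above.
data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# str.match only reports whether each element matches the pattern.
matched = data.str.match(pattern, flags=re.IGNORECASE)

# str.extract returns the captured groups as DataFrame columns 0, 1, 2.
groups = data.str.extract(pattern, flags=re.IGNORECASE)
print(matched)
print(groups)
```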

Note from the Author or Editor:
This behavior changed in pandas. I'm correcting the code examples and the language in the text

Andrey Dubinchak  Dec 27, 2017  Sep 21, 2018
Printed
Page 219
7.4 Conclusion

"Effective data preparation can significantly improve productive by ..." should read "Effective data preparation can significantly improve productivity by ..."

Note from the Author or Editor:
Fixing the typo

Francis Lewis  Jan 10, 2018  Sep 21, 2018
Printed
Page 224
The line before the section "Reordering and Sorting Levels"

The code

MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                       names=['state', 'color'])

should be

pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                          names=['state', 'color'])

Note from the Author or Editor:
Adding the "pd."

Klaus Wang  May 17, 2018  Sep 21, 2018
PDF
Page 229
Table 8-1 Different join types with how argument

Final entry in table is 'output' join.
It should be 'outer' join.

Note from the Author or Editor:
This needs to be changed from "output" to "outer"

David Welden  Sep 26, 2017  Oct 20, 2017
Printed
Page 237
In [86]

The command as it stands produces a FutureWarning. Either sort=True or sort=False should be added as parameters.

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
PDF
Page 241
Final paragraph

The example of Series method combine_first is a bit vague. Although it apparently produces the desired output, the choice of b[:-2] and a[2:] for arguments is not obvious. It appears that it was chosen in order to reorder the index as well as combining data values, but this is not explained.

Note from the Author or Editor:
I am changing the code example to omit the slicing, and instead make "a" and "b" have their index labels in different order. This will definitely be clearer to the reader. Thanks for pointing this out

David Welden  Sep 27, 2017  Oct 20, 2017
Printed
Page 242
code examples with combine_first

The operation at the bottom of page 241:
In [112]: np.where(pd.isnull(a), b, a)
will take elements from a where available and from b where not available in a.

The analogous operation using combine_first should then probably be:
a.combine_first(b)
rather than:
b.combine_first(a)
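
A small check of this claim, using hypothetical Series that share one index (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical Series sharing one index; values are illustrative only.
idx = ["a", "b", "c", "d"]
a = pd.Series([np.nan, 2.5, np.nan, 4.5], index=idx)
b = pd.Series([0.0, 1.0, 2.0, np.nan], index=idx)

# Take values from a where present, otherwise fall back to b.
via_where = pd.Series(np.where(pd.isnull(a), b, a), index=idx)
via_combine = a.combine_first(b)
print(via_combine)
```

With these inputs, b.combine_first(a) gives a different result, because it prefers values from b.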

Note from the Author or Editor:
will fix

Artem Glebov  Dec 26, 2018 
Printed
Page 242
Third example on the page "In [93]:"

The example describes the use of the optional argument "join_axes". As of 4/5/21 this argument has been deprecated and removed, and it now results in a TypeError. It can be replaced with the reindex method.
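
One possible replacement, sketched with hypothetical Series (the names s1 and s4 and the labels are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical Series; labels are chosen only to illustrate the idea.
s1 = pd.Series([0, 1], index=["a", "b"])
s4 = pd.Series([0, 1, 5, 6], index=["a", "b", "f", "g"])

# Older pandas: pd.concat([s1, s4], axis=1, join_axes=[["a", "c", "b", "e"]])
# Sketch of an equivalent: concatenate, then reindex the rows explicitly.
result = pd.concat([s1, s4], axis=1).reindex(["a", "c", "b", "e"])
print(result)
```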

Note from the Author or Editor:
Confirmed. Will fix in 3rd edition

Dennis L Gonzales  Apr 06, 2021 
PDF
Page 244
After Out[131]:


In [132] and Out[132] repeat In [131] and Out[131].

In [132] and Out[132] should be removed.

Note from the Author or Editor:
Confirmed. Will remove

Shaahin Riazi  Oct 08, 2020 
Printed
Page 255
In [18]

As it stands, this line produces a "MatplotlibDeprecationWarning: In future re-calling will create a new instance." Best to revise to avoid a warning.

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
Printed
Page 255
explanation of [17]

"In IPython, an empty plot window will appear"

No window appeared in 7.0.1 after running [11], "%matplotlib", [16], [17], [18] and [19]

Note from the Author or Editor:
will fix

Gregory Sherman  Jan 14, 2019 
Printed
Page 259
Middle

In:

subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None)

None does not adjust. Use 0.
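
A minimal sketch of the point, assuming a 2 x 2 grid of subplots and a non-interactive backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)

# None leaves a setting unchanged; 0 removes the spacing between subplots.
fig.subplots_adjust(wspace=0, hspace=0)
print(fig.subplotpars.wspace, fig.subplotpars.hspace)
```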

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
PDF
Page 274
after Figure 9-17

Although the text mentions the "tipping dataset used earlier in the book", the tipping dataset does not seem to be used earlier. That dataset is used first in this section and later in Ch. 10.

Note from the Author or Editor:
This is also used in chapter 9, but the language there was also incorrect. I am tweaking the language in both chapters 9 and 10 to reflect that these are the first times that readers will have seen this dataset

Noritada Kobayashi  Nov 11, 2017  Sep 21, 2018
PDF
Page 279
After Figure 9-22.

"distplot" method has been deprecated and removed in newer versions.

Note from the Author or Editor:
Thanks. Will fix

Shaahin Riazi  Oct 22, 2020 
PDF
Page 283
In [108]: And In[109]:


The `factorplot` function has been renamed to `catplot`.

Note from the Author or Editor:
Confirmed. Will fix in 3rd edition

Shaahin Riazi  Oct 22, 2020 
PDF
Page 300
In [66]:

result = grouped['tip_pct', 'total_bill'].agg(functions) needs an extra pair of []

Correct: result = grouped[['tip_pct', 'total_bill']].agg(functions)
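
A runnable sketch of the corrected call. The small DataFrame is a made-up stand-in for the tips data, and the list of functions is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the tips data; values are made up.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip_pct": [0.10, 0.20, 0.15, 0.25],
})
grouped = tips.groupby(["day", "smoker"])
functions = ["count", "mean", "max"]

# Select several columns with a list (double brackets) before aggregating.
result = grouped[["tip_pct", "total_bill"]].agg(functions)
print(result)
```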

Note from the Author or Editor:
Confirmed. Will fix in 3rd edition based on latest version of pandas

Shaahin Riazi  Oct 30, 2020 
Printed
Page 301
2nd Paragraph

"indepedently" should be "independently".

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
Printed
Page 311
Top code block

"for suit in ['H','S','C','D']: " should be "for suit in suits:". Otherwise, there is no point in defining "suits" earlier in the code block.

Note from the Author or Editor:
will fix
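
A sketch of how the corrected block might read. Only the loop line and the name suits come from the report; card_val, base_names, cards and deck are assumptions about the surrounding code.

```python
import pandas as pd

suits = ["H", "S", "C", "D"]
# Assumed surrounding definitions, included only to make the loop runnable.
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]

cards = []
for suit in suits:  # reuse the list defined above instead of repeating the literal
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)
print(deck[:13])
```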

John Boersma  Nov 11, 2018 
Printed
Page 335
ts.shift(1, freq='90T') example

This method with 90T parameter should lag the data by 90 minutes at 90 min frequency. Instead, it seems to preserve the monthly frequency and only lag every timestamp by 1:30hr. Am I reading this correctly or is this by design? Clarification would be helpful.

Note from the Author or Editor:
I will add a note to the text to clarify that the "freq" parameter does not change the frequency of the data (if any)
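
The behavior can be checked with a short sketch. The month-end series is hypothetical, and '90min' is used in place of '90T', an older alias for minutes.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end series, loosely modeled on the example discussed.
index = pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31", "2000-04-30"])
ts = pd.Series(np.arange(4.0), index=index)

# Every timestamp moves forward by 90 minutes; the values and their spacing stay the same.
shifted = ts.shift(1, freq="90min")
print(shifted)
```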

Serge  Jan 25, 2018  Sep 21, 2018
Printed
Page 339
First whole paragraph

"EST" should be "Eastern Time". The point is that the interval straddles the standard time - daylight savings time boundary.

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
Printed
Page 340
Source code showing Timestamp arithmetic before the DST transition

In the source code that shows arithmetic before the DST transition,
the book uses '2012-3-12 01:30', tz='US/Eastern'.

However, in 2012 DST in US/Eastern started on 2012-3-11, so the code does not show arithmetic across the DST transition, which may confuse readers.

The first edition of this book used '2012-03-11', not '2012-03-12', and was correct.
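
A minimal sketch contrasting the two dates, assuming time zone data for US/Eastern is available; the variable names are illustrative.

```python
import pandas as pd
from pandas.tseries.offsets import Hour

# 2012-03-11 is the date of the spring-forward transition in US/Eastern.
stamp = pd.Timestamp("2012-03-11 01:30", tz="US/Eastern")
after = stamp + Hour()

# One day later, the same arithmetic no longer crosses the transition.
next_day = pd.Timestamp("2012-03-12 01:30", tz="US/Eastern") + Hour()
print(after, next_day)
```

On the 11th, adding one hour to 01:30 lands on 03:30 local time; on the 12th it lands on 02:30.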

Note from the Author or Editor:
Confirmed. Fixing

Masato Setoyama  Mar 02, 2018  Sep 21, 2018
Printed
Page 347
[197] - [199]

"To convert back to timestamps, use to_timestamp:"

There is no apparent change to the Series 'ts' by [197] & [199] - what is being demonstrated?

Note from the Author or Editor:
will clarify

Gregory Sherman  Jan 19, 2019 
Printed
Page 351
Table 11-5, last row

convention defaults to 'start', not 'end'.

Note from the Author or Editor:
Fixing.

Hengni Cai  Mar 29, 2018  Sep 21, 2018
Printed
Page 352
1st and 2nd code examples.

The 2 code examples are the same.

In[216]: ts.resample('5min', closed='right').sum()

In[217]: ts.resample('5min', closed='right').sum()

216 should be WITHOUT the `closed='right'`
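
A sketch of the two calls side by side, assuming a series of twelve one-minute observations (an assumption about the example's data):

```python
import numpy as np
import pandas as pd

# Assumed data: twelve one-minute observations with values 0..11.
rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

default_bins = ts.resample("5min").sum()                # bins closed on the left
right_bins = ts.resample("5min", closed="right").sum()  # bins closed on the right
print(default_bins)
print(right_bins)
```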

Note from the Author or Editor:
Confirmed. Will fix

Charbel Sarkis  Sep 27, 2018 
Printed, PDF, ePub
Page 358
Figure 11-5

Fig 11-5 caption says: Apple 250-day daily return standard deviation.

However the calc is based on price, so it's the price standard deviation, which is not really what one looks at usually.

The correct call to plot the return standard deviation (add pct_change()) would be (e.g.):

close_px.AAPL.pct_change().rolling(252, min_periods=int(252/2)).std().plot()

Standard in finance is to show the annualized vol, which would be:
(close_px.AAPL.pct_change().rolling(252, min_periods=int(252/2)).std()*np.sqrt(252)).plot()
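
The same calculation as a self-contained sketch without plotting. The synthetic price series is an assumption standing in for close_px.AAPL.

```python
import numpy as np
import pandas as pd

# Synthetic prices stand in for close_px.AAPL; the numbers are made up.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2010-01-01", periods=600)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 600))), index=dates)

window = 252
returns = prices.pct_change()
# The built-in int is used because newer NumPy releases no longer provide np.int.
daily_vol = returns.rolling(window, min_periods=int(window / 2)).std()
annual_vol = daily_vol * np.sqrt(window)
print(annual_vol.dropna().head())
```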

Note from the Author or Editor:
will fix

Anonymous  May 23, 2019 
Printed
Page 370
In [50]

The use of outer parentheses to facilitate line breaks, which is explained on page 381, should really be explained here at the first use.

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
Printed
Page 378
Text

The meaning of "unwrapped" here is really unclear. Does this refer to an internal process? The example is the same as on page 376, where "unwrapped" is not mentioned.

Also, is "fast past" correct? Not sure what this means. Should it be "fast pass" or "fast path"?

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
Printed
Page 384
1st Paragraph

The class 'pandas.TimeGrouper' does not exist anymore. It has been replaced by 'pandas.Grouper'. The code should be changed to the following:

time_key = pd.Grouper(freq='5min')
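
A sketch of the replacement in context. The DataFrame below is hypothetical, built only so the grouping can run.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three keys observed once per minute.
N = 15
times = pd.date_range("2017-05-20 00:00", freq="1min", periods=N)
df = pd.DataFrame({"time": times.repeat(3),
                   "key": np.tile(["a", "b", "c"], N),
                   "value": np.arange(N * 3.0)})

# Group by key and by 5-minute bins of the time index.
time_key = pd.Grouper(freq="5min")
resampled = df.set_index("time").groupby(["key", time_key]).sum()
print(resampled)
```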

Note from the Author or Editor:
Confirmed. Will fix

Ben B  Sep 17, 2020 
Printed
Page 390
In [38]

Need parameter "rcond=None" to suppress FutureWarning.
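
A minimal sketch of the call with the extra parameter, using made-up data in place of the book's arrays:

```python
import numpy as np

# Made-up data: a noise-free linear target, so the fit is exact.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = X @ np.array([2.0, -1.0])

# rcond=None opts in to the newer default cutoff and avoids the FutureWarning.
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```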

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
PDF
Page 437
Between 1st paragraph and 2nd paragraph

After the last sentence, "Then, these can be concatenated together with concat:", some Python code appears to be missing. The code can be found in https://github.com/wesm/pydata-book/blob/2nd-edition/ch14.ipynb and is reproduced below:

nutrients = []

for rec in db:
    fnuts = pd.DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)

Note from the Author or Editor:
Thanks -- I am restoring the code to the text (it was being accidentally suppressed in the output)

Haruyoshi TAKIGUCHI  Apr 03, 2018  Sep 21, 2018
PDF
Page 452
1st paragraph

The paragraph states that "the result is shown in Figure A-3", but Figure A-3 is "illustration", not "result" (just a cosmetic issue).

Note from the Author or Editor:
Changing language to "this is illustrated in Figure A-3"

Noritada Kobayashi  Nov 26, 2017  Sep 21, 2018
PDF
Page 467
center of the page

The paragraph states that "the output of outer will have a dimension that is the sum of the dimensions of the inputs".

Since the result of outer for (3, 4) and (5,) is (3, 4, 5), is it better to replace the word "sum" with "concatenation"?
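
The shape rule can be checked directly. This sketch uses np.subtract.outer with zero-filled arrays; the choice of ufunc is an assumption, and any binary ufunc's outer method should behave the same way.

```python
import numpy as np

x = np.zeros((3, 4))
y = np.zeros(5)

# The result's shape is x.shape followed by y.shape.
result = np.subtract.outer(x, y)
print(result.shape)  # (3, 4, 5)
```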

Note from the Author or Editor:
Making suggested change

Noritada Kobayashi  Nov 26, 2017  Sep 21, 2018
PDF
Page 473
Code example 188

It would be better to make the zipped result in the last code example prettier. It currently reads:

In [188]: zip(last_name[sorter], first_name[sorter])
Out[188]: <zip at 0x7fa203eda1c8>
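
Wrapping the call in list(...) materializes the pairs. A sketch with hypothetical name arrays, assuming sorter comes from np.lexsort:

```python
import numpy as np

# Hypothetical name arrays; sorter is assumed to come from np.lexsort.
first_name = np.array(["Bob", "Jane", "Steve", "Bill", "Barbara"])
last_name = np.array(["Jones", "Arnold", "Arnold", "Jones", "Walters"])

# lexsort uses the last key as the primary sort key: last name, then first name.
sorter = np.lexsort((first_name, last_name))

# list(...) consumes the lazy zip object so the pairs are displayed.
pairs = list(zip(last_name[sorter], first_name[sorter]))
print(pairs)
```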

Note from the Author or Editor:
Adding "list(...)" to make the example prettier

Noritada Kobayashi  Nov 27, 2017  Sep 21, 2018
Printed
Page 479
[214] through [215]

In [214]: numba_mean_distance = nb.jit(mean_distance)

We could also have written this as a decorator:

@nb.jit
def mean_distance(x, y):
.
.
.
In [215]: %timeit numba_mean_distance(x, y)

To be consistent, I would make the definition begin with "def numba_mean_distance(x, y):"

Note from the Author or Editor:
will do

Gregory Sherman  Feb 01, 2019 
Printed
Page 482
Top

"mmap" is a fairly large file on disk. It would be good to add a command to delete it when done here.

Note from the Author or Editor:
will fix
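
One way the cleanup could look, as a sketch. The file name, location and array shape are assumptions, kept small for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical file name and a small shape, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "mymmap")
mmap = np.memmap(path, dtype="float64", mode="w+", shape=(1000, 100))
mmap[:5] = 1.0
mmap.flush()     # write pending changes to disk

del mmap         # drop the reference so the mapping can be released
os.remove(path)  # delete the backing file once it is no longer needed
```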

John Boersma  Nov 11, 2018 
Printed
Page 483
[230] and [231], plus preceding text

"In this example, summing the rows of these arrays should, in theory, be faster for arr_c than arr_f ..."

Runs on my Windows 10 PC consistently show the opposite, like:

In [46]: %timeit arr_c.sum(1)
1.65 ms ± 9.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [47]: %timeit arr_f.sum(1)
994 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


I have carefully checked: "C_CONTIGUOUS : True" for arr_c and "F_CONTIGUOUS : True" for arr_f

Any idea what's going on?

Note from the Author or Editor:
will improve the example

Gregory Sherman  Jan 30, 2019 
Printed
Page 483
preceding text and [230] and [231]

[more on same issue]

On my PC, I found that sum(0) runs faster on arr_c :

In [17]: %timeit arr_c.sum(0)
953 µs ± 9.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [18]: %timeit arr_f.sum(0)
1.6 ms ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


I wonder if the output in [230] and [231] does not actually result from what was built in [225] and [226].

Note from the Author or Editor:
will review and make the difference more stark / consistent

Gregory Sherman  Feb 02, 2019 
PDF
Page 485
The first paragraph and code

Original:

Since the input variables are strings they can be executed again with the Python exec keyword:
In [30]: exec(_i27)

I propose the following:

Since the input variables are strings they can be evaluated again with the Python eval keyword:
In [30]: eval(_i27)
Out[30]: 'bar'



It looks like "exec" does not make sense in this context because _i27 is not a statement or a block of code.
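
A plain-Python sketch of the difference, with a hypothetical string standing in for the stored input _i27:

```python
foo = "bar"
source = "foo"  # stands in for a stored input string such as _i27

value = eval(source)    # evaluates the expression and returns its value
nothing = exec(source)  # runs the code but always returns None
print(value, nothing)
```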

Note from the Author or Editor:
It's not a mistake but "eval" makes the example more illustrative. Changing

Haruyoshi TAKIGUCHI  Apr 28, 2018  Sep 21, 2018
Printed
Page 487
Scorpion comment

The comment that deleting a variable does not free up memory appears to be incorrect. After using del I had a decrease in memory used on my mac as shown on activity monitor.

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
Printed
Page 491
Middle

"works_fine" method should be "works_fine function".

Note from the Author or Editor:
will fix

John Boersma  Nov 11, 2018 
PDF
Page 494
The first code quote in the section "Basic Profiling: %prun and %run -p"

Found two syntax errors under Python 3.

1)
for _ in xrange(niter):
needs to be replaced by something like
for _ in range(niter):

2)
print 'Largest one we saw: %s' % np.max(some_results)
needs to be replaced by something like
print('Largest one we saw: {0}'.format(np.max(some_results)))

Note from the Author or Editor:
Fixing this

Haruyoshi TAKIGUCHI  Apr 08, 2018  Sep 21, 2018
Printed
Page 495
In [561] and In [562]

Reported Wall times are way off. More like 250ms and 100ms.

Note from the Author or Editor:
will add comment

John Boersma  Nov 11, 2018 
Other Digital Version
2255
Functions Are Objects (section)

In the Amazon Kindle version, Chapter 3, section "Functions Are Objects", the text explains that the code:

import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result

should clean the data
FROM:
states = [ ' Alabama ', 'Georgia!' , 'Georgia', 'georgia', 'Fl0rida', ... ]
TO:
['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida', ... ]

When running the code, none of the methods change 'Fl0rida' to 'Florida' as mentioned in the text. All the other entries are cleaned correctly.
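
The report can be reproduced with a short sketch. It uses only the entries quoted above, leaving out the elided ones.

```python
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()               # remove surrounding whitespace
        value = re.sub("[!#?]", "", value)  # remove the listed punctuation
        value = value.title()               # capitalize each word
        result.append(value)
    return result

# Only the entries quoted in the report; the elided ones are left out.
states = [" Alabama ", "Georgia!", "Georgia", "georgia", "Fl0rida"]
cleaned = clean_strings(states)
print(cleaned)
```

None of the three steps replaces the digit 0, so the last entry cannot come out as 'Florida'.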

Note from the Author or Editor:
Thanks. I will fix the code example.

Kyle Jeffreys  May 16, 2020 
Mobi
Page 2621

"For large DataFrames, the head method is useful to get see the first 5 rows:"

'get' should be removed

Bridgeland  Mar 29, 2017  Sep 25, 2017