Errata

Errata for Python for Data Analysis, Second Edition

This list records errors found after the product was released, together with their corrections. If an error was corrected in a later version or reprint, the date of the correction appears in the "Date corrected" column.

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version	Location	Description	Submitted By	Date submitted	Date corrected
	Chapter 2 Subsection :- Duck Typing	"this means it has a __iter__ “magic method,” though an alternative" The comma after method should actually be after ". Note from the Author or Editor: Changing this in the source material	Naman Bhalla	Nov 11, 2017	Sep 21, 2018
	Ch11 subsection "Converting between string and datetime"	In the part discussing converting datetime objects from strings, you say that strptime uses the same format codes as strftime, but that's not quite right: value = '2011-01-03' stamp = datetime.strptime(value, '%Y-%m-%d') # works datetime.strptime(value, '%F') # ValueError: 'F' is a bad directive in format '%F' datetime.strftime(stamp, '%F') # works Note from the Author or Editor: Quite right. Fixing the language to say "many of the same"	Alex Branham	Dec 04, 2017	Sep 21, 2018
	?? Integer Indexes Section Paragraph 4	In the Integer Indexes section of Chapter 5 the following paragraph is ambiguous: "To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):" To test this out I defined the following object: `ser3 = Series(np.arange(4.), index=['a', 'b', -1, 34])` and ran these two commands, both of which return 2.0: `ser3[-1]` `ser3[-2]` `ser3.index` gives me "Index(['a', 'b', -1, 34], dtype='object')" So, I think you could argue that the way Pandas actually works has some ambiguity to it and that the way the book describes it is the way it SHOULD work. But to describe the actual way this part of Pandas works, the following paragraph would be more accurate: "To keep things consistent, if you have an axis index containing exclusively integers (mixed indexes will match on label first, and fall back to positional indexing), data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):" Or something of that nature. Note from the Author or Editor: will improve	Bob McDonald	Dec 16, 2018
Mobi		Current: "Afer this operation, the variable a is unmodified:" Suggested: "After this operation, the variable..."	O'Reilly Media	Sep 17, 2019
Other Digital Version	location 1741 top	Found this error on the kindle version, location 1741. the line: In[76]: seq[3:4] = [6,3] should be: In[76]: seq[3:4] = [6] Note from the Author or Editor: Confirmed. Will fix	Ravi	Nov 18, 2019
	Page 49, section 2.3.2.7, Table 2-5: Datetime format specification	Table 2-5 (datetime format specification) describes %F as a "shortcut for %Y-%m-%d (e.g., 2012-4-18)". The example should actually be 2012-04-18, because %m is a two-digit month. Note from the Author or Editor: fixing in text	chengjq	Apr 18, 2022
	Page Ch. 1. Installing Necessary Packages Note 2, installing packages into conda environment	In Note 2, Ch. 1. Installing Necessary Packages: Installing packages into conda environment uses activate instead of install: Should be conda install lxml beautifulsoup4 html5lib tables openpyxl / requests sqlalchemy seaborn scipy statsmodels / patsy sklearn	Alexander de la Paz	Jul 24, 2022
	Ch5 Indexing, selection, and filtering; Table 5-6. Indexing options with DataFrame	`df.iloc[where]` Selects single row or subset of rows from the DataFrame by label. Should probably be "...from the DataFrame by integer position."	Yung-Jin (Joey) Hu	Feb 14, 2017	Sep 25, 2017
Other Digital Version	Ch5 Integer Indexes; 4th paragraph	"an axis index containing itnegerse, data selection" "integers" is spelled incorrectly.	Yung-Jin (Joey) Hu	Feb 14, 2017	Sep 25, 2017
	Ch5 Handling Missing Data; 2nd paragraph	"The way that missing data is represented in pandas object is somewhat imperfect, but it is functional for a lot of usres." users is spelled incorrectly.	Yung-Jin (Joey) Hu	Feb 14, 2017	Sep 25, 2017
	Ch5 Sorting and ranking; within the code examples	It looks like the `by` argument to `sort_index` is deprecated in favor of `.sort_values(by=...)`. In [203]: frame.sort_index(by='b') FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...) In [207]: frame.sort_index(by=['a','b']) FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...) Using In [205]: frame.sort_values(by='b') fixed the problem. In [211]: pd.__version__ Out[211]: '0.19.2'	Yung-Jin (Joey) Hu	Feb 14, 2017	Sep 25, 2017
	Ch5 Summarizing and Computing Descriptive Statistics; code block 3	The input variable df is: In [187]: df Out[187]: one two a 1.40 NaN b 7.10 -4.5 c NaN NaN d 0.75 -1.3 The code in the book gives this result: In [204]: df.sum(axis=1) Out[204]: a 1.40 b 2.60 c 0.00 d -0.55 dtype: float64 but shouldn't row "c" be "NaN" since we're summing together two NaNs? Here's what I get from my interpreter: In [186]: df.sum(axis=1) Out[186]: a 1.40 b 2.60 c NaN d -0.55 dtype: float64 In [188]: pd.__version__ Out[188]: '0.19.2'	Yung-Jin (Joey) Hu	Feb 14, 2017	Sep 25, 2017
	Ch6 Note within "Indentation, not braces" section	"I strongly recommend that you use 4 spaces to as your default indentation..." Should probably be: "I strongly recommend that you use 4 spaces as your default indentation..." by removing the word "to" before the second half of the sentence "... as your default indentation".	Yung-Jin (Joey) Hu	Feb 26, 2017	Sep 25, 2017
	Ch6 Slicing Section, 3rd paragraph	"While element at the start index is included, the stop..." Should probably be: "While the element at the start index is included, the stop..."	Yung-Jin (Joey) Hu	Feb 26, 2017	Sep 25, 2017
	? Table 3-4	There are two rows in the table that describe the readlines function.	Daniel Walter	Aug 02, 2017	Sep 25, 2017
	Ch. 4 Table 4-2. NumPy data types	The fourth row of Table 4-2 reads "Signed and unsigned 32-bit integer types", but 32-bit is already covered by the third row. The fourth row must be changed to 64.	Kim, Jin	Sep 10, 2017	Sep 25, 2017
	3 Boolean Indexing 5th paragraph	The line of code: data[-(names == 'Bob')] gives the deprecation warning: DeprecationWarning: numpy boolean negative, the `-` operator, is deprecated, use the `~` operator or the logical_not function instead. This is with numpy version 1.12.0. Using the tilde operator, as recommended, silences the warning.	Yung-Jin (Joey) Hu	Jan 31, 2017	Sep 25, 2017
	3.1 ZIP section	Code works as it should, but the first name and last name are reversed for Curt Schilling. Need to be a Red Sox fan to pick up on this one. ('Schilling', 'Curt') should read ('Curt', 'Schilling') in the following code: In [96]: pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ....: ('Schilling', 'Curt')] Note from the Author or Editor: Confirmed. Will fix	Anonymous	Oct 01, 2018
	5 statsmodel section	I am not sure of the page number since I am using Safari books online which doesn't do pagination. In the statsmodel section of chapter 1, line 6, the word "grown" is misspelled (gornw).	Bala Ganeshan	Dec 10, 2016	Sep 25, 2017
PDF	Page 9 Last line of Windows discussion	Text states: To exit the shell, press Ctrl-D or type the command exit() and press return. On Windows, Ctrl-Z should be used.	David Welden	Sep 16, 2017	Sep 25, 2017
Printed	Page 17 2nd Paragraph	The reference to the IPython should be "Appendix B", not "Appendix A". Note from the Author or Editor: Confirmed. Will fix	John Boersma	Nov 11, 2018
	18 Top of second page of Chapter 2	The example uses 1.usa.gov data. This service has been shut down. It would be a pain to craft a whole new opening example, but you might want to. Even if you don't, you might want to let people know it's no longer online so they don't look for it. https://blog.usa.gov/decommissioning-1-usa-gov https://github.com/usagov/1.USA.gov-Data Note from the Author or Editor: You are right. I added a note that it is decommissioned.	John Transue	Dec 06, 2016	Sep 25, 2017
Printed	Page 19 command	$jupyter notebook fails under Windows 10 Command Prompt: "Error executing Jupyter command 'notebook': [Errno 'jupyter-notebook' not found] 2" version 4.4.0: "Available subcommands: kernel kernelspec migrate run troubleshoot" Trying 'run', there was no response - the command simply hung Note from the Author or Editor: will add clarifying comment how to install the notebook	Gregory Sherman	Jan 14, 2019
Printed	Page 29 top text and commands	Magic functions can be used by default without the percent sign ... Some magic functions behave like Pyton functions and their output can be assigned to a variable: In [22]: %pwd Out [22]: '/home/west/code/pydata-book/ In [23]: foo = %pwd ---------------------------------------------------------------------------------- First, a single quote is missing from Out[22] With ipython 6.3.1, although In [22] works using pwd without the leading percent sign, In [23] fails with "NameError: name 'pwd' is not defined" Note from the Author or Editor: Fixing this typo	Gregory Sherman	Apr 13, 2018	Sep 21, 2018
Printed	Page 29 first sentence	I previously reported this issue, but it's a problem beyond the typo that was addressed in the reply. "Magic functions can be used by default without the percent sign..." This is not completely true. For example, this variation of In[23] will not work: foo = pwd The % in front of the magic command can be skipped (by default) if the command is the first "word" on an IPython line. I have found that leading whitespace is not a problem. Note from the Author or Editor: Thanks. I will clarify that this is the only scenario where the % can be omitted	Gregory Sherman	Apr 23, 2019
Printed	Page 30 Figure 2-6	Running the matplotlib code exactly as printed inside Figure 2-6 gives a Type error: TypeError: float() argument must be a string or a number, not 'builtin_function_or_method' Note from the Author or Editor: Thank you. Will fix	James Shenton	Apr 15, 2020
PDF	Page 37-38 Table 2-3	Table 2-3. Binary operators Missing the modulo (%) operator. Note from the Author or Editor: Thanks. Will add for the 3rd edition	Ali Tobah	Sep 02, 2020
PDF	Page 38 table 2-3	inconsistent description of a <= b, a < b (compared to the next line), should be for a < b, a <= b Note from the Author or Editor: Confirmed, will fix (in 3rd edition)	B. Goas	Feb 13, 2019
PDF	Page 38 First paragraph of "Mutable and immutable objects"	Text says "modifiedK", should be "modified:"	David Welden	Sep 17, 2017	Sep 25, 2017
Printed	Page 39 Third paragraph under "Numeric Types"	"Integer division not resulting in a whole number will always yield a floating-point number." Actually, this is true for whole numbers too: In [1]: 4/2 Out [1] : 2.0 Note from the Author or Editor: Quite right. Will fix (in the 3rd edition)	John Boersma	Nov 11, 2018
Printed	Page 44 Last paragraph of "None" section	"but also a unique instance of NoneType" should be "but also the unique instance of NoneType". Note from the Author or Editor: Thanks. Will fix	John Boersma	Nov 11, 2018
Printed	Page 46 Code blocks in 2nd and 3rd paragraphs	Two illegal print statements: print('It's negative') Since the strings contain single quote characters, they should be delimited by double quotes. Note from the Author or Editor: Thank you, fixed	Michael Clark	Nov 05, 2017	Sep 21, 2018
Printed	Page 46 Table 2-5	2012-4-18 should be 2012-04-18 Note from the Author or Editor: Confirmed. Will fix	John Boersma	Nov 11, 2018
Printed	Page 47 2	2nd edition: under the "for loops" section, 1st line, "iterater" --> "iterator" (otherwise not consistent with pg 50 where 'iterator' is also mentioned). Note from the Author or Editor: Confirmed typo.	E G	Mar 08, 2020
PDF, ePub	Page 51 2nd sentence below 'pass' heading	" ... to be taken (or as a placeholder for code not yet implemetned); ..." "implemented" is spelled incorrectly.	Greg Graham	Jun 06, 2017	Sep 25, 2017
Printed	Page 56 First sentence	Sentence should read: "...which locates the first such value and removes it from the list..." Note from the Author or Editor: Thanks! Fixing the typo	Thomas Koundakjian	Nov 09, 2017	Sep 21, 2018
PDF	Page 60 4th line	Although it's never explicitly described as such, the output of the dictionary on this page is showing the dictionary as unordered. This would be incorrect, as the book is utilizing Python 3.6, the version in which dictionaries changed to insertion-ordered. Note from the Author or Editor: Quite right. I'll correct this with the 3rd edition changes	David Bankson	Jun 04, 2020
Printed	Page 61 Top	In the example demonstrating zip, you use the names of three pitchers: Nolan Ryan, Roger Clemens, and Curt Schilling. In the example, you use zip to show first names and last names; the first_names has ('Nolan', 'Roger', 'Schilling') and last_names has ('Ryan', 'Clemens', 'Curt') Curt is his first name and Schilling is his last name, so Curt should be in first_names and Schilling in last_names. Note from the Author or Editor: Fixing this mistake	Jon Ernster	Nov 28, 2017	Sep 21, 2018
PDF	Page 65 2nd	Hello my friend. This is related to usage and definition of a set with reference to the 'set' function in python. While you may or may not agree whether this is a minor technical mistake, it is a mistake in terms of accuracy/precision. While a set is an unordered collection of unique elements, set as defined in python seems to be a 'sorted unordered collection of unique elements.' Thus, depending on input for set(), the actual output would vary under those definitions -- subtle as they might be. Note from the Author or Editor: I will clarify.	E G	Mar 10, 2020
Printed	Page 66 Table 3-1. Python set operations	The alternative syntax for a.issubset(b) and a.issuperset(b) should be "<=" and ">=" respectively (not N/A). Note from the Author or Editor: Fixing this.	Daniel Andersson	Feb 03, 2018	Sep 21, 2018
Printed	Page 66 Last paragraph.	"Like dicts, set elements generally must be immutable." should be "Like dict keys, set elements generally must be immutable." Note from the Author or Editor: Confirmed, will fix (in 3rd edition). Thank you	John Boersma	Nov 11, 2018
Printed	Page 70 bottom of page	Suppose instead we had declared a as follows: a = [] def func(): for i in range(5): a.append(i) ======================================== The sentence implies that an explanation of what happens will follow, but there is none. Note from the Author or Editor: Good catch. I'm adding a code example to show how the alternate example works	Gregory Sherman	Apr 14, 2018	Sep 21, 2018
Printed	Page 80 Top line	"As you will see later in the chapter, you can step into the stack (using the %debug or %pdb magics)..." should start with "As you will see in Appendix B..." Note from the Author or Editor: Correct. Will fix	John Boersma	Nov 11, 2018
Printed	Page 82 bottom of page; first entry of Table 3-4	Method Description read([size]) Return data from a string, with optional size argument indicating the number of bytes to read ============================================================= The "number of bytes" assertion is contradicted on the next page - "Python reads enough bytes ... to decode that many characters" and in the read() docstring - "Read at most n characters from stream." Note from the Author or Editor: Adding language to indicate that whether bytes or unicode are read depends on the mode of the file	Gregory Sherman	Apr 15, 2018	Sep 21, 2018
	Page 89 3rd paragraph	I can run the following in a Jupyter notebook: import numpy as np my_arr= np.arange(100000) my_list = list(range(100000)) %time for _ in range(10): my_arr2 = my_arr2 However, if I run it in another Python IDE, it gives the following error. Can you help me with this? What is the syntax for other Python IDEs? Thanks. >>> time for _ in range(10): my_arr2 = my_arr2 File "<stdin>", line 1 time for _ in range(10): my_arr2 = my_arr2 ^^^ SyntaxError: invalid syntax Note from the Author or Editor: Adding clarification that %timeit only works within IPython or Jupyter	Anonymous	Mar 29, 2022
Printed	Page 100 Warning box mid page	The warning claims boolean selection will not fail if the boolean array is not the correct length. I think this was changed in Numpy 1.13, but is definitely not true in Numpy 1.14.2 For example: x = np.random.randn(5,5) y = np.array(['a','b','c', 'a', 'b', 'c', 'd', 'd', 'd']) x[y == 'a'] IndexError: boolean index did not match indexed array along dimension 0; dimension is 5 but corresponding boolean dimension is 9 Note from the Author or Editor: Removing the caution box	Mladen Kolovic	Apr 08, 2018	Sep 21, 2018
Printed	Page 103 3rd paragraph	1st release print copy says, “...the result of fancy indexing is always one-dimensional.” However, there are example outputs in this section with more than one dimension. Is that because some of the examples in the section are not fancy indexing? If that’s the case, it’s unclear where the section is building up to a fancy indexing example as opposed to every example being fancy indexing. The number of dimensions in the output seems to be the number of array dimensions plus one minus the number of dimensions indexed. Note from the Author or Editor: The text is unclear, I will clarify	Stephen Frost	Feb 19, 2018	Sep 21, 2018
PDF	Page 108 Table 4-3. Unary ufuncs	Missing term: "Natural logarithm (base e), log base 10, log base 2, and , respectively".	A. Jesse Jiryu Davis	May 23, 2017	Sep 25, 2017
Printed	Page 112,121 first sentence of 112. "Simulating ..." on 121	pg 112 first sentence: Here, arr.mean(1) means "compute mean across the columns" where arr.sum(0) means "compute sum down the rows" conflicts with pg. 121 "Simulating Many Random Walks at Once": "we can compute the cumulative sum across the rows" . . . In [262]: walks = steps.cumsum(1) Note from the Author or Editor: Thanks. Will clarify language	Gregory Sherman	Jan 04, 2019
Printed	Page 114 1	May want to specify that arr.mean(1) is the same as arr.mean(axis=1). The fewer assumptions the reader has to make, the better. Note from the Author or Editor: I agree this would be clearer. I'll clarify in the 3rd edition	Shivan Sivakumaran	Oct 03, 2020
PDF	Page 123 1st Paragraph	"operations" is misspelled at the location, "Using NumPy functions or NumPy-like oeprations..."	Ryan Shuhart	Jan 05, 2017	Sep 25, 2017
Printed	Page 126 first paragraph	The book states when you are only passing a dict, the index in the resulting Series will have the dict's key in sorted order. However, this is not always the case. Running the code on my system I have the output pasted below. Looking at the output we see that returned series is not in sorted order. sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000} obj3=pd.Series(sdata) obj3 Out[49]: Ohio 35000 Texas 71000 Oregon 16000 Utah 5000 dtype: int64 Note from the Author or Editor: Correct, pandas now obeys the insertion order of the dict. I will fix in the 3rd edition	Howard Smith	Aug 30, 2018
Printed	Page 126 2nd paragraph	"You can override this by passing the dict keys in the order you want them to appear in the resulting Series" However, given the [29]- [31] commands, the actual result is Out[31]: Oregon 16000.0 California NaN Texas 71000.0 Ohio 35000.0 Note from the Author or Editor: will review	Gregory Sherman	Jan 04, 2019
Printed	Page 128 Ch5.1: Introduction to pandas Data Structures: Series - 8th Para	The text says "When you are only passing a dict, the resulting Series will have the dict's keys in sorted order". This doesn't appear to be true, either with the example given in the book, or with a repro (which proves the example is not the error). These keys seem unsorted to me when only passing in a dict. >>> import pandas as pd >>> sdata = { 'Ohio':35000, 'Texas':71000, 'Oregon':16000,'Utah': 5000 } >>> obj3 = pd.Series(sdata) >>> obj3 Ohio 35000 Texas 71000 Oregon 16000 Utah 5000 dtype: int64 >>> obj3.index Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object') Note from the Author or Editor: Confirmed that pandas now respects the "insertion order" of keys when creating a Series from a dictionary. Will fix in the 3rd edition	Gavin Draper	Mar 16, 2021
Printed	Page 138 In[104]	As presented, this line leads to "FutureWarning: Passing list-likes to .loc with missing label will raise KeyError in future." Revise to avoid warning. Note from the Author or Editor: Confirmed, will fix	John Boersma	Nov 11, 2018
PDF	Page 141 Sentence beginning with word Setting in italics	Word "section" is misspelled as "sectino"	Anonymous	Sep 20, 2017	Sep 25, 2017
PDF	Page 145 last paragraph	"To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented." I think this should be integer-oriented. Note from the Author or Editor: The code example does not illustrate the intended behavior. I am changing the example to be "ser[-1]" instead of "ser[:2]" and added a note that slicing with integers ignores the integer labels	Yang Yang	Oct 18, 2017	Sep 21, 2018
Printed	Page 145, 146 final paragraph & code following	[similar to a previously reported issue] "... if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) ..." In [147]: ser[:1] Out[147]: 0 0.0 dtype: float64 In[148]: ser.loc[:1] Out[148]: 0 0.0 1 1.0 dtype: float64 In[149]: ser.iloc[:1] Out[149]: 0 0.0 dtype: float64 The series "ser" is indexed by integers, so - according to the text - data selection should be label-oriented (in the absence of loc or iloc). However, Out[147] is identical to Out[149], which results from using iloc, so the "ser[:1]" data selection appears to be integer-oriented. Note from the Author or Editor: will clarify	Gregory Sherman	Apr 30, 2019
Printed	Page 158 In [233]	Row 'c' is populated with 0.0, not NaN. Note from the Author or Editor: Confirmed. Will fix in the 3rd edition revision.	John Boersma	Nov 11, 2018
PDF	Page 160 Table 5-8	The text says: "argmin, argmax - Compute index locations (integers) at which minimum or maximum value obtained, respectively" Should be: "argmin, argmax - Compute index labels for Series at which minimum or maximum value obtained, respectively" Example from this chapter (returns a label, not an integer): In [115]: df.loc['d'].argmin() Out[115]: 'two' Note from the Author or Editor: Per https://github.com/pandas-dev/pandas/issues/16830 this is supposed to return the positional values but did not for a while because of some changes in pandas. In the future, it will do the right thing (what the book says now), so I'm not going to change the book	Andrey Dubinchak	Dec 14, 2017	Sep 21, 2018
PDF	Page 164 table 5-9	Looks like instead of method "match" there should be "get_indexer" Note from the Author or Editor: Fixing this to "get_indexer"	Aivar Annamaa	Nov 18, 2017	Sep 21, 2018
Printed	Page 172 Table 6-2	For the argument "names", combining with "header=None" is not needed. Using the parameter "names" implies this. Note from the Author or Editor: Right. Will fix	John Boersma	Nov 11, 2018
PDF	Page 173 in Table 6-2	The 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012. Note from the Author or Editor: I am changing the text to correspond to changes in the latest version of pandas	Noritada Kobayashi	Nov 05, 2017	Sep 21, 2018
PDF	Page 174 after Out[38]	As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012. Note from the Author or Editor: I am changing the text to correspond with changes in pandas	Noritada Kobayashi	Nov 05, 2017	Sep 21, 2018
PDF	Page 175 top	As Out[38] shows, the 'iterator' option of pandas.read_csv returns a TextFileReader object, not a TextParser object, since 2012. Note from the Author or Editor: I am changing the text to correspond to the current version of pandas	Noritada Kobayashi	Nov 05, 2017	Sep 21, 2018
Printed	Page 176 Bottom of page	"tuples of values" should be "lists of values". Note from the Author or Editor: Confirmed. Thanks	John Boersma	Nov 11, 2018
PDF	Page 179 Out[64]: result	In [64]: result Out[64]: {'name': 'Wes', 'pet': None, 'places_lived': ['United States', 'Spain', 'Germany'], 'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']}, {'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]} 'pet': None, Should print after the line: 'places_lived': ['United States', 'Spain', 'Germany'], Note from the Author or Editor: will fix	Shaahin Riazi	Apr 18, 2020
Printed	Page 180 1st paragraph	Refers to the USDA Food Database example in Chapter 7; in second edition, this example is in Chapter 14.4 (page 436-442) Note from the Author or Editor: Fixing this reference to point to Ch 14	Laura Hughes	Jan 31, 2018	Sep 21, 2018
PDF	Page 182 first block of code for getroot	Code says the example file is in path: path='examples/mta_perf/Performance_MNR.xml' Actual path from git repository is: path='datasets/mta_perf/Performance_MNR.xml' Note from the Author or Editor: Correct, thank you. This will need to be fixed in the source files	David Welden	Sep 25, 2017	Oct 20, 2017
PDF, ePub	Page 184 Link to Apache Arrow in 'Feather' Section	URL for 'Apache Arrow' points to 'apache.arrow.org' instead of 'arrow.apache.org'	Joel A	Oct 08, 2017	Oct 20, 2017
Printed	Page 184 last paragraph & [92]	In[92]: frame = pd.DataFrame({'a': np.random.randn(100)}) fails: ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing ------------------------- Although PyTables is mentioned in the previous text, there is no indication that this library needs to be installed. I tried "pip install PyTables", but it failed with: Collecting PyTables Could not find a version that satisfies the requirement PyTables (from versions: ) No matching distribution found for PyTables So, I'm still without PyTables and don't know how to get it. Note from the Author or Editor: will fix	Gregory Sherman	Jan 09, 2019
Printed	Page 185 In [96]	The command "store" at this point produces only the first two lines of the indicated output - the rest is not produced. Note from the Author or Editor: I think this was caused by a malformed build where the "mydata.h5" persisted between builds. I will see that it's fixed	John Boersma	Nov 11, 2018
PDF	Page 185 First sentence of final paragraph	Text reads "...how they can sunit your needs" Should be "...how they can suit your needs" Note from the Author or Editor: This typo is fixed in the final 2nd edition	David Welden	Sep 25, 2017	Sep 25, 2017
Printed	Page 186-187 Under heading on 186, second code block on 187	This is not so much an error, per se, but a comment on a "may" clause in the book. I'm writing this in case you like to track these sorts of issues. On page 186, the text says "Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. You may need to install these manually with pip or conda." This is very true, as the example line on the next page (187) "writer = pd.ExcelWriter('examples/ex2.xlsx')" threw an error on my system. I'm using pandas 0.21.0 within a Python 3.6.2 virtual environment. Manually installing the packages in question via pip solved my problems. Thanks! Note from the Author or Editor: I'm changing the language to say "These must be installed separately"	Jim Sam	Dec 08, 2017	Sep 21, 2018
Printed	Page 186 [105]	The text and command conflict: "Data stored in a sheet can then be read into DataFrame with parse: In [105]: pd.read_excel(xlsx, 'Sheet1')" Note from the Author or Editor: will fix	Gregory Sherman	Jan 09, 2019
PDF	Page 192 Table 7-1	The left column is named as "Argument", which should be "Method". Note from the Author or Editor: Making suggested change	Noritada Kobayashi	Nov 25, 2017	Sep 21, 2018
Printed	Page 195 In [35]	In [35]: _ = df.fillna(0, inplace=True) The assignment "_ =" is unnecessary. Note from the Author or Editor: will fix	Gregory Sherman	May 11, 2019
Printed	Page 204 middle	In [85]: data = np.random.randn(20) In [86]: pd.cut(data, 4, precision = 2) . . . The precision = 2 option limits the decimal precision to two digits. ----------- However, one of the bins I get is (0.031, 0.27] Note from the Author or Editor: will investigate	Gregory Sherman	Jan 10, 2019
Printed	Page 206 last sentence	"Calling permutation with the length of the axis you want to permute ..." According to what I have seen (an example is below) it seems that the phrase should be "the length of axis 0" or "the number of rows". Calling permutation() with the number of columns can result in rows being dropped or an IndexError. The question arises: can permutation() or a similar function randomly order columns? In [225]: df=DataFrame(np.arange(12).reshape((4,3))) In [226]: df Out[226]: 0 1 2 0 0 1 2 1 3 4 5 2 6 7 8 3 9 10 11 In [227]: s=np.random.permutation(4) In [228]: df.take(s) Out[228]: 0 1 2 2 6 7 8 1 3 4 5 0 0 1 2 3 9 10 11 In [229]: s=np.random.permutation(3) In [230]: df.take(s) Out[230]: 0 1 2 1 3 4 5 0 0 1 2 2 6 7 8 . . . In [258]: df=DataFrame(np.arange(12).reshape((3,4))) In [259]: s=np.random.permutation(4) In [260]: df.take(s) . . . IndexError: indices are out-of-bounds Note from the Author or Editor: will add example of permuting columns	Gregory Sherman	Jan 10, 2019
PDF	Page 208 1st paragraph of a section named "Computing Indicator/Dummy Variables"	The paragraph says "Let’s return to an earlier example DataFrame". However, since that example is contained in section 8.2 in the 2nd edition, "earlier" is not an appropriate word. Note from the Author or Editor: Fixing language to "Let's consider an example DataFrame..."	Noritada Kobayashi	Nov 27, 2017	Sep 21, 2018
Printed	Page 209 In [115]	The parameter "engine='python'" is needed in this command. Without this, a ParserWarning is produced due to the two character separator.	John Boersma	Nov 11, 2018
PDF	Page 213 Table 7-3	The left column is named as "Argument", which should be "Method". Note from the Author or Editor: Making suggested change	Noritada Kobayashi	Nov 25, 2017	Sep 21, 2018
Printed	Page 213 Table 7-3	The method "strip" is described as "equivalent to x.strip()". Isn't it exactly the same thing, not just equivalent?	John Boersma	Nov 11, 2018
PDF	Page 217 bottom	It's really not clear what In [176]: matches.str.get(1) is supposed to be returning here. Similarly with In [177]: matches.str[0] and matches.str[0]. I would expect to be shown a method to retrieve the regex matched groups for each email address string, but this clearly isn't what happens with this syntax. Was something else meant? Note from the Author or Editor: I am fixing this example. The erratum was reported by many others	Anonymous	Mar 09, 2018	Sep 21, 2018
PDF	Page 219 Table 7- 5	Book say: "match - Use re.match with the passed regular expression on each element, returning matched groups as list" Should say: "... returning Series/array of boolean values" And commands on pp 217 - 218 are not correct, because they return boolean values and there is no "access elements" at all. Instead of: In [174]: matches = data.str.match(pattern, flags=re.IGNORECASE) In [175]: matches Out[175]: Dave True Rob True Steve True Wes NaN dtype: object In [176]: matches.str.get(1) Out[176]:Dave NaN Rob NaN Steve NaN Wes NaN dtype: float64 In [177]: matches.str[0] Out[177]: Dave NaN Rob NaN Steve NaN Wes NaN dtype: float64 it may be better to use: In [174]: matches = data.str.extract(pattern, flags=re.IGNORECASE) In [175]: matches Out[175]: 0 1 2 Dave dave google com Rob rob gmail com Steve steve gmail com Wes NaN NaN NaN In [176]: matches[0] Out[176]: Dave dave Rob rob Steve steve Wes NaN Name: 0, dtype: object In [177]: matches.iloc[:, 0] Out[177]: Dave dave Rob rob Steve steve Wes NaN Name: 0, dtype: object Note from the Author or Editor: This behavior changed in pandas. I'm correcting the code examples and the language in the text	Andrey Dubinchak	Dec 27, 2017	Sep 21, 2018
Printed	Page 219 7.4 Conclusion	"Effective data preparation can significantly improve productive by ..." should read "Effective data preparation can significantly improve productivity by ..." Note from the Author or Editor: Fixing the typo	Francis Lewis	Jan 10, 2018	Sep 21, 2018
Printed	Page 224 The line before the section "Reordering and Sorting Levels"	The code MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color']) should be pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color']) Note from the Author or Editor: Adding the "pd."	Klaus Wang	May 17, 2018	Sep 21, 2018
PDF	Page 229 Table 8-1 Different join types with how argument	Final entry in table is 'output' join. It should be 'outer' join. Note from the Author or Editor: This needs to be changed from "output" to "outer"	David Welden	Sep 26, 2017	Oct 20, 2017
Printed	Page 237 In [86]	The command as it stands produces a FutureWarning. Either sort=True or sort=False should be added as parameters. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
PDF	Page 241 Final paragraph	The example of Series method combine_first is a bit vague. Although it apparently produces the desired output, the choice of b[:-2] and a[2:] for arguments is not obvious. It appears that it was chosen in order to reorder the index as well as combining data values, but this is not explained. Note from the Author or Editor: I am changing the code example to omit the slicing, and instead make "a" and "b" have their index labels in different order. This will definitely be clearer to the reader. Thanks for pointing this out	David Welden	Sep 27, 2017	Oct 20, 2017
Printed	Page 242 code examples with combine_first	The operation at the bottom of page 241: In [112]: np.where(pd.isnull(a), b, a) will take elements from a where available and from b where not available in a. The analogous operation using combine_first should then probably be: a.combine_first(b) rather than: b.combine_first(a) Note from the Author or Editor: will fix	Artem Glebov	Dec 26, 2018
Printed	Page 242 Third example on the page "In [93]:"	The example describes the use of the optional argument "join_axes". As of 4/5/21 this argument has been deprecated and now results in a TypeError. It can be replaced with the reindex function. Note from the Author or Editor: Confirmed. Will fix in 3rd edition	Dennis L Gonzales	Apr 06, 2021
PDF	Page 244 After Out[131]:	In [132] And Out[132] are the repetitions of: In[131] And Out[131] In [132] And Out[132] should be removed! Note from the Author or Editor: Confirmed. Will remove	Shaahin Riazi	Oct 08, 2020
Printed	Page 255 In [18]	As it stands, this line produces a "MatplotlibDeprecationWarning: In future re-calling will create a new instance." Best to revise to avoid a warning. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 255 explanation of [17]	"In IPython, an empty plot window will appear" No window appeared in 7.0.1 after running [11], "%matplotlib", [16], [17], [18] and [19] Note from the Author or Editor: will fix	Gregory Sherman	Jan 14, 2019
Printed	Page 259 Middle	In: subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None) None does not adjust. Use 0. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
PDF	Page 274 after Figure 9-17	Although mentioned that "tipping dataset used earlier in the book", the tipping dataset does not seem to be used earlier. That dataset is used first in this section and later in Ch. 10. Note from the Author or Editor: This is also used in chapter 9, but the language there was also incorrect. I am tweaking the language in both chapters 9 and 10 to reflect that these are the first times that readers will have seen this dataset	Noritada Kobayashi	Nov 11, 2017	Sep 21, 2018
PDF	Page 279 After Figure 9-22.	"distplot" method has been deprecated and removed in newer versions. Note from the Author or Editor: Thanks. Will fix	Shaahin Riazi	Oct 22, 2020
PDF	Page 283 In [108]: And In[109]:	The `factorplot` function has been renamed to `catplot`. Note from the Author or Editor: Confirmed. Will fix in 3rd edition	Shaahin Riazi	Oct 22, 2020
PDF	Page 300 In [66]:	result = grouped['tip_pct', 'total_bill'].agg(functions) -> needs an extra pair of [] Correct -> result = grouped[['tip_pct', 'total_bill']].agg(functions) Note from the Author or Editor: Confirmed. Will fix in 3rd edition based on latest version of pandas	Shaahin Riazi	Oct 30, 2020
Printed	Page 301 2nd Paragraph	"indepedently" should be "independently". Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 311 Top code block	"for suit in ['H','S','C','D']: " should be "for suit in suits:". Otherwise, there is no point in defining "suits" earlier in the code block. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 335 ts.shift(1, freq='90T') example	This method with 90T parameter should lag the data by 90 minutes at 90 min frequency. Instead, it seems to preserve the monthly frequency and only lag every timestamp by 1:30hr. Am I reading this correctly or is this by design? Clarification would be helpful. Note from the Author or Editor: I will add a note to the text to clarify that the "freq" parameter does not change the frequency of the data (if any)	Serge	Jan 25, 2018	Sep 21, 2018
Printed	Page 339 First whole paragraph	"EST" should be "Eastern Time". The point is that the interval straddles the standard time - daylight savings time boundary. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 340 The source code which shows Timestamp arithmetic before DST transition	In the source code that shows arithmetic before the DST transition, the book uses '2012-3-12 01:30', tz='US/Eastern'. But in 2012, DST in US/Eastern starts on 2012-3-11, so the code here does not show arithmetic across the DST transition and may not make sense to readers. The first edition of this book used '2012-03-11', not '2012-03-12', and was correct. Note from the Author or Editor: Confirmed. Fixing	Masato Setoyama	Mar 02, 2018	Sep 21, 2018
Printed	Page 347 [197] - [199]	"To convert back to timestamps, use to_timestamp:" There is no apparent change to the Series 'ts' by [197] & [199] - what is being demonstrated? Note from the Author or Editor: will clarify	Gregory Sherman	Jan 19, 2019
Printed	Page 351 Table 11-5, last row	convention defaults to 'start', not 'end'. Note from the Author or Editor: Fixing.	Hengni Cai	Mar 29, 2018	Sep 21, 2018
Printed	Page 352 1st and 2nd code examples.	The 2 code examples are the same. In[216]: ts.resample('5min', closed='right').sum() In[217]: ts.resample('5min', closed='right').sum() 216 should be WITHOUT the `closed='right'` Note from the Author or Editor: Confirmed. Will fix	Charbel Sarkis	Sep 27, 2018
Printed, PDF, ePub	Page 358 Figure 11-5	Fig 11-5 caption says: Apple 250-day daily return standard deviation. However the calc is based on price, so it's the price standard deviation, which is not really what one looks at usually. The correct call to plot the return standard deviation (add pct_change()) would be (e.g.): close_px.AAPL.pct_change().rolling(252, min_periods=np.int(252/2)).std().plot() Standard in finance is to show the annualized vol, which would be: (close_px.AAPL.pct_change().rolling(252, min_periods=np.int(252/2)).std() * np.sqrt(252)).plot() Note from the Author or Editor: will fix	Anonymous	May 23, 2019
Printed	Page 370 In [50]	The use of outer parentheses to facilitate line breaks, which is explained on page 381, should really be explained here at the first use. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 378 Text	The meaning of "unwrapped" here is really unclear. Does this refer to an internal process? The example is the same as on page 376, where "unwrapped" is not mentioned. Also, is "fast past" correct? Not sure what this means. Should it be "fast pass" or "fast path"? Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 384 1st Paragraph	The class 'pandas.TimeGrouper' does not exist anymore. It has been replaced by 'pandas.Grouper'. The code should be changed to the following: time_key = pd.Grouper(freq='5min') Note from the Author or Editor: Confirmed. Will fix	Ben B	Sep 17, 2020
Printed	Page 390 In [38]	Need parameter "rcond=None" to suppress FutureWarning. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
PDF	Page 437 Between 1st paragraph and 2nd paragraph	After the last sentence "Then, these can be concatenated together with concat:", some Python code appears to be missing. The code can be found in https://github.com/wesm/pydata-book/blob/2nd-edition/ch14.ipynb and reads: nutrients = [] for rec in db: fnuts = pd.DataFrame(rec['nutrients']) fnuts['id'] = rec['id'] nutrients.append(fnuts) nutrients = pd.concat(nutrients, ignore_index=True) Note from the Author or Editor: Thanks -- I am restoring the code to the text (it was being accidentally suppressed in the output)	Haruyoshi TAKIGUCHI	Apr 03, 2018	Sep 21, 2018
PDF	Page 452 1st paragraph	The paragraph states that "the result is shown in Figure A-3", but Figure A-3 is "illustration", not "result" (just a cosmetic issue). Note from the Author or Editor: Changing language to "this is illustrated in Figure A-3"	Noritada Kobayashi	Nov 26, 2017	Sep 21, 2018
PDF	Page 467 center of the page	The paragraph states that "the output of outer will have a dimension that is the sum of the dimensions of the inputs". Since the result of outer for (3, 4) and (5,) is (3, 4, 5), is it better to replace the word "sum" with "concatenation"? Note from the Author or Editor: Making suggested change	Noritada Kobayashi	Nov 26, 2017	Sep 21, 2018
PDF	Page 473 Code example 188	It would be better to make a zipped result more pretty for the last code example as follows: In [188]: zip(last_name[sorter], first_name[sorter]) Out[188]: <zip at 0x7fa203eda1c8> Note from the Author or Editor: Adding "list(...)" to make the example prettier	Noritada Kobayashi	Nov 27, 2017	Sep 21, 2018
Printed	Page 479 [214] through [215]	In [214]: numba_mean_distance = nb.jit(mean_distance) We could also have written this as a decorator: @nb.jit def mean_distance(x, y): . . . In [215]: %timeit numba_mean_distance(x, y) To be consistent, I would make the definition begin with "def numba_mean_distance(x, y):" Note from the Author or Editor: will do	Gregory Sherman	Feb 01, 2019
Printed	Page 482 Top	"mmap" is a fairly large file on disk. It would be good to add a command to delete it when done here. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 483 [230] and [231], plus preceding text	"In this example, summing the rows of these arrays should, in theory, be faster for arr_c than arr_f ..." Runs on my Windows 10 PC consistently show the opposite, like: In [46]: %timeit arr_c.sum(1) 1.65 ms ± 9.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [47]: %timeit arr_f.sum(1) 994 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) I have carefully checked: "C_CONTIGUOUS : True" for arr_c and "F_CONTIGUOUS : True" for arr_f Any idea what's going on? Note from the Author or Editor: will improve the example	Gregory Sherman	Jan 30, 2019
Printed	Page 483 preceding text and [230] and [231]	[more on same issue] On my PC, I found that sum(0) runs faster on arr_c : In [17]: %timeit arr_c.sum(0) 953 µs ± 9.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [18]: %timeit arr_f.sum(0) 1.6 ms ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) I wonder if the output in [230] and [231] does not actually result from what was built in [225] and [226]. Note from the Author or Editor: will review and make the difference more stark / consistent	Gregory Sherman	Feb 02, 2019
PDF	Page 485 The first paragraph and code	Original: Since the input variables are strings they can be executed again with the Python exec keyword: In [30]: exec(_i27) I propose the following: Since the input variables are strings they can be evaluated again with the Python eval keyword: In [30]: eval(_i27) Out[30]: 'bar' It looks like "exec" does not make sense in this context because _i27 is not a statement or a block of code. Note from the Author or Editor: It's not a mistake but "eval" makes the example more illustrative. Changing	Haruyoshi TAKIGUCHI	Apr 28, 2018	Sep 21, 2018
Printed	Page 487 Scorpion comment	The comment that deleting a variable does not free up memory appears to be incorrect. After using del I had a decrease in memory used on my mac as shown on activity monitor. Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
Printed	Page 491 Middle	"works_fine" method should be "works_fine function". Note from the Author or Editor: will fix	John Boersma	Nov 11, 2018
PDF	Page 494 The first code quote in the section "Basic Profiling: %prun and %run -p"	Found two syntax errors in Python 3. 1) for _ in xrange(niter): needs to be replaced by something like for _ in range(niter): 2) print 'Largest one we saw: %s' % np.max(some_results) needs to be replaced by something like print('Largest one we saw: {0}'.format(np.max(some_results))) Note from the Author or Editor: Fixing this	Haruyoshi TAKIGUCHI	Apr 08, 2018	Sep 21, 2018
Printed	Page 495 In [561] and In [562]	Reported Wall times are way off. More like 250ms and 100ms. Note from the Author or Editor: will add comment	John Boersma	Nov 11, 2018
Other Digital Version	2255 Functions Are Objects (section)	In the Amazon Kindle version, Chapter 3, section "Functions Are Objects", the text explains that the code: import re def clean_strings(strings): result = [] for value in strings: value = value.strip() value = re.sub('[!#?]', '', value) value = value.title() result.append(value) return result should clean the data FROM: states = [ ' Alabama ', 'Georgia!' , 'Georgia', 'georgia', 'Fl0rida', ... ] TO: ['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida', ... ] When running the code, none of the methods change 'Fl0rida' to 'Florida' as mentioned in the text. All the other entries are cleaned correctly. Note from the Author or Editor: Thanks. I will fix the code example.	Kyle Jeffreys	May 16, 2020
Mobi	Page 2621	"For large DataFrames, the head method is useful to get see the first 5 rows:" 'get' should be removed	Bridgeland	Mar 29, 2017	Sep 25, 2017
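
The "Functions Are Objects" report above can be reproduced with a short sketch. It reuses the clean_strings function as quoted in that report and assumes the input really contains the digit zero ('Fl0rida'), as the submitter describes; the shortened sample list and the variable name cleaned are illustrative only. Under that assumption, nothing in the function maps a digit to a letter, so the entry cannot come out as 'Florida'.

```python
import re


def clean_strings(strings):
    """Strip whitespace, drop '!', '#', '?' and title-case each string."""
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result


# Sample data as quoted in the report (assumption: 'Fl0rida' contains the
# digit zero, not the letter O).
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'Fl0rida']
cleaned = clean_strings(states)
print(cleaned)
```

The first four entries come out as 'Alabama' and three copies of 'Georgia'. The last entry keeps its digit, and str.title() additionally capitalizes the letter that follows it, so the result is not 'Florida'.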