Errata for Python for Data Analysis

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Page Section 7.3, page 227, Table 7-3
Top

StringDtype is missing from the table. Not sure if the table is meant to be exhaustive, but this is an important type that should be included in this table.

Note from the Author or Editor:
I will add it in a subsequent printing

Kerrick Staley  Jun 06, 2023 
Page Table 2.1: Binary operators
Table 2.1: Binary operators

a <= b should also be in inline code, i.e.
`a < b, a <= b`.

Note from the Author or Editor:
will fix

Alen Softić  Jan 11, 2023 
Page 13 Data Analysis Examples
"Donation Statistics by Occupation and Employer" Section

In the web version of book:

In the below text the "f" in the map method should be "get_emp"


def get_emp(x):
    # If no mapping provided, return x
    return emp_mapping.get(x, x)

fec["contbr_employer"] = fec["contbr_employer"].map(f)

Note from the Author or Editor:
will fix

Anonymous  Jan 05, 2023 
Page 1.4 Installation and Setup
Installing Necessary Packages

In the part about setting up the environment, the package should be changed from jupyter to jupyterlab to avoid some package/dependency conflicts.

change "(pydata-book) $ conda install -y pandas jupyter matplotlib" to
"(pydata-book) $ conda install -y pandas jupyterlab matplotlib"

Note from the Author or Editor:
agreed, fixing

Luiz Henrique  Dec 06, 2022 
Page 2 Python Language Basics, IPython, and Jupyter Notebooks
1st paragraph

"Now in 2022, there is now[...]" --> One "now" should be removed

Note from the Author or Editor:
will fix

Anonymous  Nov 30, 2022 
Page 7.3 Extension Data Types
Table 7.3: pandas extension data types

I found this error in the HTML (web) version. The table reference links to the wrong table, so clicking the link leads to the wrong place.

Note from the Author or Editor:
fixing in html version

Anonymous  Nov 05, 2022 
Page Chapter 2, strings
Above output #65

Change "Afer this operation, the variable"
to
"After this operation, the variable"

wesmckinney.com/book/python-basics.html

Note from the Author or Editor:
will fix

Will Beasley  Sep 24, 2022 
Page https://wesmckinney.com/book/python-builtin.html
https://wesmckinney.com/book/images/pda3_0301.png

pda3_0301.png is apparently missing.

Note from the Author or Editor:
This has been fixed

brian piercy  Sep 17, 2022 
Page https://wesmckinney.com/book/python-builtin.html#comprehensions
https://wesmckinney.com/book/python-builtin.html

"we could filter out strings with length 2 or less and convert them to uppercase like this:"

does not tie to the code following thereafter:
[x.upper() for x in strings if len(x) > 2]
['BAT', 'CAR', 'DOVE', 'PYTHON']

Change to e.g. "filter out strings with length more than 2 and convert them..."
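
For reference, a minimal sketch of the comprehension under discussion, assuming the `strings` list matches the one used in that section of the book. The condition `len(x) > 2` keeps strings longer than two characters, which is equivalent to dropping those of length 2 or less.

```python
# Assumed input list, following the book's example
strings = ["a", "as", "bat", "car", "dove", "python"]

# Keep only strings longer than 2 characters, converted to uppercase
result = [x.upper() for x in strings if len(x) > 2]
print(result)  # ['BAT', 'CAR', 'DOVE', 'PYTHON']
```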

Note from the Author or Editor:
will fix

Thomas Pfeiffer  Sep 08, 2022 
Page Acknowledgements for the Third Edition
1st two lines

Text reads: "It has more than a decade since I started writing the first edition of this book and more than 15 years since I originally started my journey as a Python prorammer."

Programmer is missing the 'g' and 'It has more...' should probably read 'It has BEEN more...'

Note from the Author or Editor:
will fix

Laure Robinson  Aug 29, 2022 
Page Filling In Missing Data
3rd paragraph

The ??? is not replaced with the actual reference link.

'The same interpolation methods available for reindexing (see ???) can be used with fillna'

Note from the Author or Editor:
will fix

Junwei Fang  Aug 11, 2022 
Page 9 Plotting and Visualization
Python for Data Analysis, 3E (pre release)

I use "jupyter notebook" and "jupyterlab". The examples in "Figures and Subplots" cannot be reproduced with jupyterlab exactly as in the book.

Suggestion
At the beginning of the book, it should be pointed out that the examples can be reproduced with "jupyter notebook", and that minor adjustments are necessary with jupyterlab.

Best regards, Robert

Note from the Author or Editor:
I'm adding some language for JupyterLab users

Robert Moser  Aug 10, 2022 
Page Chapter 2 - dates and times
https://wesmckinney.com/book/python-basics.html#scalar_dates

Question marks not replaced with actual reference:

"See ??? for a full list of format specifications."

Note from the Author or Editor:
will fix, this is only on the HTML version

Anonymous  Jul 29, 2022 
Page Data Types for ndarrays
General Note

Digital version of the book at wesmckinney.com/book/numpy-basics.html. In the general note, the wording goes as "A signed integer can represent both positive and negative integers, while an unsigned integer can only represent nonzero integers." Given the code example provided within the note, I think "nonnegative" was meant.

An attempt to pass a sequence with a negative number while specifying unsigned integer data type yields a peculiar result:

In [35]: np.array([-1, 0, 1], dtype="u1")
Out[35]: array([255, 0, 1], dtype=uint8)
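
A small sketch of the wraparound, assuming NumPy is installed. Some recent NumPy releases may reject out-of-range Python integers passed directly with dtype="u1", so this sketch casts an existing signed array with astype instead; that substitution is an assumption of the sketch, not the book's code.

```python
import numpy as np

signed = np.array([-1, 0, 1], dtype="i1")  # int8 can hold -128..127
unsigned = signed.astype("u1")             # uint8 can hold 0..255, so -1 wraps to 255

print(signed.tolist())    # [-1, 0, 1]
print(unsigned.tolist())  # [255, 0, 1]
```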

Thank You for elaborating on this distinction!

Note from the Author or Editor:
will fix

Semyon Bokhankevich  Jul 27, 2022 
Page Tab Completion
Paragraph 2

Digital version of the book at wesmckinney.com/book/ipython.html. Word "also" repeated twice: "Also, you can also complete methods and attributes on any object after typing a period:"

Note from the Author or Editor:
will fix

Semyon Bokhankevich  Jul 25, 2022 
Page Ch 4: Fancy Indexing
The last paragraph before the 5th code chunk

Many users (myself included) may have expected fancy indexing to return a rectangular sub-matrix.
Here is one way to get that:
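
One possible way, sketched with NumPy's `np.ix_` helper. The array and the index lists are assumptions chosen to resemble the chapter's example; the book's reworded text may use a different approach.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

# Paired fancy indexing picks the elements (1, 0), (5, 3), (7, 1), (2, 2)
paired = arr[rows, cols]

# np.ix_ builds an open mesh, selecting every row/column combination
block = arr[np.ix_(rows, cols)]

print(paired.shape)  # (4,)
print(block.shape)   # (4, 4)
```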

Note from the Author or Editor:
will reword

Nicholas Vence  Jul 22, 2022 
Page Python Language Basics > Scalar Types > Strings
https://wesmckinney.com/book/python-basics.html#scalar_strings

A mention of an additional line break is missing in the following text:
"It may surprise you that this string c actually contains four lines of text; the line breaks after """ and after lines are included in the string. We can count the new line characters with the count method on c:"

There are 3 line breaks that should be mentioned:
1) after """
2) after `that`
3) after `lines`

Line breaks 1 and 3 are correctly mentioned, but 2 was omitted and should be included.
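
A quick check of the count, assuming the string `c` is defined as in that section of the book:

```python
c = """
This is a longer string that
spans multiple lines
"""

# One newline after the opening quotes, one after "that", one after "lines"
newline_count = c.count("\n")
pieces = c.split("\n")

print(newline_count)  # 3
print(len(pieces))    # 4, the first and last pieces are empty strings
```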

Anonymous  Jul 19, 2022 
Page Selection on DataFrame with loc and iloc
Right before Table 5.4 3rd edition online

Note: Resubmitted for clarity
"Boolean arrays can used with loc but not iloc:"

As per the documentation, this is not 100% accurate. A boolean ndarray may be passed for iloc:

Given:

mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
df
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

The following produces valid output:

df.iloc[:, df.columns.isin(['a','b'])]

However, the following does not:

df.iloc[df['c']>=300]
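
A hedged sketch of a row selection that works with both indexers, assuming a reasonably recent pandas. Converting the boolean Series to a plain NumPy array before passing it to iloc is one workaround; it is not necessarily what the book recommends.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"a": 1, "b": 2, "c": 3, "d": 4},
        {"a": 100, "b": 200, "c": 300, "d": 400},
        {"a": 1000, "b": 2000, "c": 3000, "d": 4000},
    ]
)

mask = df["c"] >= 300  # boolean Series, aligned on the index

by_label = df.loc[mask]                 # loc accepts the Series directly
by_position = df.iloc[mask.to_numpy()]  # iloc is given a plain boolean array

print(by_position)
```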

Note from the Author or Editor:
I am clarifying that I mean for selecting rows

Mauricio Ruiz  Jul 14, 2022 
Page 9.2 Plotting with pandas and seaborn, code for figure
Code for figure 9-24

The code attempts to set the title of the chart with:

In [109]: ax.title("Changes in log(m1) versus log(unemp)")

This raises an exception:
TypeError: 'Text' object is not callable

Perhaps it should be:
ax.set(title="Changes in log(m1) versus log(unemp)")

Note from the Author or Editor:
will fix

Mark Meyer  Jul 13, 2022 
Page Section 8.3 Reshaping and Pivoting
Pivoting “Long” to “Wide” Format

It reads: "Now, ldata looks like:"

I believe it should read: "Now, long_data looks like:"

Note from the Author or Editor:
will fix

Andres Medaglia  Jul 07, 2022 
Page Section 5.2
Selection on DataFrame with loc and iloc

It reads “To select multiple roles“ and it should read “To select multiple rows”.

Note from the Author or Editor:
will fix

Andres Medaglia  Jul 01, 2022 
Page Merging on Index
3rd block of code in the section

There is a formatting error: everything below
pd.DataFrame({"event1": pd.Series([0, 2, 4, 6, 8, 10], dtype="Int64",

is displayed in red within the code block, which makes it confusing to read.

Note from the Author or Editor:
will reformat

Enrique M. Muro  Jun 30, 2022 
Page https://wesmckinney.com/book/data-cleaning.html#prep_dummy_vars
section on Computing Indicator/Dummy Variables

You have the following note: "For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame." What is meant by lower-level? A custom C function or a Python function that is more efficient in some way? Given that pandas is presumably written in C, it's surprising that any type of Python function could be faster than pandas. I think this note could be clarified for the reader by saying "not especially speedy because...". For example, is it because using pandas in this way will do too many memory allocations, too many data copies, etc.

Note from the Author or Editor:
i'm removing this note

Graeme Richardson  Jun 28, 2022 
Page https://wesmckinney.com/book/pandas-basics.html#pandas_summarize
section 5.3 Summarizing and Computing Descriptive Statistics

This statement seems to be incorrect or at least unclear: "When an entire row or column contains all NA values, the sum is 0, whereas if any value is not NA then the result is NA." As seen in the examples, if any value is not NA, the result is a sum.
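
The behavior in question can be checked with a short sketch; the Series values here are made up for illustration.

```python
import numpy as np
import pandas as pd

all_na = pd.Series([np.nan, np.nan])
some_na = pd.Series([1.5, np.nan, 2.5])

total_all_na = all_na.sum()               # NA values are skipped, nothing is left: 0.0
total_some_na = some_na.sum()             # NA values are skipped: 4.0
total_strict = some_na.sum(skipna=False)  # NA propagates: nan

print(total_all_na, total_some_na, total_strict)
```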

Note from the Author or Editor:
Confirmed, I am fixing the language to be correct

Graeme Richardson  Jun 23, 2022 
Page 72
clean_strings(states)

On page 72, after applying clean_strings(states), "South Carolina" has an unwanted space in between. I believe this is a print error. Sorry if this is a false flag, just trying to help.

Note from the Author or Editor:
will improve the example

Mauricio Ruiz  Jun 23, 2022 
Page https://wesmckinney.com/book/preliminaries.html
section 1.4 Installation and Setup

Under section 1.4 Installation and Setup, you have the following subheadings:
Miniconda on Windows
GNU/Linux
Miniconda on macOS

I think that middle one should be "Miniconda on GNU/Linux" for consistency with the other two.

Note from the Author or Editor:
will fix

Graeme Richardson  Jun 07, 2022 
Page NA
4.4 Array-Oriented Programming with Arrays

wesmckinney.com/book/numpy-basics.html

Current: In [169]: points = np.arange(-5, 5, 0.01) # 100 equally spaced points
Proposed fix: In [169]: points = np.arange(-5, 5, 0.01) # 1000 equally spaced points

Note from the Author or Editor:
will fix

Matt Dahlman  Jun 07, 2022 
Page chapter 9
first paragraph

In the section "Ticks, Labels, and Legends", ax.xlim() no longer works; it has been changed to ax.set_xlim().

Note from the Author or Editor:
will fix

Levy  May 26, 2022 
n/a

In the open access version, when seaborn histplots are plotted, the kde=True argument seems to be missing: wesmckinney.com/book/plotting-and-visualization.html#fig-vis_series_kde

Note from the Author or Editor:
will fix

Hamed   Apr 22, 2022 
Page New 3/E textbook > Chapter 2 > Variables and argument passing
3rd paragraph

wesmckinney.com/book/python-basics.html#semantics_references

This:
"In some languages, the assignment if b will cause the data..."

should probably be:
"In some languages, the assignment of b will cause the data..."

Changed 'if' to 'of'.

Note from the Author or Editor:
will fix

Aaditya Bugga  Apr 09, 2022 
Page 403
example line 'In [85]'

FutureWarning:
statsmodels.tsa.AR has been deprecated in favor of statsmodels.tsa.AutoReg and
statsmodels.tsa.SARIMAX.

AutoReg adds the ability to specify exogenous variables, include time trends,
and add seasonal dummies. The AutoReg API differs from AR since the model is
treated as immutable, and so the entire specification including the lag
length must be specified when creating the model. This change is too
substantial to incorporate into the existing AR api. The function
ar_select_order performs lag length selection for AutoReg models.

AutoReg only estimates parameters using conditional MLE (OLS). Use SARIMAX to
estimate ARX and related models using full MLE via the Kalman Filter.

Note from the Author or Editor:
this is fixed in the 3rd edition

Dennis Gonzales  May 30, 2021 
Page 114
Bottom half

The text (pdf page 114, book pages 134-135) illustrates the creation of a DataFrame from a dict. First, the dict creation is shown:

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

then, the data frame is created:

frame3 = pd.DataFrame(pop)

That's all fine thus far. However, the display of the DataFrame after creation isn't correct, in that the index order as shown isn't what occurs. That is, the book and pdf show:


      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

But in reality, the DataFrame is displayed thus, if one follows along with the text as shown:

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5

In other words, the indices are displayed 2001, 2002, 2000, rather than 2000, 2001, 2002. This matters, because the examples that follow immediately (which involve transposing and also index slicing using that first DataFrame) then won't work as shown. The problem lies with the original dict creation. If the order of the "Nevada" and "Ohio" dicts are swapped, with Ohio being first, then the indices will appear in the desired order (i.e., 2000, 2001, 2002). (However, note that the columns in the resulting DataFrame will also then be swapped (with Ohio appearing as the first column, and Nevada second)).

The bottom line is that the whole set of examples doesn't work as shown, unfortunately, and there is a cascading effect - the first example is off, and thus so are the following examples based upon the first.
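
One way to make the row order deterministic without swapping the dicts is to sort the index after construction. This is a sketch of a possible workaround, not necessarily the fix adopted in the 3rd edition.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Sorting the index makes the row order independent of dict insertion order
frame3 = pd.DataFrame(pop).sort_index()

print(frame3.index.tolist())    # [2000, 2001, 2002]
print(frame3.columns.tolist())  # ['Nevada', 'Ohio']
```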










Note from the Author or Editor:
this was fixed in the 3rd edition

Andrew Boudreau  May 30, 2021 
Printed
Page 357
Second example "In [221]:"

<ipython-input-216-793d385fe06a>:1: FutureWarning: 'loffset' in .resample() and in Grouper() is deprecated.

>>> df.resample(freq="3s", loffset="8H")

becomes:

>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample(freq="3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")

ts.resample('5min', closed='right', label='right', loffset='-1s').sum()
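
A sketch of how that last line might be written without loffset, by shifting the index of the result instead. The one-minute series `ts` is a hypothetical stand-in for the book's data, and subtracting a `pd.Timedelta` is an assumption of this sketch rather than the author's chosen fix.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute series for illustration
index = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=index)

result = ts.resample("5min", closed="right", label="right").sum()
# Shift the bin labels back by one second instead of passing loffset="-1s"
result.index = result.index - pd.Timedelta(seconds=1)

print(result)
```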

Note from the Author or Editor:
will fix

Dennis Gonzales  May 16, 2021 
Printed
Page 287
1st line of example "In [108]"

Update: seaborn now requires the keyword arguments "x=" and "y=" for the first two arguments in the referenced example.

Dennis Gonzales  Apr 23, 2021 
Printed
Page 150
Bottom of 2nd code block

df1 - df2 should use '+' operator instead if adding lists. '-' operator still produces the same result.

Shivan Sivakumaran  Oct 08, 2020 
Printed, PDF
Page 373
10th Line

Error in code: In [77]: g = df.groupby('key').value
Correction: g = df.groupby('key')

.value after a groupby method would lead to an error, as ".value" is not an aggregation function. Given the context, I think this should be just
g = df.groupby('key')

Bharath Reddy  Mar 11, 2020 
Printed
Page 378
n/a

The index entry and examples for the merge_asof() function are missing. It has existed for a while and seems useful for financial time series. Is there a specific reason it has been omitted? Or can one easily implement it with some of the documented functions?
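
For readers unfamiliar with the function, a minimal sketch of `pd.merge_asof` on made-up quote and trade data. The frames and column names are hypothetical; both inputs must be sorted by the key column.

```python
import pandas as pd

# Hypothetical quotes and trades; both frames must be sorted by "time"
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 09:30:00", "2020-01-01 09:30:02",
                            "2020-01-01 09:30:05"]),
    "bid": [100.0, 100.5, 101.0],
})
trades = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 09:30:01", "2020-01-01 09:30:03",
                            "2020-01-01 09:30:06"]),
    "qty": [10, 20, 30],
})

# For each trade, take the most recent quote at or before the trade time
merged = pd.merge_asof(trades, quotes, on="time")
print(merged)
```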

Note from the Author or Editor:
will document in 3rd edition

E G  Feb 18, 2020 
Printed
Page 378
index for as_ordered

as_ordered methdo, 378 --> as_ordered method

E G  Feb 18, 2020 
Printed
Page 351
Table 11-5. Resample method arguments

The 'freq' argument seems wrong. When passing it explicitly, the following error message is returned:

In [22]: ts2 = ts.resample(freq='M')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-ad32b3871b3e> in <module>()
----> 1 ts2 = ts.resample(freq='M')

TypeError: resample() got an unexpected keyword argument 'freq'

Indeed, in the docs (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.resample.html) it is listed as 'rule' argument, which works:

In [25]: ts2 = ts.resample(rule='M').mean()

In [26]: ts2
Out[26]:
2000-01-31 -0.123505
2000-02-29 0.011267
2000-03-31 0.180698
2000-04-30 0.007794

Note from the Author or Editor:
confirmed, will fix

Yonathan Mizrahi  Oct 08, 2018 
Printed
Page 145
Bottom of page & continuing

"...if you have an axis containing integers, data selection will always be label-oriented."

But earlier, on p. 141: "Slicing with labels...the endpoint is inclusive"

So why at bottom of p. 145 does ser[:1] not include the endpoint of the slice (only first row returned)? Shouldn't label-oriented slicing of this "axis containing integers" make ser[:1] return the same two rows as ser.loc[:1]? Shouldn't it be the case that only ser.iloc[:1] is not label-oriented, and therefore only it excludes the endpoint of the slice?
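
The difference between the two explicit indexers can be seen in a short sketch. The Series is an assumption modeled on the book's `ser`; plain `ser[:1]` is left out because its treatment is exactly what the question is about.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))  # default integer index 0, 1, 2

by_label = ser.loc[:1]      # label-based: the endpoint 1 is included
by_position = ser.iloc[:1]  # position-based: the endpoint is excluded

print(len(by_label))     # 2
print(len(by_position))  # 1
```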

Note from the Author or Editor:
will review

Stephen Frost  Feb 08, 2018 
Printed
Page 103
3rd paragraph

In 1st release of 2nd edition print copy, section on numpy fancy indexing (page 103, 3rd paragraph) says, “...the result of fancy indexing is always one-dimensional.” However, there are example outputs in this section with more than one dimension. Is that because some of the examples in the section are not fancy indexing? If that’s the case, it’s unclear where the section is building up to a fancy indexing example as opposed to every example being fancy indexing. The number of dimensions in the output seems to be number of array dimensions minus number of index dimensions, unless index can also have more dimensions than the array.

Note from the Author or Editor:
will clarify

Stephen Frost  Feb 08, 2018 
Printed
Page 23
first code block after 2nd paragraph


In the users.dat file downloaded from https://grouplens.org/datasets/movielens/1m/ the data for 'gender' is before 'user_id' e.g.

1::F::1::10::48067
2::M::56::16::70072

therefore unames should not be defined as:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

instead it should be:
unames = ['gender', 'user_id', 'age', 'occupation', 'zip']

Note from the Author or Editor:
will review

Edward Hope  Jan 18, 2018 
PDF
Page 113
In [17]

In [17]: np.exp(obj2)

numpy needs to be imported before this code. There should be a line of code before this code:
import numpy as np

Note from the Author or Editor:
Will add missing import

Dan Yuan  Sep 13, 2017 
Printed
Page 419
entire example

The example lacks a function to remove extra whitespace in string "south carolina##". Either the output should be altered at top and bottom of page 419 (i.e. "Out[15]" and "Out[22]") or a function should be added to normalize the whitespace between tokens. E.g. value = ' '.join(value.split())
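
A sketch of how the suggested normalization could fit into a cleaning function. The function body and the input list are assumptions loosely modeled on the book's example, not the author's revised code.

```python
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()               # trim leading/trailing whitespace
        value = re.sub("[!#?]", "", value)  # drop punctuation
        value = " ".join(value.split())     # collapse internal whitespace runs
        result.append(value.title())
    return result

states = ["   Alabama ", "Georgia!", "south   carolina##", "West virginia?"]
cleaned = clean_strings(states)
print(cleaned)
# ['Alabama', 'Georgia', 'South Carolina', 'West Virginia']
```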

Note from the Author or Editor:
Will fix example and clarify text, since it only strips whitespace from the start and end of the tokens

Craig Murray  Feb 15, 2016 
Mobi
Page 5385
Example code

The location is based on the location information provided by my Kindle reader using the mobi format. I believe this would be page 229 in the physical edition.

In the example code in section "Annotation and Drawing on Subplot," the first element of each tuple in the crisis_data list is of type datetime.datetime. These elements are used as an argument to pandas.asof(). However, this method takes a DateTimeIndex as an argument. Therefore, this date value needs to be converted using pandas.to_datetime() before making the call to asof().

Note from the Author or Editor:
will review for 3e

Patton Bradford  Feb 13, 2016 
PDF
Page 416
defaultdict examples at the top of the page

The two examples illustrating the usage of defaultdict, don't quite work as described in Python 3 (at least not in v. 3.4.3). For the first example, one cannot see the result in the same form as by the techniques on the previous page, by just typing by_letter; one must type dict(by_letter). Next, it is not clear what the example,

counts = defaultdict(lambda: 4)

is supposed to produce. Typing counts at the prompt (in IDLE), simply yields

defaultdict(<function <lambda> at 0x02DD3B70>, {})

while typing dict(counts), yields

{}

It is not clear how one could incorporate this construction into the previous example or for a new example, to see how 4 gets used.
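
A short sketch of where the 4 comes into play: the factory is only called when a missing key is looked up. The key names are arbitrary.

```python
from collections import defaultdict

counts = defaultdict(lambda: 4)

# Nothing is stored until a missing key is looked up
before = dict(counts)  # {}

value = counts["x"]    # missing key: the factory supplies 4 and stores it
counts["y"] += 1       # starts from the default 4, then adds 1

after = dict(counts)   # {'x': 4, 'y': 5}
print(before, value, after)
```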

Note from the Author or Editor:
I will confirm that this behaves in the expected way in the 3rd ed

Anonymous  Sep 08, 2015 
PDF
Page 413
3rd IPython display: In[432], Out[434] and Out[435]

The example is correct but you may as well get the names correct too, seeing as the names are those of real people.

On the 2nd line of In[432]:

('Schilling', 'Curt') should be ('Curt', 'Schilling')

The output for Out[434] and Out[435] will then be corrected accordingly, to:

Out[434]: ('Nolan', 'Roger', 'Curt')

Out[435]: ('Ryan', 'Clemens', 'Schilling')
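
With the names corrected, the example would run as in this sketch (variable names assumed to follow the book):

```python
pitchers = [("Nolan", "Ryan"), ("Roger", "Clemens"), ("Curt", "Schilling")]

# zip(*...) "unzips" the list of pairs into two tuples
first_names, last_names = zip(*pitchers)

print(first_names)  # ('Nolan', 'Roger', 'Curt')
print(last_names)   # ('Ryan', 'Clemens', 'Schilling')
```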

Note from the Author or Editor:
I've fixed this in the book source materials

Anonymous  Sep 07, 2015 
Printed
Page 217
Caption for Figure 7.1

Figure 7-1 displays values by *food* group, not by nutrient group (Zinc is the nutrient in the example). Its captions should hence read something along the lines of "Median Zinc values by food group".

Note from the Author or Editor:
Confirmed the caption is wrong. Will fix

David Garcia Quintas  Sep 07, 2015 
PDF
Page 203
Middle of the page

Splitting the categories from the movie dataset can be achieved by using:
movies.genres.str.get_dummies('|')
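
A minimal sketch of the suggestion on a few made-up genre strings in the same "|"-separated style; the real `movies` DataFrame is not reproduced here.

```python
import pandas as pd

# Hypothetical genre strings standing in for movies.genres
genres = pd.Series(["Animation|Children's|Comedy", "Comedy|Romance", "Drama"])

# One indicator column per distinct genre
dummies = genres.str.get_dummies("|")
print(dummies)
```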

Note from the Author or Editor:
Awesome. I'll use this feature in the next iteration of the book

Kristof  Sep 05, 2015 
Printed
Page 23
1st code sample

The 2nd and 3rd use of pd.read_table should use the ratings.dat and movies.dat file and not users.dat

Note from the Author or Editor:
Thanks. This has been fixed

Richard White  Mar 30, 2015 
Printed
Page 194
3rd paragraph

On the 3rd paragraph of "Removing Duplicates" sub-section: the drop_duplicates function returns where it is FALSE although the book says where it is TRUE.


"Relatedly, drop_duplicates returns a DataFrame where the duplicated array is True:
In [129]: data.drop_duplicates()"


So, the 'True' should be replaced by 'False'.
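
The relationship can be checked directly. The DataFrame below is an assumption modeled on the book's example data.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

is_dup = data.duplicated()        # True for rows that repeat an earlier row
deduped = data.drop_duplicates()  # keeps the rows where is_dup is False

print(is_dup.tolist())
print(deduped.index.tolist())
```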

Thanks.
Simone.

Note from the Author or Editor:
This has been fixed in the 2nd edition

Simone Occulate  Dec 15, 2014 
PDF
Page 194
3rd paragraph under "Removing Duplicates"

"Relatedly, drop_duplicates returns a DataFrame where the duplicated array is True:"

The index values from `data.drop_duplicates()` suggest that drop_duplicates returns rows where the duplicated() array is False.

Note from the Author or Editor:
Nice catch, will fix in the upcoming printing.

Chapman  Nov 17, 2014  Dec 12, 2014
Printed
Page 344
1st paragraph, body of the "to_index" function

The given definition of to_index:

def to_index(rets):
    index = (1 + rets).cumprod()
    first_loc = max(index.notnull().argmax() - 1, 0)
    index.values[first_loc] = 1
    return index

doesn't seem to work with pandas 0.14.1, firstly due to "index.notnull().argmax() - 1", where index.notnull().argmax() is now a Timestamp without an offset, from which one can't subtract an int. Moreover, one can't compare it against an int, as part of the max() function.

The following version works:
def to_index(rets):
    index = (1 + rets).cumprod()
    first_loc = index.notnull().argmax()
    index[first_loc] = 1
    return index

Note from the Author or Editor:
Good catch will fix in the upcoming printing.

David Garcia Quintas  Oct 04, 2014  Dec 12, 2014
Printed
Page 175
top

Due to the change to SQLAlchemy, the conn object is replaced by an engine object.

The line,

conn = sqlite3.connect(':memory:')

should be replaced by

To use a SQLite :memory: database, specify an empty URL:

engine = create_engine('sqlite://')

Notice that 'sqlite' is in lowercase and without a '3' suffix.

For a relative file path, this requires three slashes:

engine = create_engine('sqlite:///foo.db')

And for an absolute file path, four slashes are used:

engine = create_engine('sqlite:////absolute/path/to/foo.db')

source:
http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#sqlite

Note from the Author or Editor:
Editors: We are addressing this in the code example review.

Reporter: This will be fixed in the next printing

Jim Callahan  Jul 31, 2014  Dec 12, 2014
Printed
Page 175
United States

Current text "...pandas has a read_frame function in its pandas.io.sql module that simplifies the process."

Warnings when running code:
1. "read_frame is deprecated, use read_sql"
2. "Reading a table with read_sql is not supported for a DBAPI2 connection. Use a SQLAlchemy engine or specify a SQL query"

This apparently changed with pandas release v0.14.0 (May 31, 2014). Essentially, the SQL function names changed and the engine object replaces the connection object.

The SQL changes are documented in:
http://pandas.pydata.org/pandas-docs/stable/pandas.pdf
page 8 "SQL interfaces updated to use sqlalchemy, "
page 18 "The SQL reading and writing functions now support more database flavors through SQLAlchemy...
The new functions read_sql_query() and read_sql_table() are introduced. The function read_sql()
is kept as a convenience wrapper around the other two and will delegate to specific function depending on the provided
input (database table name or sql query).
In practice, you have to provide a SQLAlchemy engine to the sql functions. To connect with SQLAlchemy you use
the create_engine() function to create an engine object from database URI. You only need to create the engine
once per database you are connecting to. For an in-memory sqlite database:
In [43]: from sqlalchemy import create_engine
# Create your connection.
In [44]: engine = create_engine('sqlite:///:memory:')
This engine can then be used to write or read data to/from this database:
In [45]: df = pd.DataFrame({'A': [1,2,3], 'B': ['a', 'b', 'c']})
In [46]: df.to_sql('db_table', engine, index=False)
You can read data from a database by specifying the table name:
In [47]: pd.read_sql_table('db_table', engine)
Out[47]:
   A  B
0  1  a
1  2  b
2  3  c
or by specifying a sql query:
In [48]: pd.read_sql_query('SELECT * FROM db_table', engine)
Out[48]:
   A  B
0  1  a
1  2  b
2  3  c"

Note from the Author or Editor:
We are fixing this in the code example review. Will be fixed in next printing

Jim Callahan  Jul 31, 2014  Dec 12, 2014
PDF
Page 18
India

The following command [json.loads(line) for line in open(path)] produces the following error:

--------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-83-b1e0b494454a> in <module>()
----> 1 records = [json.loads(line) for line in open(path)]

C:\Users\Mrinal\AppData\Local\Enthought\Canopy32\App\appdata\canopy-1.4.1.1975.win-x86\lib\json\__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 parse_int is None and parse_float is None and
337 parse_constant is None and object_pairs_hook is None and not kw):
--> 338 return _default_decoder.decode(s)
339 if cls is None:
340 cls = JSONDecoder

C:\Users\Mrinal\AppData\Local\Enthought\Canopy32\App\appdata\canopy-1.4.1.1975.win-x86\lib\json\decoder.pyc in decode(self, s, _w)
363
364 """
--> 365 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
366 end = _w(s, end).end()
367 if end != len(s):

C:\Users\Mrinal\AppData\Local\Enthought\Canopy32\App\appdata\canopy-1.4.1.1975.win-x86\lib\json\decoder.pyc in raw_decode(self, s, idx)
379 """
380 try:
--> 381 obj, end = self.scan_once(s, idx)
382 except StopIteration:
383 raise ValueError("No JSON object could be decoded")

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 6: invalid start byte


Please help and explain the reason for the error

Note from the Author or Editor:
Editors, can you please change "open(path)" to "open(path, 'rb')"? This will fix the issue for readers using Python 3.
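
A self-contained sketch of the two ways to read such a file in Python 3. The temporary file and its two records are made up for illustration; the text-mode variant with an explicit encoding is an alternative added here, not part of the author's note.

```python
import json
import os
import tempfile

# Hypothetical records standing in for the book's dataset
records_in = [{"tz": "America/New_York", "note": "café"},
              {"tz": "Europe/Paris", "note": "plain"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        for rec in records_in:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    # Binary mode: json.loads decodes the bytes itself (Python 3.6+)
    with open(path, "rb") as f:
        from_bytes = [json.loads(line) for line in f]

    # Text mode with an explicit encoding, independent of the platform default
    with open(path, encoding="utf-8") as f:
        from_text = [json.loads(line) for line in f]

print(from_bytes == from_text == records_in)  # True
```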

Mrinal  Jul 05, 2014  Dec 12, 2014
PDF
Page 345
Signal Frontier Analysis section

The example refers to a mean reverting strategy and not a momentum portfolio because we rank returns in descending order. E.g. the highest return gets the rank 1, which translates into a lower portfolio weight after demeaning and normalizing.

So either we change the text or, if we really want to provide an example of a momentum portfolio, we change the function calc_mom and use ascending=True, i.e.

ranks = mom_ret.rank(axis=1, ascending=True)
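
The effect of the ranking direction can be seen on a single row of made-up returns. The helper `to_weights` is hypothetical and only mirrors the demean-and-normalize step described above.

```python
import pandas as pd

# Hypothetical trailing returns for three assets on one date
mom_ret = pd.DataFrame({"A": [0.10], "B": [0.30], "C": [0.20]})

def to_weights(ranks):
    # Demean each row, then scale by the row's standard deviation
    demeaned = ranks.subtract(ranks.mean(axis=1), axis=0)
    return demeaned.divide(demeaned.std(axis=1), axis=0)

desc = to_weights(mom_ret.rank(axis=1, ascending=False))
asc = to_weights(mom_ret.rank(axis=1, ascending=True))

print(desc)  # best performer B gets the lowest weight (mean reversion)
print(asc)   # best performer B gets the highest weight (momentum)
```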

There is another small error in function strat_sr on page 346. Here when we compute the portfolio we use a lag value of 1, meaning that for portfolio at day t we use only information from day t-1 back. This is ok, however, when we then compute the total cumulative returns there is no need to again shift the portfolio by one day, as this implies that we just throw away one day of information, so the line:

port = port.shift(1).resample(freq, how='first')

should be:

port = port.resample(freq, how='first')

Note from the Author or Editor:
You're right about the momentum portfolio.

Editors, on page 345 can you replace the two usages of "momentum" with "mean reversion" and on Page 347, in the Figure 11-3 caption can you also make the same substitution.

The second note about the strat_sr function is not errata because the portfolio weights are the portfolio weights: they have to be shifted forward to compute the portfolio returns in the next period, so no changes needed there.

Anonymous  Jul 01, 2014  Dec 12, 2014
Printed
Page 38
United States

After defining the array prop_cumsum you want to call the method searchsorted to search for the 50th percentile. The code supplied is prop_cumsum.searchsorted(0.5), which throws the error Series object has no Attribute searchsorted

I got this to work sort of: numpy.searchsorted(prop_cumsum,0.5), the only problem is the output is every line number in the array followed by the index position. Can you shed any light on the code as written in the text and the code I got to work?
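
One approach that returns a single position is to call searchsorted on the underlying NumPy array. The proportions below are made up; this is a sketch, not the corrected code from the later printing.

```python
import pandas as pd

# Hypothetical proportions, sorted in descending order as in the example
prop = pd.Series([0.4, 0.3, 0.2, 0.1])
prop_cumsum = prop.cumsum()

# Operate on the underlying NumPy array to get a single integer position
position = prop_cumsum.to_numpy().searchsorted(0.5)

print(position)      # 1
print(position + 1)  # 2 values are needed to reach 50%
```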

Thanks

Note from the Author or Editor:
This is caused by API changes in pandas. We have fixed the code example in an overall review of the examples, so this will be addressed in the next printing.

Anonymous  Jun 25, 2014  Dec 12, 2014
PDF
Page 246
Example code

The example code on the page 246 (Plotting Maps: Visualizing Haiti Earthquake Crisis Data) no longer works due to change of pandas since v0.13.0 released on 31 Dec 2013.

To make it work,
x, y = m(cat_data.LONGITUDE, cat_data.LATITUDE)
should be
x, y = m(cat_data.LONGITUDE.values, cat_data.LATITUDE.values)

You may find details on http://stackoverflow.com/questions/23136159

Apart from this, it would also be great to add the following line at the end of the same example code to show the resulting plot.

plt.show()

Note from the Author or Editor:
Editors: please verify that this has been fixed in the overall code example review.

Younghoon Rhiu  Jun 21, 2014  Dec 12, 2014
Printed
Page 95
In [123]: and In [124]:

As in "In [84]:" on page 89, `randn()' should read `np.random.randn()' ...

Note from the Author or Editor:
Editors: can you please make the indicated change?

Replace

randn()

with

np.random.randn()

Kazuyoshi Furutaka  Jun 11, 2014  Dec 12, 2014
PDF
Page 420
Bottom third

The main restriction on function arguments it that the keyword arguments must follow the positional arguments (if any).

'it' should be 'is'

Note from the Author or Editor:
Editors: please change to "The main restriction on function arguments is that"
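
A minimal sketch of the rule stated in the corrected sentence. The function name and values are hypothetical and not taken from the book. Keyword arguments may appear in any order as long as they follow the positional ones, while a positional argument placed after a keyword argument is rejected when the code is compiled:

```python
def scale_and_shift(x, y, z=1.5):
    """Hypothetical helper: two positional parameters, one keyword parameter."""
    return z * (x + y)

# Positional arguments first, keyword arguments after: valid.
ok_result = scale_and_shift(5, 6, z=0.7)

# Keywords can also be used for the positional parameters, in any order.
also_ok = scale_and_shift(y=6, x=5, z=0.7)

# A positional argument after a keyword argument is a SyntaxError,
# so the bad call is compiled from a string to observe the failure.
try:
    compile("scale_and_shift(z=0.7, 5, 6)", "<example>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
```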

Nick Carchedi  Jun 06, 2014  Dec 12, 2014
PDF
Page 390
Next to paw prints at the top

"Assignment is also referred to as binding, as we are binding a name to an object. Variables names that have been assigned may occasionally be referred to as bound variables."

At the beginning of the second sentence, I think either 'variables' should be singular or the word 'names' should be removed. :-)

Note from the Author or Editor:
Editors: on Page 390, "Variables names" should be "Variable names"
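
For readers unsure what "binding" means in the quoted passage, here is a small sketch with illustrative variable names that are not from the book. Assigning one name to another binds a second name to the same object rather than copying it:

```python
a = [1, 2, 3]   # bind the name `a` to a new list object
b = a           # bind a second name to the same object (no copy is made)

b.append(4)     # mutate the object through one name

same_object = a is b   # both names are bound to one list
```

After this runs, the appended element is visible through `a` as well, because `a` and `b` are two bound names for a single list.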

Nick Carchedi  Jun 05, 2014  Dec 12, 2014
Printed
Page 405
first snippet in page

The code snippet about the "xrange" function needs correction.
Replace "x" with "i" in the following example:

sum = 0
for i in xrange(10000):
    # % is the modulo operator:
    if x % 3 == 0 or x % 5 == 0:
        sum += i


The right code should be:

sum = 0
for i in xrange(10000):
    # % is the modulo operator:
    if i % 3 == 0 or i % 5 == 0:
        sum += i

Note from the Author or Editor:
Good catch. Editors, please change "x" to "i" in the indicated code example as written by the errata reporter
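
For readers on Python 3, where xrange no longer exists, the corrected loop can be sketched as follows. Using range and renaming the accumulator to total (so the built-in sum is not shadowed) are choices of this sketch, not part of the book's text:

```python
total = 0
for i in range(10000):
    # % is the modulo operator
    if i % 3 == 0 or i % 5 == 0:
        total += i
```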

Gaston  Apr 15, 2014  Dec 12, 2014
Printed
Page 324
First paragraph of Exponentially-weighted functions

The formula for the moving average is written as

ma_t = a * ma_{t-1} + (a-1) * x_{-t}

with a the decay factor.

It should be:
ma_t = a * ma_{t-1} + (1-a) * x_{t}

Note from the Author or Editor:
Good catch, please make this change
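
The corrected recurrence can be sketched in plain Python as below. This is an illustration of the formula only, not the pandas implementation. The function name and the choice to seed the average with the first observation are assumptions of this sketch:

```python
def ewma(values, a):
    """Exponentially weighted moving average with decay factor `a`.

    Implements ma_t = a * ma_{t-1} + (1 - a) * x_t, seeding the average
    with the first observation (an assumption of this sketch).
    """
    result = []
    ma = None
    for x in values:
        ma = x if ma is None else a * ma + (1 - a) * x
        result.append(ma)
    return result

smoothed = ewma([1.0, 2.0, 3.0], a=0.5)
```

With the weights a and (1 - a) summing to one, each new value is a weighted blend of the previous average and the current observation. The originally printed (a - 1) factor would give the observation a negative weight.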

Bertrand Haut  Mar 06, 2014  Dec 12, 2014
PDF
Page 52
top

the two ways of computing top1000 give different results

Note from the Author or Editor:
I have made a note to look into this since we have made a full review of the book's code examples. There might be a bug in pandas, in which case I will report upstream to the dev team

Anonymous  Dec 07, 2013  Dec 12, 2014
PDF
Page 43
United States

The filename ml-1m/users.dat should be movielens/users.dat

Note from the Author or Editor:
Correct -- editors, could you make the indicated change (replace ml-1m with movielens)?

Anonymous  Dec 07, 2013  Dec 12, 2014
Printed
Page 33
middle

I get a ValueError: array dimensions must agree except for d_0 when I run line 371:
names1880.groupby('sex').births.sum().

names1880.groupby('sex')['births'].sum() works.

Note from the Author or Editor:
We have addressed this (I believe) in a review of the code examples. Will follow up with editors to verify that it is fixed

Allen Long  Nov 03, 2013  Dec 12, 2014
Printed
Page 266
top half

This refers to an issue that Ian Gow has also pointed out (Jul 06, 2013). A possible solution to the problem is mentioned below.

Define people as in the book. The values are different since 'randn' gives different numbers.

>>> people
               a         b         c         d         e
joe     2.011219  0.139871 -0.169945  1.801018  0.560313
steve  -0.878164  0.121969 -0.174672 -1.500867  1.548067
wes    -0.460175 -0.449552  1.213917  1.250151  0.191200
jim     2.286116 -1.253508 -0.567102 -0.802946  1.432807
travis -0.506323  0.807026  0.960450 -1.266392  0.567154

Define key as in the book:

>>> key
['one', 'two', 'one', 'two', 'one']

However, the error is that the following does not give zero mean:

demeaned = people.groupby(mapc,axis=0).transform(demean)
demeaned.groupby(mapc,axis=0).mean()

>>> demeaned = p.groupby(key).transform(demean)
>>> demeaned.groupby(key).mean()
            a         b         c         d         e
one -0.269472 -0.205111  0.181926  0.218409 -0.082785
two  0.404208  0.307667 -0.272888 -0.327613  0.124178

A possible solution is to do the following. Define mapc as:

mapc = {'joe':'one', 'steve':'two', 'wes':'one', 'jim':'two', 'travis':'one'}

and now the following produces zero mean:

>>> demeaned = p.groupby(mapc).transform(demean)
>>> demeaned.groupby(mapc).mean()
                a  b              c             d              e
one  7.401487e-17  0   3.700743e-17  3.700743e-17  -4.625929e-17
two  0.000000e+00  0  -1.387779e-17  5.551115e-17   0.000000e+00

Note from the Author or Editor:
We are working to address this in pandas:

https://github.com/pydata/pandas/issues/8046
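
Until that is resolved, the reader's workaround can be sketched as a self-contained script. It assumes a reasonably recent pandas, and the random data is generated fresh, so the values will differ from those shown above. Whether the list-based key also works depends on the pandas version in use:

```python
import numpy as np
import pandas as pd

def demean(arr):
    return arr - arr.mean()

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['joe', 'steve', 'wes', 'jim', 'travis'])

# Map each row label to its group explicitly, so the grouping does not
# depend on the positional order of the rows.
mapc = {'joe': 'one', 'steve': 'two', 'wes': 'one',
        'jim': 'two', 'travis': 'one'}

demeaned = people.groupby(mapc).transform(demean)
group_means = demeaned.groupby(mapc).mean()
```

The group means should come out as zero up to floating-point rounding.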

Qasim Iqbal  Oct 25, 2013  Dec 12, 2014
PDF
Page 9
2nd paragraph

In the OS X installation it states that we should type "gcc" at the terminal command line to see if gcc is installed. I'm running Mavericks and it is not installed. I believe it's been deprecated by Apple. Is there a workaround for this issue?

Thanks

Note from the Author or Editor:
Yes, Mavericks now uses clang instead of gcc. Editors, could you add a parenthesis that states "(or clang on newer versions of OS X)"

scottclausen@mac.com  Oct 23, 2013  Dec 12, 2014
Printed
Page 192
Belgique

A reader posted earlier the following comment:

"The section begins:
A common way to store multiple time series in databases and CSV is in so-called long or stacked format:
In [116]: ldata[:10]

However, the variable ldata has not been defined or initialized previously (or later) in the book. "

Perhaps it would be helpful to slightly alter the example to make it immediately testable by the audience of the book:

from pandas.core.reshape import melt, pivot

df = pd.read_csv('ch07/macrodata.csv') # original format

data = df.ix[:,['year', 'quarter', 'realgdp', 'infl', 'unemp']] # selection of variables

data['date'] = 10*data['year']+data['quarter'] # a quick identifier for the 'date' instead of separate year and quarter variables

del data['year']

del data['quarter']

ldata = melt(data, id_vars = ['date']) # long format

pivoted = ldata.pivot('date', 'variable', 'value'); pivoted.head()

# Note: 'item' becomes 'variable' in the rest of the example

Note from the Author or Editor:
OK, sounds good.

Editors, could you remove this text: (code to create this DataFrame omitted for brevity)

then, after the first code example (ldata[:10]), could you put a code block with this code used to create the example:

data = pd.read_csv('ch07/macrodata.csv')
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
data = DataFrame(data.to_records(),
                 columns=pd.Index(['realgdp', 'infl', 'unemp'], name='item'),
                 index=periods.to_timestamp('D', 'end'))

ldata = data.stack().reset_index().rename(columns={0: 'value'})

Patrick Jeuniaux  Oct 14, 2013  Dec 12, 2014
PDF
Page 29
2nd paragraph

totals should be titles:

"This produced another DataFrame containing mean ratings with movie totals as row
labels and gender as column labels. "

should read

"This produced another DataFrame containing mean ratings with movie titles as row
labels and gender as column labels. "

Note from the Author or Editor:
Good catch. Editors, please make the indicated change. Thanks

vrajmohan  Sep 26, 2013  Dec 12, 2014
PDF
Page 89
In [84]:

As randn is a function in the numpy.random module, the line should read:
data = np.random.randn(7, 4)

Note from the Author or Editor:
yes: editors, please make the indicated change

vrajmohan  Sep 17, 2013  Dec 12, 2014
PDF
Page 54
Code example at bottom of page

When I try to do 'a' in _ip.user_ns it throws a NameError exception and says "name '_ip' is not defined".

I can use the IPython magic %who to see if the variable is in memory or not.

Note from the Author or Editor:
I should have known better than to use a private IPython API.

editors, could we remove this altogether:

In [8]: 'a' in _ip.user_ns
Out[8]: True

change the line number of the subsequent prompt to 8 (instead of 9)

then, remove the following lines:

In [1]: 'a' in _ip.user_ns
Out[1]: False

and add these lines in its place:

In [10]: a
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-60b725f10c9c> in <module>()
----> 1 a

NameError: name 'a' is not defined

thanks

Todd Leonhardt  Sep 15, 2013  Dec 12, 2014
PDF
Page 38
Code on bottom of page 38 and top of page 39

searchsorted() is a method available for NumPy arrays, not Pandas Series. So to get the code in the book to work, I needed to first convert the Series to a NumPy array with array().

In final code, the get_quantile_count() function is as follows:

# Get number of distinct names in the top 50% of births using clever NumPy hack
def get_quantile_count(group, q=0.5):
    group = group.sort_index(by='prop', ascending=False)
    return array(group.prop.cumsum()).searchsorted(q) + 1

Note from the Author or Editor:
Ah, this is a casualty of some API changes in pandas:

Editors, could you change the indicated line to be instead:

group.prop.cumsum().values.searchsorted(q) + 1
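
To show what the corrected expression computes, here is a sketch using a plain NumPy array in place of the Series values. The proportions are hypothetical and assumed to be sorted in descending order already:

```python
import numpy as np

# Hypothetical proportions, already sorted in descending order
prop = np.array([0.4, 0.3, 0.2, 0.1])
prop_cumsum = prop.cumsum()

# Index of the first position where the cumulative sum reaches q
q = 0.5
idx = int(prop_cumsum.searchsorted(q))

# Positions are zero-based, so add 1 to get the number of names needed
count = idx + 1
```

Here the first name covers 0.4 and the first two cover 0.7, so two names are needed to reach half of the total.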

Todd Leonhardt  Sep 14, 2013  Dec 12, 2014
ePub
Page 727
Top of page, 1st code example

For the output to work as intended in the example, the print statement within def squares() needs to be outside the for loop within that generator function.

The way the code is written, the 'Generating squares....' print will occur each time a new number is generated. But if you move the print outside the for, it will print exactly once.

Note from the Author or Editor:
Good catch. Authors could you change the code cited to look like this (mind the 4-space indents):

def squares(n=10):
    print 'Generating squares from 1 to %d' % (n ** 2)
    for i in xrange(1, n + 1):
        yield i ** 2
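
A Python 3 rendering of the same generator, offered as a sketch, makes the difference easy to check. The stdout capture is only there to count how often the message appears and is not part of the book's example:

```python
import contextlib
import io

def squares(n=10):
    # Runs once, when iteration starts, because it sits before the loop.
    print('Generating squares from 1 to %d' % (n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

# Capture stdout to count how many times the message is printed.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    values = list(squares(3))

message_count = buffer.getvalue().count('Generating squares')
```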

Todd Leonhardt  Sep 14, 2013  Dec 12, 2014
ePub
Page 712
1st code example, list comprehension for enough_es within for loop

In the first code example for the Nested list comprehensions section, the "if name.count('e') > 2" within the list comprehension should have a ">=" instead of a ">".

Note from the Author or Editor:
You're right. Editors, could you please make the indicated change?
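
The effect of the two comparisons can be sketched as follows. The names are sample data chosen for illustration, and the single nested comprehension is a simplification of this sketch rather than the exact code on the page:

```python
# Hypothetical nested list of names, for illustration only
all_data = [['Tom', 'Billy', 'Jefferson', 'Andrew', 'Wesley', 'Steven', 'Joe'],
            ['Susie', 'Casey', 'Jill', 'Ana', 'Eva', 'Jennifer', 'Stephanie']]

# Names containing two or more lowercase e's
two_or_more = [name for names in all_data for name in names
               if name.count('e') >= 2]

# With the strict inequality, nothing in this data qualifies
more_than_two = [name for names in all_data for name in names
                 if name.count('e') > 2]
```

Note that str.count is case-sensitive, so the capital E in 'Eva' is not counted.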

Todd Leonhardt  Sep 14, 2013  Dec 12, 2014
Printed, PDF, ePub
Page 271
bottom

This statement

from shapelib import ShapeFile

requires the shapelib library. I tried to install shapelib and pyshapelib (the binding), but it gave an error

shapelibc.so: undefined symbol: SASetupDefaultHooks

Judging from the fact that pyshapelib was last updated in 2007, we are wondering if it is still compatible with newer versions of shapelib. Could you recommend another shapelib binding that will work with the examples of the book?

Note from the Author or Editor:
We may need to remove this example; I know there are various issues with basemap as well. I've made a note and I will follow up with O'Reilly editors

Anonymous  Sep 09, 2013  Dec 12, 2014
PDF, ePub
Page 192
out 116 and out 118

In chapter 7, in the subsection entitled "Pivoting "long" to "wide" Format" . . .

On further examination -- the ldata output in out 116 is only for part of ldata, as in ldata[:10]. This omits five rows of data that should be in ldata based on the rest of the examples in this section:

10 1959-12-31 00:00:00 infl 0.270
11 1959-12-31 00:00:00 unemp 5.600
12 1960-03-31 00:00:00 realgdp 2847.699
13 1960-03-31 00:00:00 infl 2.310
14 1960-03-31 00:00:00 unemp 5.200

Note from the Author or Editor:
I need to look into this, but I am going to try to add the code to generate the ldata table. I replied to your other question, but I didn't realize until further examination that the code was omitted. I made a note to myself and will address separately with the editors

Doug McCaleb  Aug 15, 2013  Dec 12, 2014
Printed
Page 199
Top of page.

The bins are divided into 18 to 25, 26 to 35, 35 to 60 and 60 and older.
Should be 18 to 26, 26 to 35, 35 to 60, 60 and older
or 18 to 25, 25 to 35, 35 to 60, 60 and older.

Note from the Author or Editor:
editors, can you please change the copy to:

18 to 25, 26 to 35, 36 to 60, and finally 61 and older
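
The corrected wording corresponds to right-closed intervals, which is the default behavior of pandas.cut. The sketch below reproduces that binning with the standard library only. The bin edges, labels, and sample ages are assumptions made for illustration:

```python
import bisect

# Bin edges: right-closed intervals (18, 25], (25, 35], (35, 60], (60, 100]
edges = [18, 25, 35, 60, 100]
labels = ['18 to 25', '26 to 35', '36 to 60', '61 and older']

def age_group(age):
    """Return the label of the right-closed bin containing `age`.

    Assumes edges[0] < age <= edges[-1]; the left edge of the first
    bin is itself excluded.
    """
    return labels[bisect.bisect_left(edges, age) - 1]

# Hypothetical whole-number ages, for illustration only
ages = [20, 25, 26, 35, 36, 60, 61]
groups = [age_group(age) for age in ages]
```

Because each interval includes its right edge, 25 falls in the first bin and 26 starts the second, which is why the revised copy reads "26 to 35" and "36 to 60".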

Arie Ellerbrak  Aug 02, 2013  Dec 12, 2014
Printed
Page 172
Last paragraph, 2nd sentence

Interally -> Internally

Note from the Author or Editor:
Confirmed typo

Arie Ellerbrak  Aug 02, 2013  Dec 12, 2014
Printed
Page 162
Middle of the page

In order for data.to_csv(sys.stdout, sep='|') to work you must
import sys
first

Note from the Author or Editor:
Editors, find this text on the page

(writing to sys.stdout so it just prints the text result)

change it to

(writing to sys.stdout so it just prints the text result; make sure to import sys)

use fixed width font for "import sys"

Arie Ellerbrak  Aug 01, 2013  Dec 12, 2014
Printed
Page 153
bottom of page

pdata.ix['Adj Close', '5/22/2012':, :] refers to Adj Close.

The table below that shows the Close, not the Adj Close.

Note from the Author or Editor:
Very strange. Editors, can you please change the indicated line of code to:

pdata.ix['Adj Close', '5/22/2012':, :]

See also revised code examples for an alternative replacement.

Arie Ellerbrak  Aug 01, 2013  Dec 12, 2014
Printed
Page 266
Top half

demeaned.groupby(key).mean() does not work for me; that is, it yields non-zero values (and not just due to rounding).

I think the issue is that the people DataFrame gets reorganized internally with rows in different order. This doesn't seem to affect the alignment of key within people. But it does affect demean, so the values of key no longer line up with their original position.

import pandas as pd
from pandas import DataFrame
import numpy as np


def demean(arr):
    return arr - arr.mean()

# This doesn't work.
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
demeaned = people.groupby(key).transform(demean)

print demeaned

print demeaned.groupby(key).mean()

produces

               a         b         c         d         e
Jim     0.223861 -2.072542  0.973977 -0.021754 -1.019689
Joe     0.326119  0.671576  0.487932 -0.404353  1.219755
Steve  -0.223861  2.072542 -0.973977  0.021754  1.019689
Travis  0.204880 -0.422467 -1.024938 -0.555061 -0.563228
Wes    -0.530999 -0.249109  0.537006  0.959414 -0.656527

            a         b         c         d         e
one -0.177000 -0.083036  0.179002  0.319805 -0.218842
two  0.265499  0.124555 -0.268503 -0.479707  0.328264

Note from the Author or Editor:
This appears to be a bug in pandas unfortunately. I have reported it to the dev team here -- the appropriate action here is to fix the bug rather than changing the book text:

https://github.com/pydata/pandas/issues/8046

Ian Gow  Jul 06, 2013  Dec 12, 2014
Printed
Page 124
table 5.5

The description for the argument copy is self-contradictory. It appears to say that copy true means don't copy.

Note from the Author or Editor:
The text could be clearer. Editors, could you change "Otherwise" to read "If False" (use fixed width font for the False) in the table?

gwideman  Jul 03, 2013  Dec 12, 2014
ePub
Page 46
printed text,

Code from Safari:
In [541]: import numpy as np

In [542]: data = {i : randn() for i in range(7)}

This causes an error:
NameError: global name 'randn' is not defined

This works
data = {i : np.random.randn() for i in range(7)}

Appears there is a problem with the 'import numpy as np' being incomplete.

Note from the Author or Editor:
Good catch, and I believe we tried to correct this error in the last revision.

Editors, could you replace the indicated randn with np.random.randn ? thanks
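
With the replacement applied, the example runs in a plain Python session that has only imported NumPy. A minimal sketch:

```python
import numpy as np

# randn lives in the numpy.random module, so reach it through the np alias
data = {i: np.random.randn() for i in range(7)}
```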

Anonymous  Jun 24, 2013  Dec 12, 2014
Printed
Page 152
Middle

For line [294] of the iget_value code example, the second ")" after the call to reshape(3, 2) is incorrect.

Note from the Author or Editor:
I believe this is already fixed in the second printing

jworeilly  Jun 07, 2013  Dec 12, 2014
Printed
Page 152
Second paragraph

Duplicate colons introduce the second example code block.

Note from the Author or Editor:
Please remove the unnecessary colon

jworeilly  Jun 07, 2013  Dec 12, 2014
Printed, PDF, ePub
Page 6-8
Installation and Setup

Dear Sirs:

I have just purchased Wes McKinney's Python for Data Analysis.

I am trying to install Python as instructed on pages 6-8 of the book, but I am running into problems.

It appears that the Python package that comes with EPDFree and the Pandas library are both essential for me to use the book.

When I try to install Pandas on top of EPDFree (which is now Canopy Express), I get the error message:

"Python version 2.7 required, which was not found in the registry."

I am running Windows 7 (32-bit).

The author recommends uninstalling the previous version of Python and then installing EPDFree, which has been changed to Enthought Canopy.

After I do that, Python does not appear in Add or Remove Programs anymore, but Enthought Canopy does.

The Canopy interface works, and it can run a simple script. It says that, contrary to the error message, I do have version 2.7 of Python installed.

The author recommends installing pandas-0.9.0.win32-py2.7.exe. Only version 11 is now available, so I downloaded that.

When I googled the error message, I got a suggestion to add C:\Python27; and C:\Python27\Scripts; to my system path, but that did not help.

Google also gave me a suggestion to uninstall Python (which means Canopy in this case) for all users and re-install for just me.

This also did not help.

As things now stand, I do not think I will be able to make any use of the book.

Is there a forum or an author's page that addresses this problem?

Thank you,
John Chesnut




Note from the Author or Editor:
Since publishing the book Enthought have changed their Python distribution so that the directions are now incompatible.

If you run into this problem please install the free Anaconda distribution for your platform (which includes pandas) from here:

http://continuum.io/downloads

Anonymous  May 28, 2013  Dec 12, 2014
Printed
Page 77
Top bullet points

The third bullet point in the sample configuration changes is unnecessary: it repeats the first clause of the second bullet point.

Note from the Author or Editor:
good catch. Editors, could you remove the 3rd bullet point?

jworeilly  May 26, 2013  Dec 12, 2014
Printed
Page 67
Last sentence of third paragraph

Text reads "Here is a simple list of 700,000 strings ..." but the sample code produces 600,000 strings.

Note from the Author or Editor:
Good catch. Editors, could you change the copy to say 600,000 instead of 700,000?

James Williamson  May 26, 2013  Dec 12, 2014
Printed, PDF
Page 106
Table 4-7

In description of lstsq, replace "y = Xb" with the more commonly used "Ax = b"

Wes McKinney  May 13, 2013  May 17, 2013
Printed, PDF
Page 106
Table 4-7

For pinv description remove the word "square" (this function does not require that the matrices be square)

Wes McKinney  May 13, 2013  May 17, 2013
Printed, PDF
Page 99
Second to last paragraph

"scalers" should be "scalars"

Wes McKinney  May 13, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 107
Middle of page

Change "See table Table 4-8..." to "See Table 4-8..."

Wes McKinney  May 12, 2013  May 17, 2013
Printed
Page 363
Bottom of page

In box, The Broadcasting Ru should be The Broadcasting Rule

Wes McKinney  May 12, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 24
two fifths down the page

Found same problem as CJ:

"In the following line:

operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')

np was not defined, so this line gives an error"

Question: Why don't any of these known errata get confirmed/addressed by the author or staff at O'Reilly?

Note from the Author or Editor:
On page 21 please change the code line In [290]: about halfway down the page from

In [290]: import pandas as pd

to

In [290]: import pandas as pd; import numpy as np

This mistake is fairly minor (all things considered) as these code examples are intended to be run in IPython in "pylab" mode (ipython --pylab) which will have imported NumPy and created the np alias. Sorry about that
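
Outside of pylab mode, the explicit import is all that is needed. The sketch below shows the same np.where pattern on a plain list of strings. The user-agent strings are hypothetical, and a list-based boolean mask stands in for the pandas str.contains call:

```python
import numpy as np

# Hypothetical user-agent strings, for illustration only
agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64)',
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2)',
          'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)']

# Boolean mask: does each string mention Windows?
is_windows = np.array(['Windows' in agent for agent in agents])

operating_system = np.where(is_windows, 'Windows', 'Not Windows')
```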

Moritz Heukamp  May 11, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 432
Last line in Table A-6

IS:

True is the file is closed.

SHOULD BE:

True if the file is closed.

Note from the Author or Editor:
Please make change as submitter described (replace is with if)

Jose Manuel Martí  May 10, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 427
Last code example in section "Currying: Partial Argument Application"

In the code comment:

# Take the 60-day moving average of of all time series in data

"of" is repeated.

Note from the Author or Editor:
Please fix typo as described (remove duplicate "of")

Jose Manuel Martí  May 09, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 418
last line

IT IS:

loc_mapping = dict((val, idx) for idx, val in enumerate(strings)}

SHOULD BE:

loc_mapping = dict((val, idx) for idx, val in enumerate(strings))

NOTE: The last character of the code line should be ) and not }, probably from an erroneous copy and paste of the previous code line. It's obvious, but I checked this with IPython.

Note from the Author or Editor:
Please fix typo as submitter described (replace curly brace with parenthesis)

Thanks!
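
The two spellings that are easy to mix up can be sketched side by side. The sample list is hypothetical, and the second variable name is introduced here only for comparison:

```python
strings = ['foo', 'bar', 'baz']  # hypothetical sample data

# Generator expression passed to dict(): the call closes with a parenthesis
loc_mapping = dict((val, idx) for idx, val in enumerate(strings))

# Equivalent dict comprehension: this is the form that closes with a brace
loc_mapping_comp = {val: idx for idx, val in enumerate(strings)}
```

Both produce the same mapping from each string to its position in the list.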

Jose Manuel Martí  May 09, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 324
bottom of page

In [570]: spx_px has not been defined in the chapter yet

Note from the Author or Editor:
Please add code line just above In [570]:

In [569]: spx_px = close_px_all['SPX']

Make sure there is a blank line between that code line and the next one to keep the styling consistent

Anonymous  Apr 18, 2013  May 17, 2013
Printed
Page 308
middle of page

Out[470] should be 'Period('2007-06', 'M')'

Note from the Author or Editor:
Confirmed, please make change as described

There is also a formatting mistake right before "Out [470]:" , please fix that also

Anonymous  Apr 18, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 241-242
Fig 8-23

Fig 8-23 appears to be identical to Fig 8-22

Note from the Author or Editor:
Not sure what happened here, 8-23 is supposed to be a different figure if you read the text closely.

Here is a figure to replace 8-23 (should just be a drop-in replacement), editors please contact me if you need any changes to this:

https://www.dropbox.com/s/annqtoank0snrwu/scatter_matrix_fix_20130512.pdf

Anonymous  Apr 18, 2013  May 17, 2013
Page 235
paragraph1, sentence 1

par 1 sentence 1: should read '... is as simple as ...'

Note from the Author or Editor:
Please fix typo as described. thanks!

Anonymous  Apr 18, 2013  May 17, 2013
Page 223
Table 8-1

Table 8-1: the description for 'subplot_kw' is cut off

Note from the Author or Editor:
Please change the description for subplot_kw to

Dict of keywords passed to <literal>add_subplot</literal> call used to create each subplot.

Anonymous  Apr 18, 2013  May 17, 2013
Printed
Page 125
Last sentence

last sentence: should read 'Here are some examples of this:'

Note from the Author or Editor:
please fix as described. thanks!

Anonymous  Apr 18, 2013  May 17, 2013
Page 107
Table 4-8

Table 4-8: the description for binomial should read 'Draw samples
from a binomial distribution'

Note from the Author or Editor:
Please fix as described. thanks!

Anonymous  Apr 18, 2013  May 17, 2013
Printed
Page 90
paragraph 1, sentence 2

par 1, sentence 2 is a fragment

Note from the Author or Editor:
Change the first two sentences of that paragraph to

Suppose each name corresponds to a row in the <literal>data</literal> array, and we wanted to select all the rows with corresponding name <literal>'Bob'</literal>.

Anonymous  Apr 18, 2013  May 17, 2013
Printed
Page 75
paragraph 2, sentence 2

'willl' should be 'will'

Note from the Author or Editor:
Confirmed. thanks

Anonymous  Apr 18, 2013  May 17, 2013
Printed
Page 69
Paragraph 4, last sentence

'while' should be 'whole'

Note from the Author or Editor:
Confirmed, thanks

Anonymous  Apr 18, 2013  May 17, 2013
Printed
Page 65
Paragraph 1

what is referred to as Table 3-3 in the text is actually displayed as Table 3-4

Note from the Author or Editor:
Confirmed. Please fix reference to Table 3-4

Anonymous  Apr 18, 2013  May 17, 2013
Printed, PDF
Page 23
middle of page

In the PDF version, the url overshoots the page

Note from the Author or Editor:
Editors please insert a line break like so in the console output

Out[304]: u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'

Anonymous  Apr 18, 2013  May 17, 2013
PDF, ePub, Mobi
Page 192
Beginning of section Pivoting "long" to "wide" Format

The section begins:
A common way to store multiple time series in databases and CSV is in so-called long or stacked format:
In [116]: ldata[:10]

However, the variable ldata has not been defined or initialized previously (or later) in the book.

Note from the Author or Editor:
Yeah, I left the code to make that DataFrame out as it was derived in a mungy way from the macrodata used earlier.

Editors: please put a note in parentheses after "stacked format" that says

"... or stacked format (code to create this DataFrame omitted for brevity):" or something. pretty trivial for the user to type this in

David Kimery  Apr 17, 2013  May 17, 2013
Printed
Page 400
middle of page

The text currently says:
"When aggregating of otherwise grouping time series data, ..."
It probably should say
"When aggregating or otherwise grouping time series data"

Note from the Author or Editor:
Please fix typo as described, thanks

Anonymous  Apr 15, 2013  May 17, 2013
Printed, PDF
Page 100
United States

1 * cond1 + 2 * cond2 + 3 * -(cond1 | cond2)

is not equivalent to the two other code examples offered. In particular, if cond1 and cond2 are both False, the result is 0, not 3.

Note from the Author or Editor:
Oops.

Please change that line of code to

1 * (cond1 & -cond2) + 2 * (cond2 & -cond1) + 3 * -(cond1 | cond2)
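
The corrected expression can be checked against a nested np.where over all four combinations of the two conditions. In this sketch the nested form is the assumed reference (0 if both, 1 if only cond1, 2 if only cond2, 3 otherwise). The ~ operator is used for negation because current NumPy releases do not support unary minus on boolean arrays:

```python
import numpy as np

cond1 = np.array([True, True, False, False])
cond2 = np.array([True, False, True, False])

# Reference: nested np.where (0 if both, 1 if only cond1, 2 if only cond2, else 3)
nested = np.where(cond1 & cond2, 0,
                  np.where(cond1, 1,
                           np.where(cond2, 2, 3)))

# Arithmetic form from the correction, with ~ for logical negation
arithmetic = (1 * (cond1 & ~cond2)
              + 2 * (cond2 & ~cond1)
              + 3 * ~(cond1 | cond2))
```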

Aaron Schumacher  Apr 07, 2013  May 17, 2013
Printed, PDF
Page 152
Final code block

The line currently is:

frame = DataFrame(np.arange(6).reshape(3, 2)), index=[2, 0, 1])

It should instead be:

frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

Note from the Author or Editor:
Confirmed. please change as described

Joshua Lande  Mar 14, 2013  May 17, 2013
Printed
Page 83
Last line in table 4-2 on this page

"float64, float128" should read "float64" only. "float128" already correctly appears on the next line in the table (on page 84).

Note from the Author or Editor:
Correct. Please delete the ", float128" there

Dan Grossman  Jan 25, 2013  May 17, 2013
Printed
Page 86
Final paragraph, first sentence.

"... especially if they have used ..."

should read

"... especially if you have used ..."

Note from the Author or Editor:
Thanks, please correct typo as described

Dan Grossman  Jan 25, 2013  May 17, 2013
Printed, PDF, ePub, Mobi, Other Digital Version
Page 358
In Figure 12-3

arr.reshape((3,4), order=?)

should read

arr.reshape((4,3), order=?)

Note from the Author or Editor:
Correct, please fix figure text as described. Surprised this one evaded me but it's obvious once you see it =)
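
For reference, the (4, 3) reshape under the two memory orders can be sketched as below. The 12-element input is an assumption chosen to match the shape in the corrected caption:

```python
import numpy as np

arr = np.arange(12)

# Row-major (C) order fills each row before moving to the next
c_order = arr.reshape((4, 3), order='C')

# Column-major (Fortran) order fills each column before moving to the next
f_order = arr.reshape((4, 3), order='F')
```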

Dan Grossman  Jan 25, 2013  May 17, 2013
PDF
Page 170
Middle

The Output of perf = DataFrame(data) is not correct. As printed:

In [928]: perf
Out[928]:
Empty DataFrame
Columns: array([], dtype=int64)
Index: array([], dtype=int64)

But should be:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 648 entries, 0 to 647
Data columns:
AGENCY_NAME 648 non-null values
CATEGORY 648 non-null values
DESCRIPTION 648 non-null values
FREQUENCY 648 non-null values
INDICATOR_NAME 648 non-null values
INDICATOR_UNIT 648 non-null values
MONTHLY_ACTUAL 648 non-null values
MONTHLY_TARGET 648 non-null values
PERIOD_MONTH 648 non-null values
PERIOD_YEAR 648 non-null values
YTD_ACTUAL 648 non-null values
YTD_TARGET 648 non-null values
dtypes: int64(2), object(10)

Note from the Author or Editor:
Confirmed. Please change the text of Out[928]: to

<class 'pandas.core.frame.DataFrame'>
Int64Index: 648 entries, 0 to 647
Data columns:
AGENCY_NAME 648 non-null values
CATEGORY 648 non-null values
DESCRIPTION 648 non-null values
FREQUENCY 648 non-null values
INDICATOR_NAME 648 non-null values
INDICATOR_UNIT 648 non-null values
MONTHLY_ACTUAL 648 non-null values
MONTHLY_TARGET 648 non-null values
PERIOD_MONTH 648 non-null values
PERIOD_YEAR 648 non-null values
YTD_ACTUAL 648 non-null values
YTD_TARGET 648 non-null values
dtypes: int64(2), object(10)

Thomas Maloney  Jan 04, 2013  May 17, 2013
PDF
Page 160
United States

keep_date_col description is inconsistent with the pandas documentation. Should be:

If joining columns to parse date, keep the joined columns. Default False

Note from the Author or Editor:
Confirmed. Please change as described

Thomas Maloney  Jan 04, 2013  May 17, 2013
PDF
Page 53
Table 3-1

Commands are given as 'Ctrl-P', 'CTRL-A', etc. with the letter in UPPERCASE, which is potentially confusing, since the keys are to be pressed without the shift key (except 'Ctrl-Shift-v'). In fact, without the example containing a 'Shift', I would not be sure this is an error.

Note from the Author or Editor:
A fair point.

Editors: Please change the single letters in the command shortcuts in Table 3-1 to lowercase. E.g.

Ctrl-Shift-V

should be

Ctrl-Shift-v

and Ctrl-B should be Ctrl-b

Thanks

Steven Pav  Dec 27, 2012  May 17, 2013
PDF
Page 23

For the code example following:

In [301]: tz_counts[:10].plot(kind='barh', rot=0)

The 'plot' function has no visible effect. Should this be run in IPython? (That also doesn't work.)

Note from the Author or Editor:
There should be a note at the beginning of the chapter to run IPython in pylab mode.

Editors: please place a note at the end of the opening paragraph that says:

"To follow along with these examples, you should run IPython in Pylab mode by running <literal>ipython --pylab</literal> at the command prompt."

Brian Piercy  Dec 04, 2012  May 17, 2013
Printed
Page 54
2nd paragraph

... designed to faciliate common tasks ...

Note from the Author or Editor:
Please fix facilitate typo

Frans Koning  Nov 22, 2012  May 17, 2013
PDF
Page 282
somewhere

Should be return totals.order(ascending=False)[:n] (was [-n:])

Note from the Author or Editor:
Correct. Please fix code typo as described (replace [-n:] with [:n])
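
Why the slice matters can be sketched with a small hypothetical Series. This sketch assumes a current pandas, where sort_values takes the place of the older order method, and it uses iloc to make the positional slicing explicit:

```python
import pandas as pd

# Hypothetical totals, for illustration only
totals = pd.Series({'a': 10, 'b': 40, 'c': 30, 'd': 20})
n = 2

# Sort from largest to smallest
ranked = totals.sort_values(ascending=False)

top_n = ranked.iloc[:n]      # first n of a descending sort: the largest values
bottom_n = ranked.iloc[-n:]  # last n of a descending sort: the smallest values
```

After a descending sort, the leading slice holds the top n entries, while the trailing slice would return the smallest ones instead.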

Miki Tebeka  Nov 09, 2012  May 17, 2013
PDF
Page 241
somewhere

scatter_matrix(trans_data, diagonal='kde', color='k', alpha=0.3)

should be

pd.scatter_matrix(trans_data, diagonal='kde', color='k', alpha=0.3)

Note from the Author or Editor:
Thanks. Please change code as described (add pd. to start of statement)

Miki Tebeka  Nov 09, 2012  May 17, 2013
PDF
Page 204
somewhere

ch07/movies.dat is not there (is in ch02/movielens)

Note from the Author or Editor:
Thanks.

please change 'ch07/movies.dat' to 'ch02/movielens/movies.dat' in the code

Miki Tebeka  Nov 09, 2012  May 17, 2013
Printed
Page vi
United States

The technical editor Hugh Brown is listed as Hugh White.

Not sure of the page number.

Note from the Author or Editor:
Yes, many apologies. His name is Hugh Brown (and he was a great editor!)

Hugh Brown  Nov 05, 2012  May 17, 2013
PDF
Page 365
image

Quote from page 364:
"See Figure 12-6 for another illustration, this time subtracting a two-dimensional array from a three-dimensional one across axis 0."

Figure 12-6 does not show subtraction, nor do the numbers representing the NumPy data make any sense.

Note from the Author or Editor:
The figure and text need fixing.

The text: change "subtracting... from ..." to "adding...to..."

In Figure 12-6, change the numbers in the result to be double what they are, so instead of 0, 1, 2, 3, 4, 5, 6, 7, make them, in the corresponding order, double that: 0, 2, 4, 6, ...
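
The corrected figure can be mimicked with a short NumPy sketch. The shapes, (4, 2) for the two-dimensional array and (3, 4, 2) for the three-dimensional one, are assumptions chosen so that each slice holds the values 0 through 7:

```python
import numpy as np

# Hypothetical shapes: a (4, 2) array added to a (3, 4, 2) array
arr2d = np.arange(8).reshape((4, 2))
arr3d = np.tile(arr2d, (3, 1, 1))   # three stacked copies, shape (3, 4, 2)

# The 2-D array is broadcast across axis 0 of the 3-D array
result = arr3d + arr2d
```

Every slice of result along axis 0 then holds 0, 2, 4, 6, and so on, double the original values, as the revised figure should show.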

klo  Oct 31, 2012  May 17, 2013
Mobi
Page 1
On Kindle: "Location 325 of 13301"

Sorry, don't know the proper page number (I'm on a kindle), so I entered 1.

In Chapter 1, under the numpy description, one of the bullet points has a minor grammatical error. It reads:

"Tools for integrating connecting C, C + +, and Fortran code to Python"

I assume "integrating connecting" was not intended as is.

Note from the Author or Editor:
on page 4 of the print text / PDF change "integrating connecting C, C++, ..." to "integrating C, C++, ..."

Anonymous  Oct 24, 2012  May 17, 2013
PDF
Page 123
Table 5-6, 2nd row

"Selects single row of subset of rows from the DataFrame."

should probably be

"Selects single row or subset of rows from the DataFrame."

Note from the Author or Editor:
Confirmed typo as described

Guan Yang  Aug 16, 2012  May 17, 2013
PDF
Page 40
in [3]

While executing the code from the book:

In [3]: data = {i : randn() for i in range(7)}

I got the following error: NameError: global name 'randn' is not defined.

I solved it by using "from scipy import randn".

(Perhaps included packages depend on ipython configuration.)

Note from the Author or Editor:
Page 46 in the printed text, please insert the line

In [541]: import numpy as np

right above the In [542]: ...

and make sure there is a blank line for consistent formatting

Anonymous  Aug 15, 2012  May 17, 2013
PDF
Page 119
Table 5-5

The description of the copy option for reindex in table 5-5 of the current (as of 8/2/12) preprint version may be wrong. It says that copy is "Do not copy underlying data if new index is equivalent to old index."

I believe this is the opposite of copy's behavior, and the words "Do not" should be removed.

Note from the Author or Editor:
Change text to

If True, always copy underlying data even if new index is equivalent to old index. Otherwise, do not copy the data when the indexes are equivalent.

Dan Becker  Aug 02, 2012  May 17, 2013