The errata list is a list of errors and their corrections that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is noted in the entry.

The following errata were submitted by our customers and approved as valid errors by the author or editor.
**Page p.158, l. -2**

'optimal' should be 'optional'.

Note from the Author or Editor: Correct, it should be 'optional'.

Submitted by HIDEMOTO NAKADA on Aug 23, 2025.
**Page p.343, Table 15-4**

The description of the `include_keys` parameter seems to be wrong. The documentation says: "Include the columns used to partition the DataFrame in the output."

Note from the Author or Editor: This method was updated after the book was published. Three descriptions need to be updated:

- `maintain_order`: Ensure that the order of the groups is consistent with the input data. This is slower than a default partition-by operation.
- `include_key`: Include the columns used to partition the DataFrame in the output.
- `as_dict`: Return a dictionary instead of a list. The dictionary keys are tuples of the distinct group values that identify each group.

Submitted by HIDEMOTO NAKADA on Aug 23, 2025.
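The corrected parameter descriptions can be illustrated with a small pure-Python sketch. The `partition_by` helper below is hypothetical: it only emulates the documented `include_key`/`as_dict` semantics on a list of dicts, and is not the Polars implementation.

```python
from collections import defaultdict

def partition_by(rows, keys, include_key=True, as_dict=False):
    # Hypothetical sketch of the documented behavior: group rows by the
    # key columns, optionally dropping the key columns from the output.
    groups = defaultdict(list)
    for row in rows:
        k = tuple(row[key] for key in keys)
        out = row if include_key else {c: v for c, v in row.items() if c not in keys}
        groups[k].append(out)
    # as_dict=True: keys are tuples of the distinct group values.
    return dict(groups) if as_dict else list(groups.values())

rows = [{"grp": "a", "x": 1}, {"grp": "b", "x": 2}, {"grp": "a", "x": 3}]
parts = partition_by(rows, ["grp"], as_dict=True)
assert set(parts) == {("a",), ("b",)}
```

Note that, per the corrected description, the dictionary keys are tuples even when partitioning by a single column.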
**Page p.421, l. 8**

"This means that in order to match data across two languages, you first have to deserialize the data from one format, then serialize it to the format of the other language." I guess we need to serialize first, and then deserialize.

Note from the Author or Editor: Correct, it should be: "When transferring data between two programming languages, you first serialize the data in the source language into a common format, then deserialize it in the target language to reconstruct the data in memory."

Submitted by HIDEMOTO NAKADA on Sep 02, 2025.
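The corrected ordering (serialize at the source, then deserialize at the target) is the familiar round-trip through a common format; a minimal stdlib sketch using JSON as that common format:

```python
import json

# Serialize in the "source" side: in-memory object -> common text format.
record = {"name": "citibike", "rides": 3}
wire = json.dumps(record)

# Deserialize in the "target" side: text -> reconstructed in-memory object.
rebuilt = json.loads(wire)
assert rebuilt == record
```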
**Page p.421, l. -5**

"a computer with multiple computation cores can process the same instruction on multiple data points at the same time." SIMD instructions do not require multiple cores; 'a computer with SIMD-capable processing units' might be better.

Note from the Author or Editor: Correct. Instead of: "By lining up your data in memory, or vectorizing it, a computer with multiple computation cores can process the same instruction on multiple data points at the same time." it should be: "By laying out data contiguously in memory, or vectorizing it, the processor's SIMD-capable units can apply the same instruction to multiple data points simultaneously, greatly improving performance."

Submitted by HIDEMOTO NAKADA on Sep 02, 2025.
**Page p.11, Part IV**

> Chapter 15 shows how to reshape data, through (un)pivoting, stacking, and extending.

Stacking and extending are not explained in Chapter 15, but in Chapter 14.

Note from the Author or Editor: "Chapter 14 explains how to combine different DataFrames using joins and concatenations. Chapter 15 shows how to reshape data, through (un)pivoting, stacking, and extending." should be: "Chapter 14 explains how to combine different DataFrames using joins and concatenations. Chapter 15 shows how to reshape data, through (un)pivoting, transposing, and exploding."

Submitted by HIDEMOTO NAKADA on Sep 09, 2025.
**Page p.325, Takeaways, first item**

> Combine DataFrames with exact matches in their join columns using df.join(). You can fine-tune the join with the tolerance and by arguments and by selecting the appropriate strategy.

The second sentence is explaining `join_asof`, not `join`.

Note from the Author or Editor: "Combine DataFrames with exact matches in their join columns using df.join(). You can fine-tune the join with the tolerance and by arguments and by selecting the appropriate strategy. • Combine numerical or temporal columns in DataFrames on their nearest values using df.join_asof()." should be: "Combine DataFrames with exact matches in their join columns using df.join(). • Combine numerical or temporal columns in DataFrames on their nearest values using df.join_asof(). You can fine-tune the join with the `tolerance` and `by` arguments and by selecting the appropriate strategy."

Submitted by HIDEMOTO NAKADA on Sep 09, 2025.
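The nearest-value matching that `df.join_asof()` performs (here the "backward" strategy, with a `tolerance` cutoff) can be sketched in pure Python. The helper below is hypothetical and illustrative only; Polars does this natively on sorted columns.

```python
import bisect

def join_asof_backward(left_times, right_times, tolerance=None):
    # Hypothetical sketch: for each left value, match the nearest
    # earlier-or-equal right value, rejecting matches farther than
    # `tolerance`. Both inputs must be sorted ascending.
    matches = []
    for t in left_times:
        i = bisect.bisect_right(right_times, t) - 1  # nearest <= t
        if i < 0 or (tolerance is not None and t - right_times[i] > tolerance):
            matches.append(None)  # no match within tolerance
        else:
            matches.append(right_times[i])
    return matches

assert join_asof_backward([1, 5, 10], [0, 4, 9], tolerance=2) == [0, 4, 9]
assert join_asof_backward([1, 5, 10], [0, 4, 6], tolerance=2) == [0, 4, None]
```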
**Page p.448, Table A-1**

The table includes Dask, DuckDB, and PySpark, while they are not mentioned in this chapter. I think it would be better to remove them from the table.

Note from the Author or Editor: Agreed. While they were relevant for the general benchmarking we did, we only compare GPU packages in the appendix. The table should only feature:

- cuDF
- pandas
- Polars
- Polars on GPU

Submitted by HIDEMOTO NAKADA on Sep 09, 2025.
**Chapter 11, p.240: Slicing, second bullet point**

The example implies the regular Python convention of `df.slice(start, stop)`, but the second argument is actually the length of the slice. This is particularly important to remember when using `pl.col().list.slice()`, as the error can easily go unnoticed.

Note from the Author or Editor: Correct. Should be: "For example, keep from the third to the seventh row with df.slice(2, 4). Here the first argument is the starting index, and the second argument is the length of the slice."

Submitted by Ben Hardcastle on Sep 09, 2025.
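The offset-plus-length reading of `df.slice()` can be contrasted with Python's start/stop convention on a plain list; `slice_like_polars` is a hypothetical helper for illustration only:

```python
def slice_like_polars(seq, offset, length):
    # The second argument is the *length* of the slice, not a stop index.
    return seq[offset:offset + length]

rows = list(range(10))
assert slice_like_polars(rows, 2, 4) == [2, 3, 4, 5]  # four rows from index 2
assert rows[2:4] == [2, 3]  # Python's start:stop convention yields only two
```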
**Page p.76, Table 4-1**

> Int8: 8-bit signed integer type. -128 to 128

The range should be -128 to 127.

Note from the Author or Editor: Correct, the range is indeed -128 to 127 (inclusive).

Submitted by HIDEMOTO NAKADA on Sep 14, 2025.
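The corrected range follows from two's-complement arithmetic: an n-bit signed integer spans -2^(n-1) through 2^(n-1) - 1, so the positive side has one fewer value than the negative side:

```python
def signed_range(bits):
    # Inclusive range of an n-bit two's-complement signed integer.
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

assert signed_range(8) == (-128, 127)       # Int8
assert signed_range(16) == (-32768, 32767)  # Int16
```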
**Page p.335, Table 15-2**

Arguments `id_vars` and `value_vars` should be `index` and `on`, respectively. I think the code outside the table is correct; I guess these were changes in the API while the book was being written. I think the Lazy Pivot box on p.334 is also out of date.

Note from the Author or Editor: The API has indeed changed since the book was published.

Submitted by Ian Gow on Mar 13, 2026.
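The renamed arguments can be illustrated with a pure-Python sketch of unpivoting. The `unpivot` helper below is hypothetical: it only emulates the `index`/`on` semantics (formerly `id_vars`/`value_vars`) on a list of dicts, not the Polars implementation.

```python
def unpivot(rows, index, on, variable_name="variable", value_name="value"):
    # Hypothetical sketch: keep the `index` columns, and turn each `on`
    # column into a (variable, value) pair on its own output row.
    out = []
    for row in rows:
        for col in on:
            rec = {k: row[k] for k in index}
            rec[variable_name] = col
            rec[value_name] = row[col]
            out.append(rec)
    return out

rows = [{"id": 1, "a": 10, "b": 20}]
assert unpivot(rows, index=["id"], on=["a", "b"]) == [
    {"id": 1, "variable": "a", "value": 10},
    {"id": 1, "variable": "b", "value": 20},
]
```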
**Page 13, first code block, below second paragraph**

The end of the URL for the Citi Bike data on S3 should read `202403-citibike-tripdata.zip`. The print and notebook versions contain a `.csv` suffix: `202403-citibike-tripdata.csv.zip`.

Note from the Author or Editor: Good find! This has been fixed in the repo that comes with the book, and we'll fix the print in the next edition.

Submitted by Andrew Campbell on Jul 06, 2025.
**Page 134, second key note**

The second note states "method is one the many" but should state "method is one of the many".

Note from the Author or Editor: Good find, it should be "The `Expr.str.ends_with()` method is one of the many String methods...".

Submitted by Thomas Hefferman on Aug 18, 2025.
**Page 330, Table 15-1**

Argument `columns` should be `on`.

Note from the Author or Editor: The API has changed since the book was published.

Submitted by Ian Gow on Mar 13, 2026.