This errata list records errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Version | Location | Description | Submitted By | Date submitted | Date corrected
|
Figure 3-5 |
In the online O'Reilly version, in Figure 3-5, in the "Physical Plan" part, I think the "filter" should be on the events, not on the users, because the query is filter(events("date") > "2015-01-01").
Note from the Author or Editor: This is a very astute reader! The filter should be on the events, not the users.
The "Physical Plan" portion of the diagram needs to be redrawn. The "filter" box should be on the left, above "scan (events)" instead of on the right above "scan (users)".
|
Anonymous |
Nov 21, 2020 |
Sep 30, 2022 |
|
Page ch8
Chapter 8 - Section: "Five Steps to define a streaming query" - "Putting it all together - Python code" |
In Chapter 8, section "Putting it all together", the code listed below does not produce the expected word count output:
========
words = lines.select(split(col("value"), "\\s").alias("word"))
counts = words.groupBy("word").count()
=========
The above code results in an array value being counted rather than individual words. The first line needs to use the explode() function to have the desired effect.
Note from the Author or Editor: Excellent catch!
For the Python code on page 214, please replace the words = ... line with:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))
For the Scala code on page 214, please replace the val words = ... line with:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))
For the Python code on page 218, please replace the words = ... line with:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))
For the Scala code on page 219, please replace the val words = ... line with:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))
For the Python code on page 222, please replace the words = ... line with:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))
For the Scala code on page 222, please replace the val words = ... line with:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))
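As a quick illustration of why explode() is needed, here is a sketch on a small static DataFrame (the sample lines are made up; the book's example reads from a streaming source):
# In Python -- split() alone vs. explode(split()) on a toy DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, explode

spark = SparkSession.builder.getOrCreate()
lines = spark.createDataFrame([("hello world",), ("hello spark",)], ["value"])

# Without explode(), each row holds an array of words, so the groupBy counts arrays:
lines.select(split(col("value"), "\\s").alias("word")).show(truncate=False)

# With explode(), every word becomes its own row, giving the expected word counts:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))
words.groupBy("word").count().show()   # hello -> 2, world -> 1, spark -> 1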
|
Kamal |
Sep 02, 2021 |
Sep 30, 2022 |
|
Page "The DataFrame API" in chapter 3
1st paragraph |
The paragraph starts with "Inspired by pandas DataFrames", which contains a broken hyperlink to pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe
The updated link seems to be pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Note from the Author or Editor: Yes! The documentation page has changed as you pointed out. We are fixing this, as well as a few other broken links in the next release.
|
Alin Mindroc |
Mar 03, 2022 |
Sep 30, 2022 |
|
Page Table 5-4. Map functions
First row, first column (excluding headers) |
map_form_arrays(array<K>, array<V>): map<K, V>
Creates a map from the given pair of key/value arrays; elements in keys should not be null
"map_form_arrays" should be "map_from_arrays"
Note from the Author or Editor: Great catch!
We need to change "map_form_arrays" on Table 5-4 -> "map_from_arrays"
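For reference, a minimal usage of the correctly named function (the sample data and an existing SparkSession named spark are assumed for illustration):
# In Python -- map_from_arrays() builds a map column from a keys array and a values array
from pyspark.sql.functions import map_from_arrays
df = spark.createDataFrame([(["a", "b"], [1, 2])], ["keys", "values"])
df.select(map_from_arrays("keys", "values").alias("m")).show(truncate=False)
# -> a single row containing the map {a -> 1, b -> 2}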
|
Caitlin Daigle |
Jan 31, 2024 |
|
|
Page 30
last paragraph |
It mentions a count example using groupBy() and orderBy(), but the previous example doesn't use them at all.
Note from the Author or Editor: The last paragraph could certainly be clarified. Here is what I suggest in its place:
However, transformations such as groupBy() or orderBy() instruct Spark to perform wide transformations, where data from other partitions is read in, combined, and written to disk. If we were to sort the `filtered` DataFrame from the example above by calling .orderBy(), each partition will be locally sorted, but we need to force a shuffle of data from each of the executor’s partitions across the cluster to sort all of the records. In contrast to narrow transformations, wide transformations require output from other partitions to compute the final aggregation.
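A short sketch of the narrow/wide distinction described above (DataFrame and column names are illustrative, not from the book):
# In Python -- narrow vs. wide transformations (illustrative only)
filtered = df.where(df["count"] > 1)      # narrow: each partition is processed independently
ordered  = filtered.orderBy("count")      # wide: forces a shuffle of data across executor partitions
grouped  = df.groupBy("state").count()    # wide: rows with the same key must be combined across partitions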
|
Jongyoung Park |
Mar 03, 2021 |
Sep 30, 2022 |
|
Page 36
Example 2-1 |
The code in Examples 2-1 and 2-2 should use the sum() aggregate function instead of count(). This has already been fixed in the book's GitHub repo. In short: update the example source code to match the repo.
Places to fix:
* Example 2-1 (two usages)
* program output (p. 37)
* Example 2-2 (two usages)
Note from the Author or Editor: Excellent catch! This is a problem for pages 36-40
On page 36:
* from pyspark.sql.functions import count -> from pyspark.sql.functions import sum
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total"))
On page 37:
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total"))
On page 40:
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total")) [occurs twice]
The output for pages 37-39 should be as follows:
+-----+------+------+
|State|Color |Total |
+-----+------+------+
|CA |Yellow|100956|
|WA |Green |96486 |
|CA |Brown |95762 |
|TX |Green |95753 |
|TX |Red |95404 |
|CO |Yellow|95038 |
|NM |Red |94699 |
|OR |Orange|94514 |
|WY |Green |94339 |
|NV |Orange|93929 |
|TX |Yellow|93819 |
|CO |Green |93724 |
|CO |Brown |93692 |
|CA |Green |93505 |
|NM |Brown |93447 |
|CO |Blue |93412 |
|WA |Red |93332 |
|WA |Brown |93082 |
|WA |Yellow|92920 |
|NM |Yellow|92747 |
|NV |Brown |92478 |
|TX |Orange|92315 |
|AZ |Brown |92287 |
|AZ |Green |91882 |
|WY |Red |91768 |
|AZ |Orange|91684 |
|CA |Red |91527 |
|WA |Orange|91521 |
|NV |Yellow|91390 |
|UT |Orange|91341 |
|NV |Green |91331 |
|NM |Orange|91251 |
|NM |Green |91160 |
|WY |Blue |91002 |
|UT |Red |90995 |
|CO |Orange|90971 |
|AZ |Yellow|90946 |
|TX |Brown |90736 |
|OR |Blue |90526 |
|CA |Orange|90311 |
|OR |Red |90286 |
|NM |Blue |90150 |
|AZ |Red |90042 |
|NV |Blue |90003 |
|UT |Blue |89977 |
|AZ |Blue |89971 |
|WA |Blue |89886 |
|OR |Green |89578 |
|CO |Red |89465 |
|NV |Red |89346 |
|UT |Yellow|89264 |
|OR |Brown |89136 |
|CA |Blue |89123 |
|UT |Brown |88973 |
|TX |Blue |88466 |
|UT |Green |88392 |
|OR |Yellow|88129 |
|WY |Orange|87956 |
|WY |Yellow|87800 |
|WY |Brown |86110 |
+-----+------+------+
Total Rows = 60
+-----+------+------+
|State|Color |Total |
+-----+------+------+
|CA |Yellow|100956|
|CA |Brown |95762 |
|CA |Green |93505 |
|CA |Red |91527 |
|CA |Orange|90311 |
|CA |Blue |89123 |
+-----+------+------+
|
Herman |
Dec 22, 2021 |
Sep 30, 2022 |
Printed, PDF |
Page 64
Towards the page end |
There is no day() function in pyspark.sql.functions. There are only dayofweek(), dayofmonth(), and dayofyear().
Note from the Author or Editor: Yes, this is a minor technical mistake. The year() function exists but day() and month() should be dayofweek() and dayofmonth() respectively.
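For reference, a quick sketch of the date-part functions that do exist in pyspark.sql.functions (the DataFrame and its "date" column are assumed):
# In Python -- available date-part functions
from pyspark.sql.functions import year, dayofweek, dayofmonth, dayofyear
df.select(year("date"), dayofweek("date"), dayofmonth("date"), dayofyear("date")).show()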
|
Chengyin Eng |
Sep 18, 2020 |
Nov 20, 2020 |
Printed |
Page 74
The 6th bullet point: Tungsten/Encoders |
Double comma after the word 'Encoders'.
|
Josh Fry |
Jan 05, 2022 |
Sep 30, 2022 |
|
Page 79
SQL code example |
GROUP BY State, Color
instead of
GROUP BY State, Color, Count
Note from the Author or Editor: We made a similar mistake in the Python code as well. The updated code should be as follows:
# In Python
count_mnm_df = (mnm_df
.select("State", "Color", "Count")
.groupBy("State", "Color")
.agg(sum("Count")
.alias("Total"))
.orderBy("Total", ascending=False))
-- In SQL
SELECT State, Color, sum(Count) AS Total
FROM MNM_TABLE_NAME
GROUP BY State, Color
ORDER BY Total DESC
In the output on page 79, we also need to replace:
* count(Count#12) -> sum(Count#12) [occurs 4 times]
* partial_count(Count#12) -> partial_sum(Count#12) [occurs 1 time]
|
stefano fantini |
Nov 21, 2021 |
Sep 30, 2022 |
PDF |
Page 88
First code sample |
The first code sample contains a CASE WHEN statement that does not fully cover the range of values, e.g., it has cases for values greater than and less than 120, but not equal to 120.
It should read as follows:
spark.sql("""SELECT delay, origin, destination, CASE
WHEN delay >= 360 THEN 'Very Long Delays'
WHEN delay >= 120 AND delay < 360 THEN 'Long Delays'
WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10)
Note from the Author or Editor: Thanks for filing this technical nuance. If you read the text on page 87, we consider Very Long Delays as anything (> 6 hours), not (>= 6 hours). So for this condition, we don't need (>= 360).
But for Long Delays (2-6 hours), it makes sense to be inclusive. If we take the text on page 87, then I believe it should read as follows:
spark.sql("""SELECT delay, origin, destination, CASE
WHEN delay > 360 THEN 'Very Long Delays'
WHEN delay >= 120 AND delay <= 360 THEN 'Long Delays'
WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10)
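For readers working in the DataFrame API rather than SQL, a sketch of an equivalent of the corrected CASE expression using when()/otherwise() (the DataFrame name df is an assumption; the book's example uses SQL):
# In Python -- an illustrative DataFrame-API equivalent of the corrected CASE expression
from pyspark.sql.functions import when, col
(df.select("delay", "origin", "destination",
     when(col("delay") > 360, "Very Long Delays")
    .when((col("delay") >= 120) & (col("delay") <= 360), "Long Delays")
    .when((col("delay") >= 60) & (col("delay") < 120), "Short Delays")
    .when((col("delay") > 0) & (col("delay") < 60), "Tolerable Delays")
    .when(col("delay") == 0, "No Delays")
    .otherwise("Early")
    .alias("Flight_Delays"))
  .orderBy("origin", col("delay").desc())
  .show(10))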
|
Chris Abernethy |
Jul 29, 2020 |
Nov 20, 2020 |
|
Page 131
2nd paragraph |
The stride is equal to 900, and the ten equivalent queries are not those shown in the book. Actually, there are two queries, at the beginning and the end, that catch all values below the lowerBound and above the upperBound parameters, respectively. To confirm this, call the following on the DataFrame you read via the jdbc format: df.rdd.glom().map(len).collect(), and you will notice my point.
Note from the Author or Editor: This is indeed a mistake! The lowerBound should be set to 0, not 1000 (needs to be updated in 4 occurrences on page 131)
In the second paragraph, the queries should also read:
SELECT * FROM table WHERE partitionColumn BETWEEN 0 and 1000
SELECT * FROM table WHERE partitionColumn BETWEEN 1000 and 2000
....
SELECT * FROM table WHERE partitionColumn BETWEEN 9000 and 10000
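A rough sketch of the JDBC read that yields those ten partition queries (connection options are placeholders, not real values):
# In Python -- JDBC read with the corrected lowerBound of 0 (placeholder connection details)
df = (spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://[DBSERVER]:5432/[DATABASE]")
  .option("dbtable", "[TABLENAME]")
  .option("user", "[USERNAME]")
  .option("password", "[PASSWORD]")
  .option("partitionColumn", "partitionColumn")
  .option("lowerBound", 0)
  .option("upperBound", 10000)
  .option("numPartitions", 10)
  .load())

# Check how many rows landed in each partition:
df.rdd.glom().map(len).collect()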
|
Carlos Alva |
Jan 19, 2022 |
Sep 30, 2022 |
Printed |
Page 138
Option 2 |
The code for Option 2 is exactly the same as that for Option 1. However, the code example in the ebook version correctly shows the map function example.
Note from the Author or Editor: Thanks for catching this; we will get it fixed. Much appreciated!
|
Chengyin Eng |
Sep 18, 2020 |
Nov 20, 2020 |
|
Page 139
paragraph above Table 5-3 |
"For the full list, refer to this notebook in the Databricks documentation"
The hyperlink in this sentence is no longer an active site, and this sentence should be removed from the book.
Note from the Author or Editor: Yes! This doc link was unfortunately removed. We are removing this sentence from the book now.
|
O'Reilly Media |
Apr 15, 2021 |
Sep 30, 2022 |
Printed, PDF |
Page 145-149
Minor typos throughout |
I'm unable to respond to the Errata posted on Apr 12, 2021 as they included a URL in their post. So I'm copying my response here:
There are a few minor typos we made:
Pg 145:
* airportsna -> airports
* airports_na -> airports
Pg 146:
* airportsnaFilePath -> airportsFilePath [occurs twice]
* airportsna -> airports [occurs twice]
* airports_na -> airports [occurs twice]
Pg 149:
* airports_na -> airports
|
Brooke Wenig |
Sep 25, 2022 |
Sep 30, 2022 |
PDF |
Page 146
2nd Python code block |
github.com/databricks/LearningSparkV2/issues/67
Should use airports instead of airportsna
|
Anonymous |
Apr 12, 2021 |
Sep 30, 2022 |
Printed, PDF |
Page 150
Bottom |
Below the -- In SQL comment, the code should be replaced with the following:
SELECT origin, destination, sum(TotalDelays) as sumTotalDelays
FROM departureDelaysWindow
WHERE origin = 'SEA'
GROUP BY origin, destination
ORDER BY sumTotalDelays DESC
LIMIT 3
|
Brooke Wenig |
Sep 25, 2022 |
Sep 30, 2022 |
PDF |
Page 153
"Renaming Columns" |
Text states: "You can rename a column using the rename() method:", then shows an example using the withColumnRenamed() method. I believe the intent would be to state: "You can rename a column using the withColumnRenamed() method:"
Note from the Author or Editor: Thanks for catching this, we will be updating it. Much appreciated!
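For reference, a one-line illustration of the method the text means (the DataFrame and column names here are made up):
# In Python -- renaming a column with withColumnRenamed()
df = df.withColumnRenamed("delay", "flight_delay")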
|
Nathan Knox |
Jul 20, 2020 |
Nov 20, 2020 |