Errata for Learning Spark, Second Edition

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version | Location | Description | Submitted By | Date submitted | Date corrected
Figure 3-5

In the online O'Reilly version, in Figure 3-5, in the "Physical Plan" part, I think the "filter" should be on events, not on users, because the query is filter(events("date") > "2015-01-01").

Note from the Author or Editor:
This is a very astute reader! The filter should be on the events, not the users.

The "Physical Plan" portion of the diagram needs to be redrawn. The "filter" box should be on the left, above "scan (events)" instead of on the right above "scan (users)".

Anonymous  Nov 21, 2020  Sep 30, 2022
Page ch8
Chapter 8 - Section: "Five Steps to define a streaming query" - "Putting it all together - Python code"

In Chapter 8, section "Putting it all together", the code listed below does not produce the expected word-count output:
========
words = lines.select(split(col("value"), "\\s").alias("word"))
counts = words.groupBy("word").count()
=========
The above code results in whole array values being counted. The first line needs to use the explode() function to have the desired effect.

Note from the Author or Editor:
Excellent catch!

For the Python code on 214, please update words = ... with this line:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))

For the Scala code on 214, please update val words =... with this line:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))

For the Python code on 218, please update words =... with this line:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))

For the Scala code on 219, please update val words = ... with this line:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))

For the Python code on 222, please update words = ... with this line:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))

For the Scala code on 222, please update val words =... with this line:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))
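
For context, here is a minimal end-to-end sketch of the corrected streaming word count (the socket source on localhost:9999 and the console sink are placeholders chosen for illustration):

# In Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

lines = (spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

# explode() turns the array produced by split() into one row per word,
# so groupBy("word") counts individual words instead of whole arrays
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
    .format("console")
    .outputMode("complete")
    .start())
query.awaitTermination()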

Kamal  Sep 02, 2021  Sep 30, 2022
Page "The DataFrame API" in chapter 3
1st paragraph

The paragraph starts with "Inspired by pandas DataFrames", which contains a broken hyperlink to pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe

The updated link seems to be pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

Note from the Author or Editor:
Yes! The documentation page has changed as you pointed out. We are fixing this, as well as a few other broken links in the next release.

Alin Mindroc  Mar 03, 2022  Sep 30, 2022
Page Table 5-4. Map functions
First row, first column (excluding headers)

map_form_arrays(array<K>, array<V>): map<K, V>
Creates a map from the given pair of key/value arrays; elements in keys should not be null


map_form_arrays
should be
map_from_arrays

Note from the Author or Editor:
Great catch!

We need to change "map_form_arrays" on Table 5-4 -> "map_from_arrays"
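
A quick illustration of the corrected function name (the keys/values columns below are made up for the example):

# In Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import map_from_arrays

spark = SparkSession.builder.appName("MapFromArrays").getOrCreate()
df = spark.createDataFrame([(["a", "b"], [1, 2])], ["keys", "values"])

# Builds a map column with entries a -> 1 and b -> 2 from the pair of arrays
df.select(map_from_arrays("keys", "values").alias("kv")).show(truncate=False)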

Caitlin Daigle  Jan 31, 2024 
Page 30
last paragraph

It mentions a count example using groupBy() and orderBy(), but the previous example doesn't use them at all.

Note from the Author or Editor:
The last paragraph could certainly be clarified. Here is what I suggest in its place:

However, transformations such as groupBy() or orderBy() instruct Spark to perform wide transformations, where data from other partitions is read in, combined, and written to disk. If we were to sort the `filtered` DataFrame from the example above by calling .orderBy(), each partition will be locally sorted, but we need to force a shuffle of data from each of the executor’s partitions across the cluster to sort all of the records. In contrast to narrow transformations, wide transformations require output from other partitions to compute the final aggregation.
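
To make the distinction concrete, here is a small sketch along the lines of the chapter's example (the README.md path and the column name are assumptions):

# In Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NarrowVsWide").getOrCreate()
strings = spark.read.text("README.md")

# filter() is a narrow transformation: each output partition can be computed
# from a single input partition, so no data moves between executors
filtered = strings.filter(strings.value.contains("Spark"))

# orderBy() is a wide transformation: producing a global sort order forces a
# shuffle (Exchange) of data across the executors' partitions
ordered = filtered.orderBy("value")
ordered.explain()  # the plan includes an Exchange before the Sort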

Jongyoung Park  Mar 03, 2021  Sep 30, 2022
Page 36
Example 2-1

The code in Examples 2-1 and 2-2 should use the sum() aggregate function instead of count(). This has already been fixed in the book's GitHub repo. In short: update the example source code to match the repo.
Places to fix:
— Example 2-1 (two usages)
— program output (p. 37)
— Example 2-2 (two usages)

Note from the Author or Editor:
Excellent catch! This is a problem for pages 36-40.

On page 36:
* from pyspark.sql.functions import count -> from pyspark.sql.functions import sum
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total"))

On page 37:
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total"))

On page 40:
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total")) [occurs twice]


The output for pages 37-39 should be as follows:
+-----+------+------+
|State|Color |Total |
+-----+------+------+
|CA |Yellow|100956|
|WA |Green |96486 |
|CA |Brown |95762 |
|TX |Green |95753 |
|TX |Red |95404 |
|CO |Yellow|95038 |
|NM |Red |94699 |
|OR |Orange|94514 |
|WY |Green |94339 |
|NV |Orange|93929 |
|TX |Yellow|93819 |
|CO |Green |93724 |
|CO |Brown |93692 |
|CA |Green |93505 |
|NM |Brown |93447 |
|CO |Blue |93412 |
|WA |Red |93332 |
|WA |Brown |93082 |
|WA |Yellow|92920 |
|NM |Yellow|92747 |
|NV |Brown |92478 |
|TX |Orange|92315 |
|AZ |Brown |92287 |
|AZ |Green |91882 |
|WY |Red |91768 |
|AZ |Orange|91684 |
|CA |Red |91527 |
|WA |Orange|91521 |
|NV |Yellow|91390 |
|UT |Orange|91341 |
|NV |Green |91331 |
|NM |Orange|91251 |
|NM |Green |91160 |
|WY |Blue |91002 |
|UT |Red |90995 |
|CO |Orange|90971 |
|AZ |Yellow|90946 |
|TX |Brown |90736 |
|OR |Blue |90526 |
|CA |Orange|90311 |
|OR |Red |90286 |
|NM |Blue |90150 |
|AZ |Red |90042 |
|NV |Blue |90003 |
|UT |Blue |89977 |
|AZ |Blue |89971 |
|WA |Blue |89886 |
|OR |Green |89578 |
|CO |Red |89465 |
|NV |Red |89346 |
|UT |Yellow|89264 |
|OR |Brown |89136 |
|CA |Blue |89123 |
|UT |Brown |88973 |
|TX |Blue |88466 |
|UT |Green |88392 |
|OR |Yellow|88129 |
|WY |Orange|87956 |
|WY |Yellow|87800 |
|WY |Brown |86110 |
+-----+------+------+

Total Rows = 60

+-----+------+------+
|State|Color |Total |
+-----+------+------+
|CA |Yellow|100956|
|CA |Brown |95762 |
|CA |Green |93505 |
|CA |Red |91527 |
|CA |Orange|90311 |
|CA |Blue |89123 |
+-----+------+------+
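
For reference, a minimal sketch of the corrected aggregation (the CSV path is a placeholder; the State, Color, and Count columns follow the chapter's M&M dataset):

# In Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName("MnMCount").getOrCreate()
mnm_df = spark.read.csv("mnm_dataset.csv", header=True, inferSchema=True)

count_mnm_df = (mnm_df
    .select("State", "Color", "Count")
    .groupBy("State", "Color")
    # sum() adds up the Count values in each group; count() would only
    # tally the number of rows, which is why the original totals were wrong
    .agg(sum("Count").alias("Total"))
    .orderBy("Total", ascending=False))

count_mnm_df.show(n=60, truncate=False)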

Herman  Dec 22, 2021  Sep 30, 2022
Printed, PDF
Page 64
Towards the page end

There is no day() function in pyspark.sql.functions. There are only
dayofweek(), dayofmonth(), and dayofyear().

Note from the Author or Editor:
Yes, this is a minor technical mistake. The year() function exists, but day() and month() should be dayofweek() and dayofmonth(), respectively.
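
A short sketch of the date functions that do exist in pyspark.sql.functions (the sample date is arbitrary):

# In Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, year, dayofmonth, dayofweek, dayofyear

spark = SparkSession.builder.appName("DateFunctions").getOrCreate()
df = (spark.createDataFrame([("2021-04-15",)], ["date_str"])
    .select(to_date("date_str").alias("d")))

# year(), dayofmonth(), dayofweek(), and dayofyear() are all available;
# a day() function is not part of this module in the Spark 3.0-era API
df.select(year("d"), dayofmonth("d"), dayofweek("d"), dayofyear("d")).show()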

Chengyin Eng  Sep 18, 2020  Nov 20, 2020
Printed
Page 74
The 6th bullet point: Tungsten/Encoders

Double comma after the word 'Encoders'.

Josh Fry  Jan 05, 2022  Sep 30, 2022
Page 79
SQL code example

GROUP BY State, Color
instead of
GROUP BY State, Color, Count

Note from the Author or Editor:
We made a similar mistake in the Python code as well. The updated code should be as follows:

# In Python
count_mnm_df = (mnm_df
.select("State", "Color", "Count")
.groupBy("State", "Color")
.agg(sum("Count")
.alias("Total"))
.orderBy("Total", ascending=False))

-- In SQL
SELECT State, Color, sum(Count) AS Total
FROM MNM_TABLE_NAME
GROUP BY State, Color
ORDER BY Total DESC

In the output on pg 79, we also need to replace:
* count(Count#12) -> sum(Count#12) [occurs 4 times]
* partial_count(Count#12) -> partial_sum(Count#12) [occurs 1 time]

stefano fantini  Nov 21, 2021  Sep 30, 2022
PDF
Page 88
First code sample

The first code sample contains a CASE WHEN statement that does not fully cover the range of values, e.g., it has cases for values greater than and less than 120, but not equal to 120.

It should read as follows:

spark.sql("""SELECT delay, origin, destination, CASE
WHEN delay >= 360 THEN 'Very Long Delays'
WHEN delay >= 120 AND delay < 360 THEN 'Long Delays'
WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10)

Note from the Author or Editor:
Thanks for filing this technical nuance. If you read the text on page 87, we consider Very Long Delays as anything (> 6 hours), not (>= 6 hours). So for this condition, we don't need (>= 360).

But for Long Delays (2-6 hours), it makes sense to be inclusive. If we take the text on page 87, then I believe it should read as follows:

spark.sql("""SELECT delay, origin, destination, CASE
WHEN delay > 360 THEN 'Very Long Delays'
WHEN delay >= 120 AND delay <= 360 THEN 'Long Delays'
WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10)

Chris Abernethy  Jul 29, 2020  Nov 20, 2020
Page 131
2nd paragraph

The stride is equal to 900, and the ten equivalent queries are not the ones shown in the book. Actually, there are two queries, at the beginning and the end, that catch all values below the lowerBound and above the upperBound parameters, respectively. If you want to confirm this, call the following on the DataFrame you read using the jdbc format: df.rdd.glom().map(len).collect(), and you will see the point.

Note from the Author or Editor:
This is indeed a mistake! The lowerBound should be set to 0, not 1000 (this needs to be updated in 4 occurrences on page 131).

The queries in the second paragraph should also read:
SELECT * FROM table WHERE partitionColumn BETWEEN 0 and 1000
SELECT * FROM table WHERE partitionColumn BETWEEN 1000 and 2000
....
SELECT * FROM table WHERE partitionColumn BETWEEN 9000 and 10000
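
A minimal sketch of the partitioned JDBC read under discussion (the URL, table, credentials, and column name are placeholders, not the book's values):

# In Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JdbcPartitioning").getOrCreate()

df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "table")
    .option("user", "username")
    .option("password", "password")
    .option("partitionColumn", "partitionColumn")
    .option("lowerBound", 0)      # lowerBound = 0 (not 1000) gives a stride of
    .option("upperBound", 10000)  # (10000 - 0) / 10 = 1000, matching the query
    .option("numPartitions", 10)  # boundaries listed above
    .load())

# Inspect partition sizes to confirm how the rows were split up
print(df.rdd.glom().map(len).collect())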

Carlos Alva  Jan 19, 2022  Sep 30, 2022
Printed
Page 138
Option 2

The code for Option 2 is exactly the same as the code for Option 1, but the code example in the ebook version correctly shows the map function example.

Note from the Author or Editor:
Thanks for catching this; we will get it fixed. Much appreciated!

Chengyin Eng  Sep 18, 2020  Nov 20, 2020
Page 139
paragraph above Table 5-3

"For the full list, refer to this notebook in the Databricks documentation"

The hyperlink in this sentence no longer points to an active site, and the sentence should be removed from the book.

Note from the Author or Editor:
Yes! This doc link was unfortunately removed. We are removing this sentence from the book now.

O'Reilly Media
 
Apr 15, 2021  Sep 30, 2022
Printed, PDF
Page 145-149
Minor typos throughout

I'm unable to respond to the erratum posted on Apr 12, 2021, as it included a URL in the post, so I'm copying my response here:

There are a few minor typos we made:

Pg 145:
* airportsna -> airports
* airports_na -> airports

Pg 146:
* airportsnaFilePath -> airportsFilePath [occurs twice]
* airportsna -> airports [occurs twice]
* airports_na -> airports [occurs twice]

Pg 149:
* airports_na -> airports

Brooke Wenig
 
Sep 25, 2022  Sep 30, 2022
PDF
Page 146
2nd Python code block

github.com/databricks/LearningSparkV2/issues/67

Should use airports instead of airportsna

Anonymous  Apr 12, 2021  Sep 30, 2022
Printed, PDF
Page 150
Bottom

Below the -- In SQL, the code should be replaced with the following:

SELECT origin, destination, sum(TotalDelays) as sumTotalDelays
FROM departureDelaysWindow
WHERE origin = 'SEA'
GROUP BY origin, destination
ORDER BY sumTotalDelays DESC
LIMIT 3

Brooke Wenig
 
Sep 25, 2022  Sep 30, 2022
PDF
Page 153
"Renaming Columns"

Text states: "You can rename a column using the rename() method:", then shows an example using the withColumnRenamed() method. I believe the intent would be to state: "You can rename a column using the withColumnRenamed() method:"

Note from the Author or Editor:
Thanks for catching this, we will be updating it. Much appreciated!
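
A quick example of the method the text actually demonstrates (the DataFrame and column names here are invented for illustration):

# In Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RenameColumn").getOrCreate()
df = spark.createDataFrame([(10, "SEA")], ["delay", "origin"])

# withColumnRenamed() returns a new DataFrame with the column renamed
df.withColumnRenamed("delay", "flight_delay").printSchema()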

Nathan Knox  Jul 20, 2020  Nov 20, 2020