Errata for Learning Spark

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Color key: Serious technical mistake | Minor technical mistake | Language or formatting error | Typo | Question | Note | Update

Version | Location | Description | Submitted by | Date submitted | Date corrected
Figure 3-5

In the online O'Reilly version, in Figure 3-5, in the "Physical Plan" part, I think the "filter" should be on the events, not on the users, because the query is filter(events("date") > "2015-01-01").

Note from the Author or Editor:
This is a very astute reader! The filter should be on the events, not the users.

The "Physical Plan" portion of the diagram needs to be redrawn. The "filter" box should be on the left, above "scan (events)" instead of on the right above "scan (users)".

Anonymous  Nov 21, 2020  Sep 30, 2022
Page ch8
Chapter 8, Section "Five Steps to Define a Streaming Query", "Putting it all together" (Python code)

In Chapter 8, section "Putting it all together", the code listed below does not produce the expected word-count output:
========
words = lines.select(split(col("value"), "\\s").alias("word"))
counts = words.groupBy("word").count()
=========
The above code counts whole arrays rather than individual words, because split() returns an array. The first line needs to use the explode() function to get the desired effect.

Note from the Author or Editor:
Excellent catch!

For the Python code on page 214, please update words = ... with this line:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))

For the Scala code on page 214, please update val words = ... with this line:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))

For the Python code on page 218, please update words = ... with this line:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))

For the Scala code on page 219, please update val words = ... with this line:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))

For the Python code on page 222, please update words = ... with this line:
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))

For the Scala code on page 222, please update val words = ... with this line:
val words = lines.select(explode(split(col("value"), "\\s")).as("word"))
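
For completeness, here is a minimal end-to-end sketch of the corrected streaming word count; the socket host/port, checkpoint path, and console sink are placeholders for whatever source and sink the reader is using.

# In Python -- corrected word count, assuming a socket source on localhost:9999
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

lines = (spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load())

# explode() turns the array returned by split() into one row per word
words = lines.select(explode(split(col("value"), "\\s")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
  .format("console")
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/streaming_wordcount_checkpoint")  # placeholder path
  .start())
query.awaitTermination()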

Kamal  Sep 02, 2021  Sep 30, 2022
Page "The DataFrame API" in chapter 3
1st paragraph

The paragraph starts with "Inspired by pandas DataFrames", which contains a broken hyperlink to pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe

The updated link seems to be pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

Note from the Author or Editor:
Yes! The documentation page has changed as you pointed out. We are fixing this, as well as a few other broken links in the next release.

Alin Mindroc  Mar 03, 2022  Sep 30, 2022
Page Table 5-4. Map functions
First row, first column (excluding headers)

map_form_arrays(array<K>, array<V>): map<K, V>
Creates a map from the given pair of key/value arrays; elements in keys should not be null


map_form_arrays should be map_from_arrays

Note from the Author or Editor:
Great catch!

We need to change "map_form_arrays" on Table 5-4 -> "map_from_arrays"
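
For illustration, a quick sketch of the correctly spelled function; the literal keys and values here are hypothetical.

# In Python -- map_from_arrays with placeholder literals
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, lit, map_from_arrays

spark = SparkSession.builder.appName("MapFromArraysSketch").getOrCreate()

(spark.range(1)
  .select(map_from_arrays(array(lit("a"), lit("b")),
                          array(lit(1), lit(2))).alias("m"))
  .show(truncate=False))  # one row containing the map {a -> 1, b -> 2}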

Caitlin Daigle  Jan 31, 2024 
Page 30
last paragraph

It mentions a count example using groupBy() and orderBy(), but the previous example doesn't use them at all.

Note from the Author or Editor:
The last paragraph could certainly be clarified. Here is what I suggest in its place:

However, transformations such as groupBy() or orderBy() instruct Spark to perform wide transformations, where data from other partitions is read in, combined, and written to disk. If we were to sort the `filtered` DataFrame from the example above by calling .orderBy(), each partition will be locally sorted, but we need to force a shuffle of data from each of the executor’s partitions across the cluster to sort all of the records. In contrast to narrow transformations, wide transformations require output from other partitions to compute the final aggregation.
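
To make the distinction concrete, here is a small sketch under assumed data; the column names and values are hypothetical, not the book's example.

# In Python -- narrow vs. wide transformations on placeholder data
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WideTransformSketch").getOrCreate()
df = spark.createDataFrame([(1, "web"), (2, "mobile"), (3, "web")], ["id", "channel"])

filtered = df.filter(df["id"] > 1)        # narrow: each partition is filtered independently
sorted_df = filtered.orderBy("id")        # wide: forces a shuffle across partitions
grouped = df.groupBy("channel").count()   # wide: rows are combined across partitions

sorted_df.explain()  # the plan includes an Exchange (shuffle) introduced by orderBy()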

Jongyoung Park  Mar 03, 2021  Sep 30, 2022
Page 36
Example 2-1

The code in Examples 2-1 and 2-2 should use the sum() aggregate function instead of count(). This has already been fixed in the book's GitHub repo. In short: update the example source code to match the repo.
Places to fix:
* Example 2-1 (two usages)
* program output (p. 37)
* Example 2-2 (two usages)

Note from the Author or Editor:
Excellent catch! This is a problem for pages 36-40

On page 36:
* from pyspark.sql.functions import count -> from pyspark.sql.functions import sum
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total"))

On page 37:
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total"))

On page 40:
* .agg(count("Count").alias("Total")) -> .agg(sum("Count").alias("Total")) [occurs twice]
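
For reference, here is the corrected aggregation assembled in one place; this sketch assumes mnm_df has already been loaded from the chapter's M&M CSV file as in Example 2-1.

# In Python -- corrected aggregation, assuming mnm_df is already loaded
from pyspark.sql.functions import sum

count_mnm_df = (mnm_df
  .select("State", "Color", "Count")
  .groupBy("State", "Color")
  .agg(sum("Count").alias("Total"))
  .orderBy("Total", ascending=False))
count_mnm_df.show(n=60, truncate=False)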


The output for pages 37-39 should be as follows:
+-----+------+------+
|State|Color |Total |
+-----+------+------+
|CA |Yellow|100956|
|WA |Green |96486 |
|CA |Brown |95762 |
|TX |Green |95753 |
|TX |Red |95404 |
|CO |Yellow|95038 |
|NM |Red |94699 |
|OR |Orange|94514 |
|WY |Green |94339 |
|NV |Orange|93929 |
|TX |Yellow|93819 |
|CO |Green |93724 |
|CO |Brown |93692 |
|CA |Green |93505 |
|NM |Brown |93447 |
|CO |Blue |93412 |
|WA |Red |93332 |
|WA |Brown |93082 |
|WA |Yellow|92920 |
|NM |Yellow|92747 |
|NV |Brown |92478 |
|TX |Orange|92315 |
|AZ |Brown |92287 |
|AZ |Green |91882 |
|WY |Red |91768 |
|AZ |Orange|91684 |
|CA |Red |91527 |
|WA |Orange|91521 |
|NV |Yellow|91390 |
|UT |Orange|91341 |
|NV |Green |91331 |
|NM |Orange|91251 |
|NM |Green |91160 |
|WY |Blue |91002 |
|UT |Red |90995 |
|CO |Orange|90971 |
|AZ |Yellow|90946 |
|TX |Brown |90736 |
|OR |Blue |90526 |
|CA |Orange|90311 |
|OR |Red |90286 |
|NM |Blue |90150 |
|AZ |Red |90042 |
|NV |Blue |90003 |
|UT |Blue |89977 |
|AZ |Blue |89971 |
|WA |Blue |89886 |
|OR |Green |89578 |
|CO |Red |89465 |
|NV |Red |89346 |
|UT |Yellow|89264 |
|OR |Brown |89136 |
|CA |Blue |89123 |
|UT |Brown |88973 |
|TX |Blue |88466 |
|UT |Green |88392 |
|OR |Yellow|88129 |
|WY |Orange|87956 |
|WY |Yellow|87800 |
|WY |Brown |86110 |
+-----+------+------+

Total Rows = 60

+-----+------+------+
|State|Color |Total |
+-----+------+------+
|CA |Yellow|100956|
|CA |Brown |95762 |
|CA |Green |93505 |
|CA |Red |91527 |
|CA |Orange|90311 |
|CA |Blue |89123 |
+-----+------+------+

Herman  Dec 22, 2021  Sep 30, 2022
Printed, PDF
Page 64
Towards the page end

There is no day() function in pyspark.sql.functions; there are only dayofweek(), dayofmonth(), and dayofyear().

Note from the Author or Editor:
Yes, this is a minor technical mistake. The year() function exists but day() and month() should be dayofweek() and dayofmonth() respectively.
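
As a quick illustration, a sketch using the functions that do exist; the date value and column name are hypothetical.

# In Python -- day-related functions that exist in pyspark.sql.functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import dayofmonth, dayofweek, dayofyear, to_date, year

spark = SparkSession.builder.appName("DateFuncsSketch").getOrCreate()
df = spark.createDataFrame([("2020-09-18",)], ["raw"]).select(to_date("raw").alias("d"))
df.select(year("d"), dayofweek("d"), dayofmonth("d"), dayofyear("d")).show()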

Chengyin Eng  Sep 18, 2020  Nov 20, 2020
Printed
Page 74
The 6th bullet point: Tungsten/Encoders

Double comma after the word 'Encoders'.

Josh Fry  Jan 05, 2022  Sep 30, 2022
Page 79
SQL code example

It should be:
GROUP BY State, Color
instead of:
GROUP BY State, Color, Count

Note from the Author or Editor:
We made a similar mistake in the Python code as well. The updated code should be as follows:

# In Python
count_mnm_df = (mnm_df
  .select("State", "Color", "Count")
  .groupBy("State", "Color")
  .agg(sum("Count")
    .alias("Total"))
  .orderBy("Total", ascending=False))

-- In SQL
SELECT State, Color, sum(Count) AS Total
FROM MNM_TABLE_NAME
GROUP BY State, Color
ORDER BY Total DESC

In the output on pg 79, we also need to replace:
* count(Count#12) -> sum(Count#12) [occurs 4 times]
* partial_count(Count#12) -> partial_sum(Count#12) [occurs 1 time]

stefano fantini  Nov 21, 2021  Sep 30, 2022
PDF
Page 88
First code sample

The first code sample contains a CASE WHEN statement that does not fully cover the range of values, e.g., it has cases for values greater than and less than 120, but not equal to 120.

It should read as follows:

spark.sql("""SELECT delay, origin, destination, CASE
WHEN delay >= 360 THEN 'Very Long Delays'
WHEN delay >= 120 AND delay < 360 THEN 'Long Delays'
WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10)

Note from the Author or Editor:
Thanks for filing this technical nuance. If you read the text on page 87, we consider Very Long Delays as anything over 6 hours (> 360), not 6 hours or more (>= 360). So for this condition, we don't need (>= 360).

But for Long Delays (2-6 hours), it makes sense to be inclusive. Following the text on page 87, I believe it should read as follows:

spark.sql("""SELECT delay, origin, destination, CASE
WHEN delay > 360 THEN 'Very Long Delays'
WHEN delay >= 120 AND delay <= 360 THEN 'Long Delays'
WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10)

Chris Abernethy  Jul 29, 2020  Nov 20, 2020
Page 131
2nd paragraph

The stride is equal to 900, and the ten equivalent queries are not the ones shown in the book. Actually, there are two queries, at the beginning and the end, that catch all values below the lowerBound and above the upperBound parameters, respectively. If you want to confirm this, call the following methods on the DataFrame you read using the jdbc format: df.rdd.glom().map(len).collect(), and you will notice my point.

Note from the Author or Editor:
This is indeed a mistake! The lowerBound should be set to 0, not 1000 (needs to be updated in 4 occurrences on page 131)

In the second paragraph, the queries should also read:
SELECT * FROM table WHERE partitionColumn BETWEEN 0 and 1000
SELECT * FROM table WHERE partitionColumn BETWEEN 1000 and 2000
....
SELECT * FROM table WHERE partitionColumn BETWEEN 9000 and 10000
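
For reference, a minimal sketch of the partitioned JDBC read with the corrected lowerBound; the connection URL, table, column, and credentials are placeholders, and an active SparkSession named spark is assumed.

# In Python -- partitioned JDBC read with lowerBound = 0 (all option values are placeholders)
df = (spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbserver:5432/mydb")
  .option("dbtable", "table")
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "partitionColumn")
  .option("lowerBound", 0)
  .option("upperBound", 10000)
  .option("numPartitions", 10)
  .load())

# Inspect how many rows land in each of the 10 partitions
print(df.rdd.glom().map(len).collect())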

Carlos Alva  Jan 19, 2022  Sep 30, 2022
Printed
Page 138
Option 2

The code for Option 2 is exactly the same as the code for Option 1, but the code example in the ebook version correctly shows the map function example.

Note from the Author or Editor:
Thanks for catching this, we will get this fixed. Much appreciated!

Chengyin Eng  Sep 18, 2020  Nov 20, 2020
Page 139
paragraph above Table 5-3

"For the full list, refer to this notebook in the Databricks documentation"

The hyperlink in this sentence is no longer an active site, and this sentence should be removed from the book.

Note from the Author or Editor:
Yes! This doc link was unfortunately removed. We are removing this sentence from the book now.

O'Reilly Media  Apr 15, 2021  Sep 30, 2022
Printed, PDF
Page 145-149
Minor typos throughout

I'm unable to respond to the erratum posted on Apr 12, 2021, as it included a URL in the post, so I'm copying my response here:

There are a few minor typos we made:

Pg 145:
* airportsna -> airports
* airports_na -> airports

Pg 146:
* airportsnaFilePath -> airportsFilePath [occurs twice]
* airportsna -> airports [occurs twice]
* airports_na -> airports [occurs twice]

Pg 149:
* airports_na -> airports

Brooke Wenig  Sep 25, 2022  Sep 30, 2022
PDF
Page 146
2nd Python code block

github.com/databricks/LearningSparkV2/issues/67

Should use airports instead of airportsna

Anonymous  Apr 12, 2021  Sep 30, 2022
Printed, PDF
Page 150
Bottom

Below the -- In SQL, the code should be replaced with the following:

SELECT origin, destination, sum(TotalDelays) as sumTotalDelays
FROM departureDelaysWindow
WHERE origin = 'SEA'
GROUP BY origin, destination
ORDER BY sumTotalDelays DESC
LIMIT 3

Brooke Wenig  Sep 25, 2022  Sep 30, 2022
PDF
Page 153
"Renaming Columns"

Text states: "You can rename a column using the rename() method:", then shows an example using the withColumnRenamed() method. I believe the intent would be to state: "You can rename a column using the withColumnRenamed() method:"

Note from the Author or Editor:
Thanks for catching this, we will be updating it. Much appreciated!
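
A quick sketch of the corrected wording in action; the DataFrame and column names are hypothetical.

# In Python -- renaming a column with withColumnRenamed()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RenameSketch").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "val"])
renamed = df.withColumnRenamed("val", "value")
renamed.printSchema()  # the schema now shows "value" instead of "val"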

Nathan Knox  Jul 20, 2020  Nov 20, 2020