Errata for High Performance Spark


The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version Location Description Submitted By Date submitted Date corrected
Other Digital Version
22%
Figure 5-1

In this figure, a set of partitions is displayed across 3 different Spark transformations.

1) rdd1

2) rdd2 = rdd1.map(x => (x, 1))

3) rdd3 = rdd2.groupByKey

The error consists of referring to rdd3 as "rdd1 child of rdd2" to the left of the image.

(Kindle version, location 2344 of 10460)

Note from the Author or Editor:
The bottom left of Figure 5-1 needs to be updated to "rdd3 child of rdd2".
I haven't fixed it myself since it's in a figure, but hopefully the production team can fix this if/when we do an update.
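
For readers following the report, the lineage in question can be sketched as below. This is a minimal illustration, not the book's code: it assumes Spark is on the classpath and uses a local master, and the object name `LineageSketch`, the helper `dependencyKinds`, and the sample data are hypothetical. It shows that rdd3 is the child of rdd2 (through a shuffle) and rdd2 is the child of rdd1 (through a narrow dependency).

```scala
import org.apache.spark.{Dependency, NarrowDependency, ShuffleDependency}
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Classify a dependency as narrow (no shuffle) or wide (shuffle).
  private def kind(dep: Dependency[_]): String = dep match {
    case _: ShuffleDependency[_, _, _] => "wide"
    case _: NarrowDependency[_]        => "narrow"
    case _                             => "other"
  }

  // Builds the three RDDs from the report and returns how rdd2 depends
  // on rdd1 and how rdd3 depends on rdd2. The sample data is made up.
  def dependencyKinds(sc: SparkContext): (String, String) = {
    val rdd1 = sc.parallelize(Seq("a", "b", "a"), 2)
    val rdd2 = rdd1.map(x => (x, 1)) // child of rdd1
    val rdd3 = rdd2.groupByKey()     // child of rdd2
    (kind(rdd2.dependencies.head), kind(rdd3.dependencies.head))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-sketch")
    val sc = new SparkContext(conf)
    try println(dependencyKinds(sc)) // (narrow,wide)
    finally sc.stop()
  }
}
```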

Pablo Rodriguez Bertorello  Jul 02, 2017  Oct 20, 2017
PDF
Page 39
table 3-3

In table 3-3, 'gt' of last row should be 'geq'

Note from the Author or Editor:
Thank you, I've fixed this in atlas.

Jongyoung Park  Jul 28, 2017  Oct 20, 2017
PDF
Page 116
1st paragraph

In the sentence "checkpointing or off_heap persistence or checkpointing", one of the two occurrences of 'checkpointing' should be removed.

Note from the Author or Editor:
Thank you, I've fixed this in atlas.

Jongyoung Park  Aug 19, 2017  Oct 20, 2017
PDF
Page 121
2nd line in 'LRU caching'

Intead -> Instead

Note from the Author or Editor:
Thank you, I've fixed this in atlas.

Jongyoung Park  Aug 20, 2017  Oct 20, 2017
PDF
Page 130
2nd paragraph from bottom

'of of' must be 'of'

Note from the Author or Editor:
I've fixed this in atlas, thank you.

Jongyoung Park  Aug 26, 2017  Oct 20, 2017
PDF
Page 131
TIP

IMO, "an ordering an an object" should be "an ordering of an object"

Note from the Author or Editor:
Thank you, I've fixed this in atlas.

Jongyoung Park  Aug 26, 2017  Oct 20, 2017
PDF
Page 161
last paragraph

"(value, column index pairs)" should be "(value, column index) pairs".

Note from the Author or Editor:
Thank you, I've fixed this in the development copy in atlas.

Jongyoung Park  Sep 07, 2017  Oct 20, 2017
PDF
Page 187
"Installing PySpark" section

1. In the second paragraph, the last right parenthesis looks unnecessary.

2. The first 'Its' of the third paragraph must be 'It's' or 'It is'.

Jongyoung Park  Sep 18, 2017  Oct 20, 2017
Other Digital Version
2091
Example 4-4

The author says "you can prevent the shuffle [...] and persisting the RDD before the join." However, in Example 4-4, the RDD is not persisted before the join. In addition, the author does not explain the difference between persisting and not persisting. Does persisting really affect the performance of the join?

(Kindle version, location 2091 of 10460)

Note from the Author or Editor:
Thanks! We've already changed the text for this in atlas and it should be included in the next update.

Yong-Siang Shih  Jul 15, 2017  Oct 20, 2017
Other Digital Version
2141
Example 4-5

Although a broadcast variable of smallRDDLocal is created, the original smallRDDLocal is used. This seems like a mistake, as the official documentation points out:

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once.

Note from the Author or Editor:
Thank you, that's correct. I've updated the example on github and it will show up in the updated e-book whenever we next get a chance for a refresh :)
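
A minimal sketch of the corrected pattern follows. It is not the book's exact Example 4-5: it assumes Spark is on the classpath, and the names `BroadcastJoinSketch`, `broadcastHashJoin`, `smallLocal`, and `smallBcast` are hypothetical. The point is that the closure reads the map through the broadcast handle's `value` and never references the local map directly.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  // Map-side inner join: the small side is collected to the driver,
  // broadcast once, and looked up for each record of the big side.
  def broadcastHashJoin[K: ClassTag, V1: ClassTag, V2: ClassTag](
      bigRDD: RDD[(K, V1)],
      smallRDD: RDD[(K, V2)]): RDD[(K, (V1, V2))] = {
    val smallLocal = smallRDD.collectAsMap()
    val smallBcast = bigRDD.sparkContext.broadcast(smallLocal)
    bigRDD.mapPartitions(iter =>
      iter.flatMap { case (k, v1) =>
        // Use the broadcast handle here, not smallLocal, so the map is
        // not serialized into every task closure as well.
        smallBcast.value.get(k).map(v2 => (k, (v1, v2))).iterator
      }, preservesPartitioning = true)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("bcast-sketch")
    val sc = new SparkContext(conf)
    try {
      val big = sc.parallelize(Seq((1, "x"), (2, "y"), (3, "z")), 2)
      val small = sc.parallelize(Seq((1, 10), (3, 30)), 1)
      broadcastHashJoin(big, small).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Keys missing from the small side are dropped, so this sketch behaves like an inner join.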

Yong-Siang Shih  Jul 15, 2017  Oct 20, 2017
Other Digital Version
2912
TIP of Example 5-14

The tip says: "calling distinct will cause a shuffle if the partitioner is not known." However, since the distinct function is implemented by

map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

Even if a partitioner is known, the map operation does not preserve the partitioner, and therefore a shuffle might be unavoidable?

(Kindle version, location 2912 of 10460)

Note from the Author or Editor:
This is true: a shuffle will occur in either case. However, if the partitioner is known in advance, the reduce step will be able to remove all duplicates prior to the shuffle. I've clarified the text for this in our repo (although it may be a while before this makes it into the Kindle version).
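
The reader's observation that `map` discards the partitioner can be checked with a short sketch. It assumes Spark is on the classpath; the object name `PartitionerSketch`, the helper `partitionerSurvives`, and the sample data are hypothetical. It only inspects the `partitioner` field and does not depend on how a particular Spark version implements `distinct`, which may differ from the snippet quoted in the report.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerSketch {
  // Returns whether a partitioner is defined on: the explicitly
  // partitioned RDD, the result of map, and the result of mapValues.
  def partitionerSurvives(sc: SparkContext): (Boolean, Boolean, Boolean) = {
    val partitioned = sc
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 1)), 2)
      .partitionBy(new HashPartitioner(2))
    // map may change the key, so Spark drops the partitioner.
    val afterMap = partitioned.map(x => (x, null))
    // mapValues cannot change the key, so the partitioner is kept.
    val afterMapValues = partitioned.mapValues(_ + 1)
    (partitioned.partitioner.isDefined,
      afterMap.partitioner.isDefined,
      afterMapValues.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-sketch")
    val sc = new SparkContext(conf)
    try println(partitionerSurvives(sc)) // (true,false,true)
    finally sc.stop()
  }
}
```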

Yong-Siang Shih  Jul 15, 2017  Oct 20, 2017
Other Digital Version
3209
Example 5-23

The author claims that by persisting rddA, the "sort stage" will occur only once. This is incorrect. In fact, the "sorted" RDD should be persisted instead. Also, it should be persisted before the count action rather than after it.
(Kindle version, location 3209 of 10460)

Note from the Author or Editor:
The persistence is indeed incorrect in Example 5-23; it should be on sorted before the count is called. I've updated this in the repo, but it may take a while before the update makes it through.
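
The corrected ordering can be sketched as follows. This is a minimal illustration rather than the updated Example 5-23: it assumes Spark is on the classpath, and the object name `PersistSortedSketch`, the helper `countAndTake`, the storage level, and the sample data are hypothetical. The sorted RDD is marked for persistence before the first action, so the second action can read the cached partitions.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object PersistSortedSketch {
  // Sorts a small pair RDD, persists the sorted result before any
  // action runs on it, then runs two actions against it.
  def countAndTake(sc: SparkContext): (Long, Seq[(Int, String)]) = {
    val rddA = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")), 2)
    val sorted = rddA.sortByKey()
    // persist is lazy: it only marks the RDD. The first action below
    // materializes the cache, and the second action reuses it.
    sorted.persist(StorageLevel.MEMORY_ONLY)
    val count = sorted.count()
    val firstTwo = sorted.take(2).toSeq
    sorted.unpersist()
    (count, firstTwo)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = new SparkContext(conf)
    try println(countAndTake(sc)) // (3,List((1,a), (2,b)))
    finally sc.stop()
  }
}
```

Persisting rddA instead would cache only the unsorted input, so it would not by itself keep the sorted partitions available for the second action.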

Yong-Siang Shih  Jul 15, 2017  Oct 20, 2017