Errata for Advanced Analytics with PySpark

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
O'Reilly learning platform Page Chapter 4, Categorical Features Revisited
First block of code

The UDF defined in the code block returns StringType(). This raises an error when the unencoded training set is transformed by the VectorAssembler in the pipeline, because VectorAssembler does not accept string columns. The UDF should be modified to return IntegerType() instead, i.e.
unhot_udf = udf(lambda v: v.toArray().tolist().index(1), IntegerType())
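As a rough illustration of why an integer return type fits this lambda, here is a minimal plain-Python sketch. It assumes the one-hot vector has already been converted to a list of floats (a stand-in for `v.toArray().tolist()`); the `unhot` helper and the sample vector are hypothetical and do not come from the book.

```python
# Plain-Python stand-in for `v.toArray().tolist()` on a one-hot vector.
# The sample values are made up for illustration.
one_hot = [0.0, 0.0, 1.0, 0.0]


def unhot(values):
    """Return the position of the first 1 in a one-hot list."""
    # list.index compares by equality, so the int 1 matches the float 1.0.
    return values.index(1)


category_index = unhot(one_hot)

# The result is a Python int, which is what IntegerType() expects.
print(category_index, type(category_index).__name__)
```

The lambda's result is already an integer position, so declaring the UDF with IntegerType() keeps the column numeric.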

Philipp Spengler  Feb 21, 2024 
Printed Page 44
Building a first model

The book includes possibly redundant code:
train_data = train_data = user_artist_df.join(broadcast(....

The chained assignment train_data = train_data = user_artist_df.join... could be simplified to train_data = user_artist_df.join...
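To show that the doubled assignment is harmless but redundant, the sketch below uses a hypothetical `build_training_data` function in place of the elided join expression. In Python, a chained assignment evaluates the right-hand side once and binds the result to each target, so both forms produce the same value.

```python
# Hypothetical stand-in for the elided join expression in the book.
calls = []


def build_training_data():
    calls.append("called")
    return ["row-1", "row-2"]


# Chained form, as printed: the right-hand side is evaluated once
# and the result is bound to the same name twice.
train_data = train_data = build_training_data()
chained_result = train_data

# Simplified form: one evaluation, one binding.
train_data = build_training_data()
simple_result = train_data

print(chained_result == simple_result)  # same value either way
print(len(calls))  # one call per statement, two in total
```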

Ben Halicki  Aug 18, 2022 
Printed Page 49
First block of code

The book refers to ‘top_prediction_pandas’ on line 2 – this should be ‘top_predictions_pandas’.

Ben Halicki  Aug 18, 2022 
Printed Page 51
Computing AUC

On page 51 (section Computing AUC), some source code appears to have been truncated (indicated by the ... in def area_under_curve). I checked the source repository for this code (https://github.com/sryza/aas) to see how it is implemented. However, the code in that repository is in Scala, not PySpark, so it is not much help. Is there a repository containing the PySpark code for this book?
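The truncated listing is not reproduced here. As a general illustration only, the following plain-Python sketch shows one common way to read AUC for a recommender: the fraction of (positive, negative) item pairs in which the positive item scores higher, averaged over users. The function names, the tie handling, and the sample scores are all assumptions made for this sketch, not the book's code.

```python
from itertools import product


def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = 0.0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            correct += 1.0
        elif pos == neg:
            correct += 0.5
    return correct / (len(positive_scores) * len(negative_scores))


def mean_auc(scores_by_user):
    """Average the per-user pairwise AUC over all users."""
    aucs = [pairwise_auc(pos, neg) for pos, neg in scores_by_user.values()]
    return sum(aucs) / len(aucs)


# Made-up scores: (scores of held-out positives, scores of sampled negatives).
example = {
    "user_a": ([0.9, 0.8], [0.1, 0.7]),  # all 4 pairs ranked correctly
    "user_b": ([0.6], [0.2, 0.9]),       # 1 of 2 pairs ranked correctly
}

print(mean_auc(example))
```

A distributed version would compute the same per-user ratio with joins and aggregations rather than Python loops.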

Ben Halicki  Aug 18, 2022 
PDF Page 51
Fourth-to-last line of code

Incorrect: all_artist_ids = all_data.select("artist").distinct().count()
Correct: all_artist_ids = all_data.select("artist").distinct()

The incorrect version returns only a number, but a DataFrame of all artist ids was probably intended.
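The difference can be shown with a plain-Python analogy, in which a set stands in for the DataFrame of distinct artist ids and `len` plays the role of `.count()`. The sample rows and variable names are made up for illustration.

```python
# Made-up (user, artist) rows standing in for all_data.
artist_plays = [("u1", 10), ("u2", 10), ("u1", 20), ("u3", 30)]

# Analogous to .select("artist").distinct(): a collection of ids.
distinct_artist_ids = {artist for _, artist in artist_plays}

# Analogous to .select("artist").distinct().count(): a single number.
artist_id_count = len(distinct_artist_ids)

print(sorted(distinct_artist_ids))
print(artist_id_count)
```

Code that later needs to iterate over or join against the artist ids needs the collection, not the count.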

Anonymous  Apr 13, 2023