Errata for Advanced Analytics with PySpark

The errata list records errors, and their corrections, that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version	Location	Description	Submitted by	Date submitted
O'Reilly learning platform	Chapter 4, Categorical Features Revisited, first block of code	The UDF defined in the code block is declared with a return type of StringType(), which raises an error when the unencoded training set is transformed by the VectorAssembler in the pipeline. The UDF should be declared to return IntegerType() instead, i.e. unhot_udf = udf(lambda v: v.toArray().tolist().index(1), IntegerType())	Philipp Spengler	Feb 21, 2024
Printed	Page 44 Building a first model	The book includes possibly redundant code: train_data = train_data = user_artist_df.join(broadcast(.... The doubled assignment train_data = train_data = user_artist_df could be simplified to train_data = user_artist_df.join...	Ben Halicki	Aug 18, 2022
Printed	Page 49 First block of code	The book refers to ‘top_prediction_pandas’ on line 2 – this should be ‘top_predictions_pandas’.	Ben Halicki	Aug 18, 2022
Printed	Page 51 Computing AUC	On page 51 (section Computing AUC), some source code appears to have been truncated (indicated by the ... in the definition of area_under_curve). I checked the source repository for this code (https://github.com/sryza/aas) to see how it is implemented, but the code in that repository is in Scala, not PySpark, so it is not really helpful. Is there a repository containing the PySpark code for this book?	Ben Halicki	Aug 18, 2022
PDF	Page 51 fourth-last line of code	Incorrect: all_artist_ids = all_data.select("artist").distinct().count() Correct: all_artist_ids = all_data.select("artist").distinct() The incorrect version returns only a number, but a DataFrame of all artist IDs was probably intended.	Anonymous	Apr 13, 2023
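
The Chapter 4 report above turns on what the UDF's lambda actually returns. The following is a minimal sketch in plain Python, without Spark, of the same lookup. The helper name unhot and the sample row are hypothetical and stand in for the value of v.toArray().tolist() on one row. It shows that the lookup yields a Python int, which is why declaring IntegerType() matches the value, whereas a column declared as a string type is not an input type that VectorAssembler accepts.

```python
from typing import Sequence


def unhot(values: Sequence[float]) -> int:
    """Return the position of the single 1 in a one-hot encoded sequence."""
    # list.index(1) also matches 1.0, because 1 == 1.0 in Python.
    return list(values).index(1)


# Hypothetical one-hot row standing in for v.toArray().tolist().
row = [0.0, 0.0, 1.0, 0.0]
category = unhot(row)

print(category, type(category).__name__)
```

In the pipeline itself, the same lambda would be wrapped with udf(..., IntegerType()) as the submitter suggests. This sketch only illustrates the Python-side return value, not Spark's behavior.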