Errata

Errata for Data Science on the Google Cloud Platform

The errata list records errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is shown in the column titled "Date corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version	Location	Description	Submitted By	Date submitted	Date corrected
	I 1st paragraph - Running the pipeline in the cloud	In the Chapter 4 section "Running the pipeline in the cloud", I ran the program df06.py with the specified arguments but received the following error with a stack trace: IOError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/monotonic.pyc' I had to install apache-beam[gcp] to get this to work (pip install --user apache-beam[gcp]). Note from the Author or Editor: I've now added an instruction to run ./simulate/install_packages.sh to README.md in the GitHub repo.	Anonymous	Mar 29, 2018	Oct 25, 2019
	I Running the Pipeline in the Cloud	In the section Running the Pipeline in the Cloud, it mentions DataflowPipelineRunner but I think this should be DataflowRunner. Also the output after running the program indicates it ran successfully but the simevents table is not created. Note from the Author or Editor: In Chapter 4, please replace occurrences of "DataflowPipelineRunner" with "DataflowRunner" (there are two places in the chapter)	Asish Patel	Mar 29, 2018	Oct 25, 2019
Printed	Page 61 right before Summary	I think your estimate of costs is misleadingly off by a few orders of magnitude. Because you need to use the flex instead of standard version of App Engine, it cannot currently scale down to zero, and you will be paying for at least one instance continuously (your manual scaling setting also says you will be running one instance at all times). Using at least one instance all the time costs a lot more than using it for 10 minutes a month. You can probably find many Internet discussions about this common billing surprise for the Flex environment. We all wish that flex could scale down to zero instances like standard, but so far as I know GCP cannot yet support it, and if it were supported you would need to change your scaling settings in order to get that behavior. Note from the Author or Editor: The reader is correct. Please change these sentences on page 61: ORIGINAL: Compute in the Flex environment costs about 6¢/hour and we’ll use no more than 10 minutes a month, so that’s less than a penny. Storage in a single-region bucket costs about 1¢/GB/month and that accumulates over time. Assuming that we work up to a storage of about 5 GB for this dataset, our total cost is perhaps 6¢/month — your actual costs will vary, of course, and could be higher or lower. TO: Compute in the Flex environment costs about 6¢/hour and while we’ll use no more than 10 minutes a month, the Flex environment currently doesn't scale down to zero instances. Hence, compute costs will run to about $45/month. Storage in a single-region bucket costs about 1¢/GB/month and that accumulates over time. Assuming that we work up to a storage of about 5 GB for this dataset, the storage cost is negligible. AppEngine Standard does scale down to zero instances, but runs in a sandbox that can not write to the local file system. If we rewrote our ingest to carry out all unzipping etc. in memory, we could get our cost down to a few cents a month.	Ed Barton	Feb 15, 2018	Oct 25, 2019
Printed	Page 104 2nd paragraph	Please change the sentence from: This runs the code in df01.py on the Google Cloud Platform using the Cloud Dataflow service. to: This runs the code in df01.py locally because the pipeline specifies DirectRunner. If we change that to DataflowRunner, then running df01.py will execute the code on the Google Cloud using the Cloud Dataflow service.	Valliappa Lakshmanan	Apr 03, 2018	Oct 25, 2019
Printed	Page 105 2nd paragraph - but primarily in file df02.py	I ran the "install_packages.sh" script. Then, when running df02.py from the command line in the GCS console, I got the following error: "... File "./df02.py", line 37, in <lambda> | beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26]))) File "./df02.py", line 22, in addtimezone import timezonefinder File "/usr/local/lib/python2.7/dist-packages/timezonefinder/__init__.py", line 2, in <module> from .timezonefinder import TimezoneFinder File "/usr/local/lib/python2.7/dist-packages/timezonefinder/timezonefinder.py", line 300 def closest_timezone_at(self, *, lat, lng, delta_degree=1, exact_computation=False, return_distances=False, ^ SyntaxError: invalid syntax [while running 'Map(<lambda at df02.py:37>)'] " The error seems to emanate from within the timezonefinder package.	Anonymous	Mar 23, 2019	Oct 25, 2019
Printed	Page 157 SQL	The WHERE clause is missing >=; it should read DEP_DELAY>=10. The code sample on GitHub is correct. Note from the Author or Editor: Please fix. The full line should read: WHERE DEP_DELAY >= 10 AND RAND() < 0.01	Michael Shearer	Mar 04, 2018	Oct 25, 2019
Printed	Page 202 4th para.	The helpful footnote 17 re: quotas on page 240 could be moved to page 202, where the cluster is first resized and quota limits are encountered. Note from the Author or Editor: On page 205, after the word "price" in this line: Let’s add machines to the cluster so that it has 20 workers, 15 of which are preemptible and so are heavily discounted in price: add this footnote: Trying to increase the number of workers might have you hitting against (soft) quotas on the maximum number of CPUs, drives, or addresses. If you hit any of these soft quotas, request an increase from the Google Cloud Platform console’s section on quotas: https://console.cloud.google.com/iam-admin/quotas	Michael Shearer	Mar 04, 2018	Oct 25, 2019
Printed	Page 218 2nd para	The text refers to s0 instead of x0. Note from the Author or Editor: "s0 could be departure delay" should be "x0 could be departure delay".	Michael Shearer	Mar 11, 2018	Oct 25, 2019
Printed	Page 240 last para.	Post Dataproc 1.2 HDFS Web Interface Port 50070 has been replaced by Port 9870. See https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces Note from the Author or Editor: On page 240, please change (50070) to (9870) and on page 241, please add to the figure caption this sentence: The screenshot shows port 50070 because at the time of publication, the HDFS port was 50070. You should now use port 9870.	Michael Shearer	Mar 11, 2018	Oct 25, 2019
Printed	Page 277 Maven launch	Maven args should include --fullDataset=true and --project= should be specified. Note from the Author or Editor: Please change the code block that starts "mvn compile" to: mvn compile exec:java \ -Dexec.mainClass=com.google.cloud.training.flights.CreateTrainingDataset \ -Dexec.args="--project=$PROJECT --bucket=$BUCKET --fullDataset=true --maxNumWorkers=$MAX_NUM_WORKERS --autoscalingAlgorithm=THROUGHPUT_BASED"	Michael Shearer	Mar 31, 2018	Oct 25, 2019
Printed	Page 344	The --project option needs to be specified when invoking simulate.py. Note from the Author or Editor: Please change: python simulate.py --startTime ... to: python simulate.py --project <PROJECT-ID> --startTime ...	Michael Shearer	Apr 21, 2018	Oct 25, 2019
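
The revised cost estimate in the page 61 erratum can be checked with a short calculation. The sketch below uses only the rates quoted in that entry (about 6¢/hour for Flex compute, about 1¢/GB/month for storage, about 5 GB stored). The figure of 730 hours per month and all function and constant names are assumptions made for illustration, and actual pricing may differ.

```python
# Rough check of the page 61 cost figures, using the rates quoted in the erratum.
FLEX_RATE_PER_HOUR = 0.06     # dollars; "about 6 cents/hour" from the erratum
STORAGE_RATE_PER_GB = 0.01    # dollars per GB per month, from the erratum
HOURS_PER_MONTH = 730         # assumption: average month (365 * 24 / 12)


def monthly_compute_cost(hours_billed, rate=FLEX_RATE_PER_HOUR):
    """Compute cost in dollars for the hours actually billed in a month."""
    return hours_billed * rate


def monthly_storage_cost(gigabytes, rate=STORAGE_RATE_PER_GB):
    """Storage cost in dollars for a month at the given size."""
    return gigabytes * rate


# One instance that never scales to zero is billed for every hour of the month.
always_on = monthly_compute_cost(HOURS_PER_MONTH)

# The original text assumed only the 10 minutes of actual use would be billed.
ten_minutes = monthly_compute_cost(10 / 60)

storage = monthly_storage_cost(5)

print("always-on compute:  $%.2f/month" % always_on)
print("10 minutes compute: $%.2f/month" % ten_minutes)
print("5 GB storage:       $%.2f/month" % storage)
```

Under these assumptions the always-on instance comes to roughly $44 a month, in line with the "about $45/month" in the corrected text, while storage stays at a few cents.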
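
The page 105 entry has no author note. One plausible reading of its traceback is that the installed timezonefinder release declares keyword-only parameters (the bare `*` in the signature), which is Python 3 syntax, while the path in the traceback shows a Python 2.7 interpreter. The sketch below illustrates that syntax only; `closest_timezone_sketch` is a hypothetical stand-in, not the timezonefinder API, and the diagnosis itself is an assumption rather than a confirmed fix.

```python
import sys


def closest_timezone_sketch(*, lat, lng, delta_degree=1):
    """Hypothetical stand-in whose signature uses keyword-only parameters.

    The bare `*` forces lat, lng and delta_degree to be passed by name.
    Python 3 accepts this definition; a Python 2.7 interpreter rejects the
    file at import time with "SyntaxError: invalid syntax".
    """
    return {"lat": lat, "lng": lng, "delta_degree": delta_degree}


def positional_call_fails():
    """Return True if calling with positional arguments raises TypeError."""
    try:
        closest_timezone_sketch(47.6, -122.3)
    except TypeError:
        return True
    return False


running_python3 = sys.version_info[0] >= 3
by_keyword = closest_timezone_sketch(lat=47.6, lng=-122.3)

print("Python 3 interpreter:", running_python3)
print("keyword call result:", by_keyword)
print("positional call rejected:", positional_call_fails())
```

If this reading is right, the error comes from the interpreter version rather than from df02.py itself, so checking which Python version runs the script is a reasonable first step.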