Errata for Data Science on the Google Cloud Platform

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version | Location | Description | Submitted By | Date submitted | Date corrected
1st paragraph - Running the pipeline in the cloud

In Chapter 4, in the section "Running the pipeline in the cloud", I ran the program df06.py with the specified arguments but received the following error with a stack trace:
IOError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/monotonic.pyc'

I had to install apache-beam[gcp] to get this to work:

pip install --user apache-beam[gcp]
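
To verify the installation (a quick check, not a step from the book), confirm that the package imports and prints its version:

python -c "import apache_beam; print(apache_beam.__version__)"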

Note from the Author or Editor:
I've now added an instruction to run ./simulate/install_packages.sh to the README.md in the GitHub repo.

Anonymous  Mar 29, 2018  Oct 25, 2019
Running the Pipeline in the Cloud

In the section "Running the Pipeline in the Cloud", the text mentions DataflowPipelineRunner, but I think this should be DataflowRunner. Also, the output after running the program indicates that it ran successfully, but the simevents table is not created.

Note from the Author or Editor:
In Chapter 4, please replace occurrences of "DataflowPipelineRunner" with "DataflowRunner" (there are two places in the chapter).
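
For readers updating their own copies of the code, here is a minimal sketch of where the runner name now appears; this is illustrative, not the book's exact code, and the project and bucket values are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 'DataflowRunner' replaces the obsolete 'DataflowPipelineRunner' name.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                 # placeholder project ID
    temp_location='gs://my-bucket/tmp/',  # placeholder staging location
)
with beam.Pipeline(options=options) as pipeline:
    ...  # pipeline transforms as written in the chapter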

Asish Patel  Mar 29, 2018  Oct 25, 2019
Printed
Page 61
right before Summary

I think your estimate of costs is misleadingly off by a few orders of magnitude. Because you need to use the Flex rather than the Standard version of App Engine, it cannot currently scale down to zero, and you will be paying for at least one instance continuously (your manual scaling setting also says you will be running one instance at all times). Using at least one instance all the time costs far more than using it for 10 minutes a month. You can find many Internet discussions about this common billing surprise with the Flex environment. We all wish that Flex could scale down to zero instances like Standard, but as far as I know GCP cannot yet support this, and even if it were supported, you would need to change your scaling settings to get that behavior.

Note from the Author or Editor:
The reader is correct. Please change these sentences on page 61:

ORIGINAL:
Compute in the Flex environment costs about 6¢/hour and we’ll use no more than 10 minutes a month, so that’s less than a penny. Storage in a single-region bucket costs about 1¢/GB/month and that accumulates over time. Assuming that we work up to a storage of about 5 GB for this dataset, our total cost is perhaps 6¢/month — your actual costs will vary, of course, and could be higher or lower.

TO:
Compute in the Flex environment costs about 6¢/hour, and while we’ll use no more than 10 minutes a month, the Flex environment currently doesn't scale down to zero instances. Hence, compute costs will run to about $45/month. Storage in a single-region bucket costs about 1¢/GB/month and that accumulates over time. Assuming that we work up to a storage of about 5 GB for this dataset, the storage cost is negligible. App Engine Standard does scale down to zero instances, but runs in a sandbox that cannot write to the local file system. If we rewrote our ingest to carry out all the unzipping, etc., in memory, we could get our cost down to a few cents a month.
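
For reference (not part of the replacement text), the revised compute figure follows directly from the hourly rate: one always-on Flex instance at $0.06/hour × 24 hours/day × 30 days ≈ $43/month, hence the roughly $45/month estimate.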

Ed Barton  Feb 15, 2018  Oct 25, 2019
Printed
Page 104
2nd paragraph

Please change the sentence from:
This runs the code in df01.py on the Google Cloud Platform using the Cloud Dataflow service.

to:

This runs the code in df01.py locally because the pipeline specifies DirectRunner. If we change that to DataflowRunner, then running df01.py will execute the code on the Google Cloud using the Cloud Dataflow service.
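
As an illustration (a sketch, not the book's exact line), the switch amounts to changing the runner name passed when the pipeline is constructed:

import apache_beam as beam

pipeline = beam.Pipeline('DirectRunner')      # executes locally
# pipeline = beam.Pipeline('DataflowRunner')  # executes on Cloud Dataflow;
# the Dataflow runner additionally needs --project, --temp_location, and so on.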

Valliappa Lakshmanan  Apr 03, 2018  Oct 25, 2019
Printed
Page 105
2nd paragraph - but primarily in file df02.py

I run the "install_packes.sh" script. Then when running df02.py from the command line in the GCS console I get the following error:


"...
File "./df02.py", line 37, in <lambda>
| beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
File "./df02.py", line 22, in addtimezone
import timezonefinder
File "/usr/local/lib/python2.7/dist-packages/timezonefinder/__init__.py", line 2, in <module>
from .timezonefinder import TimezoneFinder
File "/usr/local/lib/python2.7/dist-packages/timezonefinder/timezonefinder.py", line 300
def closest_timezone_at(self, *, lat, lng, delta_degree=1, exact_computation=False, return_distances=False,
^
SyntaxError: invalid syntax [while running 'Map(<lambda at df02.py:37>)']
"

The error emanates from the timezonefinder package: the bare * in the def on line 300 introduces keyword-only arguments, which is Python 3-only syntax, so the installed timezonefinder release no longer supports the Python 2.7 environment the pipeline runs in.
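
A possible workaround (not an official fix from the book): either run the pipeline under Python 3, or pin a timezonefinder release that still supports Python 2.7; check the package's changelog on PyPI for the last such version:

pip install --user "timezonefinder==<last-python2-compatible-version>"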

Anonymous  Mar 23, 2019  Oct 25, 2019
Printed
Page 157
SQL

Missing >= in WHERE clause: DEP_DELAY>=10. Code sample in github is correct.

Note from the Author or Editor:
Please fix. The full line should read:

WHERE DEP_DELAY >= 10 AND RAND() < 0.01

Michael Shearer  Mar 04, 2018  Oct 25, 2019
Printed
Page 202
4th para.

The helpful footnote 17 about quotas on page 240 could be moved to page 202, where the cluster is first resized and quota limits are first encountered.

Note from the Author or Editor:
On page 205, after the word "price" in this line:

Let’s add machines to the cluster so that it has 20 workers, 15 of which are preemptible and so are heavily discounted in price:

Add this footnote:
Trying to increase the number of workers might have you hitting against (soft) quotas on the maximum number of CPUs, drives, or addresses. If you hit any of these soft quotas, request an increase from the Google Cloud Platform console’s section on quotas: https://console.cloud.google.com/iam-admin/quotas
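
As context for readers (not part of the requested footnote), resizing an existing Dataproc cluster to that shape can be done with a command along these lines; the cluster name here is a placeholder, not necessarily the one used in the book:

gcloud dataproc clusters update my-cluster \
    --num-workers=5 --num-preemptible-workers=15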

Michael Shearer  Mar 04, 2018  Oct 25, 2019
Printed
Page 218
2nd para

The text refers to s0 when it should be x0.

Note from the Author or Editor:
"s0 could be departure delay"

should be

"x0 could be departure delay,"

Michael Shearer  Mar 11, 2018  Oct 25, 2019
Printed
Page 240
last para.

After Dataproc 1.2, the HDFS web interface port 50070 has been replaced by port 9870. See https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces

Note from the Author or Editor:
On page 240, please change
(50070)
to
(9870)

and on page 241, please add to the figure caption this sentence:
The screenshot shows port 50070 because at the time of publication, the HDFS port was 50070. You should now use port 9870.

Michael Shearer  Mar 11, 2018  Oct 25, 2019
Printed
Page 277
Maven launch

The Maven args should include --fullDataset=true, and --project= should be specified.

Note from the Author or Editor:
Please change the code block that starts "mvn compile" to:

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.training.flights.CreateTrainingDataset \
-Dexec.args="--project=$PROJECT --bucket=$BUCKET --fullDataset=true --maxNumWorkers=$MAX_NUM_WORKERS --autoscalingAlgorithm=THROUGHPUT_BASED"

Michael Shearer  Mar 31, 2018  Oct 25, 2019
Printed
Page 344

You need to specify --project when invoking simulate.py.

Note from the Author or Editor:
Please change:
python simulate.py --startTime ...

to

python simulate.py --project <PROJECT-ID> --startTime ...

Michael Shearer  Apr 21, 2018  Oct 25, 2019