Errata for Hadoop: The Definitive Guide

The errata list records errors, and their corrections, found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Page 37
3rd paragraph (box note)

The box note says:

"In this case, you need to implement the close() method so that
you know when the last record has been read, so you can finish processing
the last group of lines."

org.apache.hadoop.mapreduce.Mapper doesn't have a close() method to override, so implementing a close() method will not help, since it is never called.

However, it has an empty cleanup() method that can be overridden for exactly this purpose.
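
A minimal sketch of the suggested approach (class and field names invented for illustration; the logic for emitting completed groups in map() is elided):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineGroupMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final StringBuilder group = new StringBuilder();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Accumulate lines; emit and reset 'group' whenever a complete
        // group of lines has been seen (logic omitted).
        group.append(value.toString()).append('\n');
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // The framework calls cleanup() once after the last record has
        // been read, so the final group can be finished here.
        if (group.length() > 0) {
            context.write(new Text(group.toString()), NullWritable.get());
        }
    }
}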

Note from the Author or Editor:
Change "In this case, you need to implement the close() method" to "In this case, you need to implement the cleanup() method".

This is on p38 in the 4th ed print edition.

Anonymous  Mar 04, 2015  Apr 17, 2015
Printed, PDF
Page 118
6th paragraph, first line (two lines before "The specific API" subsection).

The text says, "the objects returned by result.get("left") and result.get("left") are of type Utf8, so we can convert them into Java String objects by calling their toString() methods."

We are discussing the objects returned by result.get on "left" and "right", not "left" and "left".

The text should read, "the objects returned by result.get("left") and result.get("right") are of type Utf8, so we can convert them into Java String objects by calling their toString() methods."
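
For reference, a minimal sketch of the access pattern the sentence describes, using Avro's GenericRecord (the Pair schema here is invented for illustration):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;

public class Utf8FieldExample {
    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
            + "{\"name\":\"left\",\"type\":\"string\"},"
            + "{\"name\":\"right\",\"type\":\"string\"}]}");
        GenericRecord result = new GenericData.Record(schema);
        result.put("left", new Utf8("L"));
        result.put("right", new Utf8("R"));
        // Both fields come back as Avro Utf8 objects, so convert them
        // to java.lang.String by calling toString().
        String left = result.get("left").toString();
        String right = result.get("right").toString();
        System.out.println(left + ", " + right);
    }
}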

Note from the Author or Editor:
Change the second result.get("left") to result.get("right").

This appears on p351 in the 4th edition.

Myles Baker  Jan 21, 2015  Apr 17, 2015
ePub
Page N/A (ebook)
N/A (ebook)

From Chapter 1, "A Brief History of Hadoop"

“Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year. All the major Nutch algorithms had been ported to run using MapReduce and NDFS.”

I think this was intended to say (note punctuation after "middle of that year"):

“Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS.”

Trevor Harmon  Oct 21, 2014 
PDF
Page 299
Last paragraph

In describing network topology, reference is made to switches and routers connecting machines on a rack. "GB" is used to refer to gigabit when it should probably be written as "Gb."

Note from the Author or Editor:
Change "with a 1 GB switch" to "with a 1 Gb switch"; and "normally 1 GB or better" to "normally 1 Gb or better".

Dima Spivak  Apr 14, 2014 
PDF
Page 65
1st paragraph

A tiny typo: the last sentence of the first paragraph (under the heading "File patterns") reads: "Hadoop provides two FileSystem method for processing globs..." The word "method" should be pluralized (i.e. "methods").
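
For reference, the two methods in question are the two overloads of FileSystem.globStatus: globStatus(Path) and globStatus(Path, PathFilter). A minimal sketch with an invented glob pattern:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Matches, e.g., /data/1901 and /data/1902 (pattern invented).
        FileStatus[] matches = fs.globStatus(new Path("/data/19*"));
        for (FileStatus status : matches) {
            System.out.println(status.getPath());
        }
    }
}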

Dima Spivak  Apr 11, 2014 
Printed
Page 134

Reading about reader.sync and reader.getPosition led me to believe that the example output shown on page 134 would work regardless of the compression used: the positions increment in an orderly fashion as you iterate through the rows. With compression, however, the read position apparently stays the same for all the records in the decompressed block. In other words, for two adjacent records, reader.getPosition yields the same value.

Of course this makes sense when you think about how the compressed formats work, but the book is obviously for folks who haven't fully wrapped their minds around Hadoop.

Just a suggestion to note the different behavior when compression is used.
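
A minimal sketch that reproduces the observation, assuming a block-compressed SequenceFile of IntWritable/Text pairs like the book's numbers.seq example: with block compression, getPosition() reports the same value for every record read from the same compressed block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PositionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("numbers.seq"); // hypothetical file
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                // With block compression, adjacent records often report
                // the same position, unlike the uncompressed output on p134.
                System.out.printf("[%d]\t%s\t%s%n",
                        reader.getPosition(), key, value);
            }
        } finally {
            reader.close();
        }
    }
}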

David Larsen  Dec 06, 2013 
PDF
Page 207
1st paragraph

The paragraph says

"However, with the FIFO scheduler, priorities do not support preemption, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled."

This is not true, as confirmed by Tom himself. A high-priority job will actually block any lower-prioritized jobs when submitted onto a busy cluster.

Eric Sammer's book describes the correct behaviour:

"The FIFO scheduler supports five levels of job prioritization, from lowest to highest: very low, low, normal, high, very high. Each priority is actually implemented as a sep- arate FIFO queue. All tasks from higher priority queues are processed before lower priority queues and as described earlier, tasks are scheduled in the order of their jobs� submission. The easiest way to visualize prioritized FIFO scheduling is to think of it as five FIFO queues ordered top to bottom by priority. Tasks are then scheduled left to right, top to bottom. This means that all very high priority tasks are processed before any high priority tasks, which are processed before any normal priority tasks, and so on."

Note from the Author or Editor:
This is indeed incorrect. I have reworked and added to the material on schedulers for the fourth edition to cover scheduling in YARN. The FIFO scheduler in YARN doesn't support priorities, so the statement I wrote is actually correct for MR2. In MR1 however, the FIFO scheduler does support priorities, and the behaviour described by Eric is correct.
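
For context, a minimal sketch of setting a job's priority with the MR1-era API the note refers to; in MR1's FIFO scheduler, tasks of higher-priority jobs are scheduled before those of lower-priority jobs:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PriorityExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // One of five levels: VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH.
        conf.setJobPriority(JobPriority.VERY_HIGH);
        System.out.println(conf.getJobPriority());
    }
}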

Kai Voigt  Oct 16, 2013 
Printed
Page 168
paragraph

"Farther down the page" should say "Further down the page"

Note from the Author or Editor:
BTW this section has been re-written for the fourth edition to cover the new web flow for YARN, and the phrase "Further down the page" no longer appears.

Tulio Domingos  May 22, 2013 
Printed
Page 57
paragraph

"Sometimes it is possible to set a URLStreamHandlerFactory". It should say "Sometimes it is NOT possible"

Note from the Author or Editor:
Changed to "Sometimes it is impossible to set a URLStreamHandlerFactory"

Tulio Domingos  May 22, 2013 
ePub
Page 113
Bottom paragraph, first sentence

"Hadoop cannot divine" should be "Hadoop cannot define"

Note from the Author or Editor:
Changed to "Hadoop cannot magically discover" in the fourth edition.

Robert A. Wlodarczyk  Feb 18, 2013 
Printed
Page 304
Footnote (3)

Within footnote 3 on page 304 the sentence:
"See its main page for instructions on how to start ssh-agent"
should probably say "man" instead of "main", e.g.:
"See its man page for instructions on how to start ssh-agent"

Note from the Author or Editor:
Changed to "See its man page for instructions on how to start ssh-agent."

Vijay  Dec 24, 2012 
Printed
Page 309
3rd paragraph

Missed the y on directory.

The log director ---> The log directory

Ryan Tabora  Dec 04, 2012 
Page 625
3rd paragraph

All filenames in the sample listing in Appendix C pg 625 are the same.

% ls -l 1901 | head
011990-99999-1950.gz
011990-99999-1950.gz
...
011990-99999-1950.gz

The sample listing on pg 18 appears correct.

Note from the Author or Editor:
Changed to
% ls 1901 | head
029070-99999-1901.gz
029500-99999-1901.gz
029600-99999-1901.gz
029720-99999-1901.gz
029810-99999-1901.gz
227070-99999-1901.gz

Anonymous  Nov 13, 2012 
Printed
Page 38
2nd and 3rd % command

The glob pattern in
$HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar

doesn't match the 1.0.3 file naming
hadoop-streaming-1.0.3.jar

perhaps
hadoop*streaming*.jar
to match both old and new file naming

Note from the Author or Editor:
For Hadoop 2 the correct form is "hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar". The fourth edition has been updated to reflect this change.

Mark Anderson  Aug 17, 2012 
PDF
Page 5
3rd paragraph, first sentence

"In many ways, MapReduce can be seen as a complement to a Rational Database Management System (RDBMS)."

I believe that the word "Rational" should be "Relational."

Kathleen

Note from the Author or Editor:
I think this was introduced during copy-editing. Fixed in the next edition (4th).

lenni  Jul 23, 2012 
PDF
Page 79
3rd Paragraph

Minor sentence construction problem:

The tool runs a MapReduce job to process the input files in parallel, so to run it, you need a MapReduce cluster running to use it.

Note from the Author or Editor:
This section has been removed in the fourth edition.

Anonymous  May 15, 2012 
PDF
Page 39
1st paragraph

In this paragraph, discussing the Ruby reducer in Hadoop Streaming, we have;

"In this case, the keys are weather station identifiers, ...".

However as in the Java example the keys are the years.

At least that is how I read the example...

SAStanley  May 04, 2012 
PDF
Page 616
5th paragraph

You may need to run ssh-add for this to work properly, after cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys and before ssh localhost.
Otherwise you may receive an "Agent admitted failure to sign using the key" error.

Note from the Author or Editor:
Add the following paragraph after the "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys" line:

"You may also need to run ssh-add if you are running ssh-agent."

Anonymous  Apr 30, 2012 
PDF
Page 86
footer

1. For a comprehensive set of compression benchmarks, https://github.com/ning/jvm-compressor-benchmark is a good reference for JMV-compatible libraries (includes some native libraries). For command-line tools, see Jeff Gilchrist's Archive Comparison Test at http://compression.ca/act/act-summary.html.

It says "JMV-compatible", should be "JVM"

Emīls Šolmanis  Mar 13, 2012